Part Ⅳ
20 Tries
20.1 R-Way Tries
Tries (from retrieval, but pronounced "try"): Store characters in nodes (not keys), each node has
Trie Search
Follow links corresponding to each character in the key.
Search hit: node where search ends has a non-null value.
Search miss: reach null link or node where search ends has null value.
Trie Delete
Find the node corresponding to key and set value to null.
If node has null value and all null links, remove that node (and recur).
20.2 Ternary Search Tries
Ternary Search Trees
Store characters and values in nodes (not keys).
Each node has 3 children: smaller (left), equal (middle), larger (right).
TST Search
Follow links corresponding to each character in the key.
If less, take left link; if greater, take right link.
If equal, take the middle link and move to the next key character.
Search hit & miss:
Search hit: Node where search ends has a non-null value.
Search miss: Reach null link or node where search ends has null value.
TST with
Do
-way branching at root. Each of
root nodes points to a TST.
TST vs. Hashing
TSTs | Hashing |
---|---|
Works only for strings (or digital keys) | Need to examine entire key |
Only examines just enough key characters | Search hits and misses cost about the same |
Search miss may involve only a few characters | Performance relies on hash function |
Supports ordered symbol table operations (plus others!) | Does not support ordered symbol table operations |
Implementation | Character Accesses (typical case) | |||
Search Hit | Search Miss | Insert | Space (references) | |
21 Substring Search
21.1 Introduction
Goal: Find pattern of length
Applications:
Find & replace
Computer forensics
Identify patterns indicative of spam
Electronic surveillance
Screen scraping
21.2 Brute-Force Substring Search
Disadvantages
Theoretical challenge: Linear -time guarantee (Worst case:
). Practical challenge: Avoid backup in text stream (Brute-force algorithm needs backup for every mismatch).
21.3 Knuth-Morris-Pratt
21.3.1 Proposition
Property: KMP substring search accesses no more than
Proof: Each pattern char accessed once when constructing DFA; each text char accessed once (in the worst case) when simulating DFA.
Property: KMP constructs dfa[][]
in time and space proportional to
21.3.2 DFA
Deterministic Finite State Automaton (DFA) is an abstract string-search machine.
Finite number of states (including start and halt).
Exactly one transition for each char in alphabet.
Accept if sequence of transitions lead to halt state.
DFA Construction
If in state
(first characters of pattern have already been matched and next char c == pat.charAt(j)
(next char matches), go to(now first characters of pattern have been matched). If in state
and next char c != pat.charAt(j)
, then the lastcharacters of input are pat[1...j - 1]
, followed by c. Simulatepat[1...j - 1]
on DFA and take transition c (only longest possible matched suffix now liespat[1...j - 1]
followed by c).
DFA Construction for Code
Copy
dfa[][X]
todfa[][j]
for mismatch case.Set
dfa[pat.charAt(j)][j]
tofor match case. Update
.
21.3.3 NFA
NFA Construction
Use pointer j for comparison.
If i = 0, next[i] = -1.
If pat[i] != pat[j], it means current state j is possible.
If pat[i] == pat[j], it means current state j is impossible, roll back.
21.4 Boyer-Moore
Boyer-Moore
Scan characters in pattern from right to left.
Can skip as many as
text chars when finding one not in the pattern.
How much to skip?
Mismatch character not in pattern.
Mismatch character in pattern.
Mismatch character in pattern (but heuristic no help).
Property: Substring search with the Boyer-Moore mismatched character heuristic takes about
Worst Case: Can be as bad as
Boyer-Moore variant: Can improve worst case to
21.5 Rabin-Karp
Rabin-Karp (Modular Hashing)
Compute a hash of pattern characters
to . For each
, compute a hash of text characters to . If pattern hash = text substring hash, check for a match.
Modular Hashing Function: Using the notation txt.charAt(i)
, we wish to compute:
M-digit, base-R integer, modulo Q.
Based on the function above, we can get:
Cost of searching for an
Algorithm | Version | Operation Count | Backup in Input? | Correct? | Extra Space | |
Guarantee | Typical | |||||
- | yes | yes | ||||
full DFA | no | yes | ||||
mismatch transitions only | no | yes | ||||
full algorithm | yes | yes | ||||
mismatched char heuristic only | yes | yes | ||||
Rabin-Karp* | Monte Carlo | no | yes* | |||
Las Vegas | yes | yes |
*: probabilisitic guarantee, with uniform hash function
22 Regular Expressions
22.1 Regular Expressions
Pattern Searching: Find one of a specified set of strings in text.
Applications
Genomics: test for certain pattrn of base sequence
Syntax highlighting
Google code search
Scan for virus signatures
Process natural language
Specify a programming language
Access information in digital libraries
Search genome using PROSITE patterns
Filter text (spam, NetNanny, Carnivore, malware)
Validate data-entry fields (dates, email, URL, credit card)
Compile a Java program
Crawl and index the Web
Read in data stored in ad hoc input file format
Create Java documentation from Javadoc comments
...
Regular Expressions: A notation to specify a set of strings.
Operation | Order | Example RE | Matches | Does not Match |
---|---|---|---|---|
Concatenaion | 3 | AABAAB | AABAAB | every other string |
Or | 4 | AA|BAAB | AA BAAB | every other string |
Closure | 2 | AB*A | AA ABA ABBA ABBBBBBBBA | AB ABABA |
Parenthesis | 1 | A(A|B)AAB | AAAAB ABAAB | every other string |
(AB)*A | A ABABABABABA | AA ABBA |
Shortcuts
Operation | Example RE | Matches | Does not Match |
---|---|---|---|
Wildcard | .U.U.U. | CUMULUS JUGULUM | SUCCUBUS TUMULTUOUS |
Character Class | [A-Za-z][a-z]* | Word Capitalized | camelCase 4illegal |
At Least 1 | A(BC)+DE | ABCDE ABCBCDE | ADE BCDE |
Exactly k | [0-9]{5}-[0-9]{4} | 08540-1321 19072-5541 | 111111111 166-54-111 |
Examples
Regular Expression | Matches | Does not Match |
---|---|---|
.*SPB.* (substring search) | RASPBERRY CRISPBREAD | SUBSPACE SUBSPECIES |
[0-9]{3}-[0-9]{2}-[0-9]{4} (U.S. Social Security numbers) | 166-11-4433 166-45-1111 | 11-55555555 8675309 |
[a-z]+@([a-z]+\.)+(edu|com) (simplified email addresses) | wayne@princeton.edu rs@princeton.edu | spam@nowhere |
[$_A-Za-z][$_A-Za-z0-9]* (Java identifiers) | ident3 PatternMatcher | 3a ident#3 |
Caveat
Writing a RE is like writing a program.
Need to understand programming model.
Can be easier to write than read.
Can be difficult to debug.
22.2 REs and NFAs
Kleene's theorem
For any DFA, there exists a RE that describes the same set of strings.
For any RE, there exists a DFA that recognizes the same set of strings.
Regular-expression-matching NFA:
We assume RE enclosed in parentheses.
One state per RE character (start = 0, accept = M).
Red ε-transition (change state, but don't scan text).
Black match transition (change state and scan to next text char).
Accept if any sequence of transitions ends in accept state after scanning all text characters.
NFA Simulation
Check whether input matches all possible states that NFA could be.
Reject otherwise.
Construction
Parenthesis: Add ε-transition edge from parentheses to next state.
Closure: Add three ε-transition edges for each * operator.
Or: Add two ε-transition edges for each | operator.
NFA Construction
Left parenthesis
Add ε-transition to next state.
Push index of state corresponding to ( onto stack.
Alphabet symbol
Add match transition to next state.
Do one-character lookahead: add ε-transition if next character is *.
Or symbol
Push index of state corresponding to | onto stack.
Right parenthesis
Add ε-transition to next state.
Pop correponding ( and possibly intervening |; add ε-transition edges for or.
Do one-character lookahead: add ε-transition if next character is *.
Property: Determining whether an
Proof: For each of the
Property: Building the NFA corresponding to an
Proof: For each of the
Directed DFS Implementation
NFA Implementation
23 Data Compression
23.1 Data Compression Introduction
Application
Generic file compression
Files: GZIP, BZIP, 7z
Archivers: PKZIP
File systems: NTFS, HFS+, ZFS
Multimedia
Images: GIF, JPEG
Sound: MP3
Video: MPEG, DivX™, HDTV
Communication
ITU-T T4 Group 3 Fax
V.42bis modem
Skype
23.2 Run-Length Coding
Example
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
4-bit counts to represent alternating runs of 0s and 1s: 15 0s, then 7 1s, then 7 0s, then 11 1s.
Applications: JPEG, ITU-T T4 Group 3 Fax, ...
23.3 Huffman Coding
Inorder to produce prefix-free code, we need to ensure that no codeword is a prefix of another.
23.4 LZW Coding
24 Reductions
24.1 Introduction
Reduction: Problem
Cost of solving
Example 1: Finding the median reduces to sorting
Find the media of N items
Sort
items. Return item in the middle.
Cost of solving this problem:
Example 2: Element distinctness reduces to sorting
Element distinctness on N items
Sort
items. Check adjacent pairs for equality.
Cost of solving this problem:
24.2 Designing Algorithms
Examples
3-collinear reduces to sorting.
Finding the median reduces to sorting.
Element distinctness reduces to sorting.
CPM reduces to topological sort.
Arbitrage reduces to shortest paths.
Burrows-Wheeler transform reduces to suffix sort.
For more examples on algorithm designing using reductions, please visit convex hull or shortest path.
24.3 Establishing Lower Bounds
Very difficult to establish lower bounds from scratch => Use reductions
Definition: Problem
Linear number of standard computational steps.
Constant number of calls to
.
Property: In quadratic decision tree model, any algorithm for sorting
Proof
Sorting instance:
, , ..., Convex hull instance:
, , ..., Region
is convex => all points are on hull. Starting at point with most negative
, counterclockwise order of hull points yields integers in ascending order.
24.4 Classifying Problems
Goal: Classyify problem with algorithm that matches lower bound.
Integer Arithmetic Reductions
Goal: Given two
Problem | Arithmetic | Order of Growth |
---|---|---|
Integer Multiplication | M(N) | |
Integer Division | M(N) | |
Integer Square | M(N) | |
Integer Square Root | M(N) |
Integer arithmetic problems with the same complexity as integer multiplication.
Matrix Multiplication
Goal: Given two
Problem | Linear Algebra | Order of Growth |
---|---|---|
Matrix Multiplication | MM(N) | |
Matrix Inversion | MM(N) | |
Determinant | MM(N) | |
System of Linear Equations | MM(N) | |
LU Decomposition | MM(N) | |
Least Squares | min | MM(N) |
Numerical linear algebra problems with the same complexity as matrix multiplication.
25 Linear Programming
Linear Programming: A problem-solving model for optimal allocation of scarce resources.
Agriculture: Diet problem
Computer science: Compiler register allocation, data mining
Electrical engineering: VLSI design, optimal clocking
Energy: Blending petroleum products
Economics: Equilibrium theory, two-person zero-sum games
Environment: Water quality management
Finance: Portfolio optimization
Logistics: Supply-chain management
Management: Hotel yield management
Marketing: Direct mail advertising
Manufacturing: Production line balancing, cutting stock
Medicine: Radioactive seed placement in cancer treatment
Operations research: Airline crew assignment, vehicle routing
Physics: Ground states of 3-D Ising spin glasses
Telecommunication: Network design, Internet routing
Sports: Scheduling ACC basketball, handicapping horse races
25.1 Brewer's Problem
Brewer's problem: Small brewery produces ale and beer, Production limited by scarce resources: corn, hops, barley malt.
$13 profit per barrel: 5 pounds corn, 4 ounces hops, 35 pounds halt
$23 profit per barrel: 15 pounds corn, 4 ounces hops, 20 pounds halt
corn: 480 lbs
hops: 160 oz
malt: 1190 lbs
Now we have:
Standard form linear program
Extreme point property: If there exists an optimal solution to (P), then there exists one that is an extreme point.
25.2 Simplex Algorithm
Generic Implementation
Start at some extreme point.
Pivot from one extreme point to an adjacent one.
Repeat until optimal.
Simplex Algorithm - Basic feasible solution
Set
nonbasic variables to , solve for remaining variables. Solve
equations in unknowns. If unique and feasible => BFS
BFS <=> extreme point
Recall:
Initialize:
Pivot 1: Substitute y = (1/15) (480 – 5A – SC) and add y into the basis (rewrite 2nd equation, eliminate y in 1st, 3rd, and 4th equations)
basis = {B, S H, S M}
x = S C = 0
Z = 736
y = 32
S H = 32
S M = 550
Q: Why pivot on column 2 (corresponding to variable
)? A:
Its objective function coefficient is positive.
(each unit increase in
from increases objective value by ) Pivoting on column 1 (corresponding to
) also OK.
Q: Why pivot on row 2?
A:
Preserves feasibility by ensuring
. Minimum ratio rule: min {
, , }
Pivot 2: Substitute x = (3/8) (32 + (4/15) S C – S H) and add x into the basis (rewrite 3rd equation, eliminate x in 1st, 2nd, and 4th equations)
basis = { x, y, S M}
S C = S H = 0
Z = 800
y = 28
x = 12
S M = 110
Q: When to stop pivoting?
A: When no objective function coefficient is positive.
Q: Why is resulting solution optimal?
A: Any feasible solution satisfies current system of equations.
In particular:
Thus, optimal objective value
since . Current BFS has value
=> optimal.
25.3 Implementation
Bland's Rule: Find entering column
Min-ratio Rule: Find leaving row
Pivot: Pivot on element row
Property: In typical practical applications, simplex algorithm terminates after at most
30 Catalan Number
30.1 Properties and Formulas
30.2 Applications
It is the number of expressions containing
pairs of parentheses which are correctly matched. For
, for example: ((())), (()()), (())(), ()(()), ()()().
It is the number of different ways
factors can be completely parenthesized (or the number of ways of associating applications of a binary operator, as in the matrix chain multiplication problem). For
, for example: ((ab)c)d, (a(bc))d, (ab)(cd), a((bc)d), a(b(cd)).
It is the number of full binary trees with
leaves , or, equivalently, with a total of internal nodes. For
, for example: It is the number of structurally unique BSTs (binary search trees) which has exactly
nodes of unique values from to . For
, for example: It is the number of Dyck words of length
. A Dyck word is a string consisting of X's and Y's such that no initial segment of the string has more Y's than X's. For example, Dyck words for
: XXXYYY XYXXYY XYXYXY XXYYXY XXYXYY
It is the number of monotonic lattice paths along the edges of a grid with
square cells, which do not pass above the diagonal. A convex polygon with
sides can be cut into triangles by connecting vertices with non-crossing line segments (a form of polygon triangulation). The number of triangles formed is and it is the number of different ways that this can be achieved.