- Understand why naive string matching is O((n-m+1) × m) and why that's too slow
- Recognize how better algorithms reuse information instead of throwing it away
- Learn the three main approaches: KMP, Rabin-Karp, and Boyer-Moore
- See where string algorithms matter: code editors, DNA analysis, search engines
Introduction to String Algorithms
Here's a problem you've probably encountered: you hit Ctrl+F in a document and search for a word. How does the computer find it? The naive way checks every position—start at the beginning, compare characters one by one, move forward, repeat. For a document with 100,000 characters and a 10-character search term, that could mean a million character comparisons.
String algorithms make this faster. Much faster.
The Pattern Matching Problem
You have a text (like a book) and a pattern (like a word you're searching for). Find where the pattern appears in the text.
For example, find "bad" in "not bad at all":
Text: n o t b a d a t a l l
Pattern: b a d
Check position 0: n o t ≠ b a d
Check position 1: o t ≠ b a d
...
Check position 4: b a d = b a d ✓
The naive approach: start at every position in the text, compare the pattern character by character.
def naive_search(text, pattern):
n = len(text)
m = len(pattern)
for i in range(n - m + 1):
match = True
for j in range(m):
if text[i + j] != pattern[j]:
match = False
break
if match:
return i # Found at position i
return -1 # Not found
For text length n and pattern length m, this does O((n-m+1) × m) comparisons in the worst case.
Let's trace through searching for "AAA" in "AAAAA":
Position 0: A A A vs A A A ✓ Found!
Looks fast. But worst case, like "AAB" in "AAAAAAAB":
Position 0: A A A vs A A B ✗ (checked 3 chars)
Position 1: A A A vs A A B ✗ (checked 3 chars)
Position 2: A A A vs A A B ✗ (checked 3 chars)
Position 3: A A A vs A A B ✗ (checked 3 chars)
Position 4: A A A vs A A B ✗ (checked 3 chars)
Position 5: A A B vs A A B ✓ (checked 3 chars)
For n=1,000,000 and m=1000, that's potentially a billion comparisons.
Can We Do Better?
The key insight: when we find a mismatch, we're throwing away information. We know what characters we just compared. Can we use that?
Think about searching for "AAB" in "AAAAAAB". When we mismatch at position 2 (expecting 'B', found 'A'), we know the next two characters are 'A' too (because we just checked them). We could skip ahead instead of rechecking.
That's what better algorithms do. They remember information about the pattern or use clever tricks to skip unnecessary comparisons.
The Main Algorithms
KMP (Knuth-Morris-Pratt): Builds a "prefix table" that tells us where to restart the pattern after a mismatch. Never backtracks in the text. O(n+m) time.
Rabin-Karp: Uses hashing. Converts the pattern to a number, then checks if any substring of the text hashes to the same value. Average O(n+m), but can degenerate to O(nm) with collisions.
Boyer-Moore: Scans the pattern from right to left and uses two rules to skip positions. Often the fastest in practice for long patterns.
String Structures
Beyond searching, we need structures for storing and querying many strings efficiently.
Trie: A tree where each path from root to leaf represents a string. Perfect for autocomplete—type "alg" and it quickly finds all words starting with "alg".
Suffix Tree/Array: Stores all suffixes of a string in a searchable structure. Used for finding repeated substrings, longest common substring, and other advanced queries.
Where This Matters
Your code editor's find function uses these algorithms. When you search across thousands of files, naive search would be painfully slow.
Google searches billions of web pages. They use highly optimized string matching (along with indexing) to return results in milliseconds.
Bioinformatics compares DNA sequences (strings of A, T, C, G). The human genome is 3 billion base pairs. Efficient string algorithms make genetic analysis feasible.
Plagiarism detection compares documents to find copied text. This requires finding common substrings efficiently.
What Makes a String Algorithm Good?
Time complexity: How does runtime grow with input size? O(n) is ideal, O(n log n) is acceptable, O(n²) is often too slow.
Space complexity: Some algorithms trade space for speed. A prefix table uses O(m) extra space but makes searching faster.
Practical performance: Big-O notation ignores constants. In practice, an O(n log n) algorithm with small constants might beat an O(n) algorithm with large ones.
Simplicity: Simpler algorithms are easier to implement correctly and maintain. Sometimes naive is good enough.
The Pattern
Most string algorithms exploit structure or remember information:
Structure exploitation: Boyer-Moore uses the pattern's structure to skip positions. If we're looking for "PATTERN" and the text has 'Z', we can skip the entire pattern length because 'Z' doesn't appear in it.
Information reuse: KMP remembers what matched before the mismatch. Instead of starting over, it jumps to the right position in the pattern.
Preprocessing: Many algorithms spend time preprocessing the pattern to make searching faster. Build a data structure once, search many times.
The beauty of string algorithms is they make the impossible practical. Searching through gigabytes of text in seconds. Finding patterns in DNA. Making your code editor responsive.
And it all starts with recognizing that the naive approach throws away too much information.
Web forms use string algorithms to validate email addresses, phone numbers, and other structured text.
Getting Started
String algorithms build upon fundamental computer science concepts like arrays, hashing, and dynamic programming. Understanding these algorithms will give you powerful tools for text processing and pattern recognition.
In the next lessons, we'll explore specific algorithms that solve these problems efficiently, starting with the KMP algorithm for pattern matching.
