- Understand how KMP builds the prefix table to avoid redundant comparisons
- See why KMP is O(n+m) while naive is O((n-m+1)×m)
- Learn when to use KMP (guaranteed linear time, streaming data)
KMP Algorithm
The naive string search rechecks characters you already know. That's the waste. KMP (Knuth-Morris-Pratt) fixes it.
Here's the problem naive has: you're searching for "AABA" in "AABAACAABAA".
Text: A A B A A C A A B A A
Pattern: A A B A
Position 0:
A==A, A==A, B==B, A==A → Found!
Position 1 (naive restarts here):
A!=A (text[1] vs pattern[0])
Position 2:
B!=A
Position 3:
A==A, A!=A (text[4] vs pattern[1])
At position 3, you check A==A, then A!=A (because text[4]='A' but pattern[1]='A'... wait, that matches). Let me trace more carefully:
Position 0: A A B A matches → Found at 0!
Position 1: A!=A (we're comparing text[1]='A' with pattern[0]='A', they match)
Actually: A==A, A!=B (text[2]='B', pattern[1]='A')
Position 2: B!=A
Position 3: A==A, A!=A (text[4]='A', pattern[1]='A', they match)
A==A, A==A, C!=B (text[5]='C', pattern[2]='B')
... and so on
Every time we mismatch, naive throws away information and starts over. But when we mismatch, we know what we just compared. KMP remembers this.
The Prefix Table
KMP builds a table that answers: "I just mismatched at position j in the pattern. Where should I resume?"
For "AABA":
Pattern: A A B A
Index: 0 1 2 3
π table: 0 1 0 1
What does π[i] mean? It's the length of the longest proper prefix of pattern[0...i] that's also a suffix.
For "AABA":
- At index 0 ('A'): no proper prefix/suffix → π[0]=0
- At index 1 ('AA'): prefix='A', suffix='A', length=1 → π[1]=1
- At index 2 ('AAB'): no prefix=suffix → π[2]=0
- At index 3 ('AABA'): prefix='A', suffix='A', length=1 → π[3]=1
Why does this help? When we mismatch at position j, we've already matched j characters. π[j-1] tells us how many characters at the start of the pattern match the end of what we just matched. We can skip ahead by reusing that knowledge.
Building the Prefix Table
def compute_prefix_table(pattern):
m = len(pattern)
pi = [0] * m
j = 0 # Length of previous longest prefix-suffix
for i in range(1, m):
# Handle mismatch: fall back using the table
while j > 0 and pattern[i] != pattern[j]:
j = pi[j - 1]
# If match, extend the prefix-suffix
if pattern[i] == pattern[j]:
j += 1
pi[i] = j
return pi
Let's trace this for "AABAACAABAA":
i=1: pattern[1]='A' == pattern[0]='A', j=1, π[1]=1
i=2: pattern[2]='B' != pattern[1]='A', j=0, π[2]=0
i=3: pattern[3]='A' == pattern[0]='A', j=1, π[3]=1
i=4: pattern[4]='A' == pattern[1]='A', j=2, π[4]=2
i=5: pattern[5]='C' != pattern[2]='B', fallback: j=π[1]=1
pattern[5]='C' != pattern[1]='A', fallback: j=π[0]=0
pattern[5]='C' != pattern[0]='A', j=0, π[5]=0
i=6: pattern[6]='A' == pattern[0]='A', j=1, π[6]=1
i=7: pattern[7]='A' == pattern[1]='A', j=2, π[7]=2
i=8: pattern[8]='B' == pattern[2]='B', j=3, π[8]=3
i=9: pattern[9]='A' == pattern[3]='A', j=4, π[9]=4
i=10: pattern[10]='A' == pattern[4]='A', j=5, π[10]=5
Final: [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
The Search
Now we use the prefix table to search:
def kmp_search(text, pattern):
n = len(text)
m = len(pattern)
pi = compute_prefix_table(pattern)
j = 0 # Number of characters matched
matches = []
for i in range(n):
# Handle mismatch: use prefix table to skip
while j > 0 and text[i] != pattern[j]:
j = pi[j - 1]
# If match, advance in pattern
if text[i] == pattern[j]:
j += 1
# Found complete match
if j == m:
matches.append(i - m + 1)
j = pi[j - 1] # Continue searching
return matches
Trace through searching for "AABA" in "AABAACAADAABAABA":
i=0: text[0]='A' == pattern[0]='A', j=1
i=1: text[1]='A' == pattern[1]='A', j=2
i=2: text[2]='B' == pattern[2]='B', j=3
i=3: text[3]='A' == pattern[3]='A', j=4 → Match at position 0!
j = π[3] = 1 (we know 'A' matches, continue from there)
i=4: text[4]='A' == pattern[1]='A', j=2
i=5: text[5]='C' != pattern[2]='B', j = π[1] = 1
text[5]='C' != pattern[1]='A', j = π[0] = 0
text[5]='C' != pattern[0]='A', j=0
i=6: text[6]='A' == pattern[0]='A', j=1
i=7: text[7]='A' == pattern[1]='A', j=2
i=8: text[8]='D' != pattern[2]='B', j = π[1] = 1
text[8]='D' != pattern[1]='A', j = π[0] = 0
text[8]='D' != pattern[0]='A', j=0
i=9: text[9]='A' == pattern[0]='A', j=1
i=10: text[10]='A' == pattern[1]='A', j=2
i=11: text[11]='B' == pattern[2]='B', j=3
i=12: text[12]='A' == pattern[3]='A', j=4 → Match at position 9!
j = π[3] = 1
i=13: text[13]='A' == pattern[1]='A', j=2
i=14: text[14]='B' == pattern[2]='B', j=3
i=15: text[15]='A' == pattern[3]='A', j=4 → Match at position 12!
Matches: [0, 9, 12]
The key: when we mismatch at position i after matching j characters, we don't restart from position 0 in the pattern. We jump to position π[j-1] because we know those characters already match.
Why This is O(n+m)
Building the prefix table visits each pattern character once: O(m).
Searching advances through the text once: the outer loop is O(n). The inner while loop might seem problematic, but j can only decrease as many times as it increased, and it increases at most n times. Total: O(n).
Together: O(n+m).
Compare to naive O((n-m+1) × m). For n=1,000,000 and m=1000, that's 999 billion versus 1 million operations.
The Insight
KMP exploits pattern structure. When we've matched "AAB" and then mismatch, we know the last two characters of what we matched were "AB". If the pattern starts with "AB", we can skip ahead. If it doesn't, we might still skip partially.
The prefix table precomputes this information so we don't waste time during the search.
Think of it like reading a book and recognizing you've seen this paragraph before. You don't re-read it word by word; you skip to where the new content starts.
When KMP Shines
Guaranteed linear time: No worst case degradation like some other algorithms.
No backtracking in text: Critical for streaming data where you can't go backward.
Repeated searches: Build the prefix table once, search many times.
The Downside
For short patterns or small alphabets, the overhead of building the prefix table might not be worth it. Naive search can be faster for m < 10 or so.
Boyer-Moore is often faster in practice for longer patterns because it can skip characters, not just avoid backtracking.
But when you need guaranteed linear time, KMP delivers.
