KMP Algorithm

The naive string search rechecks characters you already know. That's the waste. KMP (Knuth-Morris-Pratt) fixes it.

Here's the problem naive has: you're searching for "AABA" in "AABAACAABAA".

Text:    A A B A A C A A B A A
Pattern: A A B A

Position 0:
  A==A, A==A, B==B, A==A → Found!
  
Position 1 (naive restarts here):
  A!=A (text[1] vs pattern[0])
  
Position 2:
  B!=A
  
Position 3:
  A==A, A!=A (text[4] vs pattern[1])

At position 3, you check A==A, then A!=A (because text[4]='A' but pattern[1]='A'... wait, that matches). Let me trace more carefully:

Position 0: A A B A matches → Found at 0!

Position 1: A!=A (we're comparing text[1]='A' with pattern[0]='A', they match)
  Actually: A==A, A!=B (text[2]='B', pattern[1]='A')
  
Position 2: B!=A

Position 3: A==A, A!=A (text[4]='A', pattern[1]='A', they match)
  A==A, A==A, C!=B (text[5]='C', pattern[2]='B')
  
... and so on

Every time we mismatch, naive throws away information and starts over. But when we mismatch, we know what we just compared. KMP remembers this.

The Prefix Table

KMP builds a table that answers: "I just mismatched at position j in the pattern. Where should I resume?"

For "AABA":

Pattern: A A B A
Index:   0 1 2 3
π table: 0 1 0 1

What does π[i] mean? It's the length of the longest proper prefix of pattern[0...i] that's also a suffix.

For "AABA":

At index 0 ('A'): no proper prefix/suffix → π[0]=0
At index 1 ('AA'): prefix='A', suffix='A', length=1 → π[1]=1
At index 2 ('AAB'): no prefix=suffix → π[2]=0
At index 3 ('AABA'): prefix='A', suffix='A', length=1 → π[3]=1

Why does this help? When we mismatch at position j, we've already matched j characters. π[j-1] tells us how many characters at the start of the pattern match the end of what we just matched. We can skip ahead by reusing that knowledge.

Building the Prefix Table

def compute_prefix_table(pattern):
    m = len(pattern)
    pi = [0] * m
    j = 0  # Length of previous longest prefix-suffix
    
    for i in range(1, m):
        # Handle mismatch: fall back using the table
        while j > 0 and pattern[i] != pattern[j]:
            j = pi[j - 1]
        
        # If match, extend the prefix-suffix
        if pattern[i] == pattern[j]:
            j += 1
        
        pi[i] = j
    
    return pi

Let's trace this for "AABAACAABAA":

i=1: pattern[1]='A' == pattern[0]='A', j=1, π[1]=1
i=2: pattern[2]='B' != pattern[1]='A', j=0, π[2]=0
i=3: pattern[3]='A' == pattern[0]='A', j=1, π[3]=1
i=4: pattern[4]='A' == pattern[1]='A', j=2, π[4]=2
i=5: pattern[5]='C' != pattern[2]='B', fallback: j=π[1]=1
     pattern[5]='C' != pattern[1]='A', fallback: j=π[0]=0
     pattern[5]='C' != pattern[0]='A', j=0, π[5]=0
i=6: pattern[6]='A' == pattern[0]='A', j=1, π[6]=1
i=7: pattern[7]='A' == pattern[1]='A', j=2, π[7]=2
i=8: pattern[8]='B' == pattern[2]='B', j=3, π[8]=3
i=9: pattern[9]='A' == pattern[3]='A', j=4, π[9]=4
i=10: pattern[10]='A' == pattern[4]='A', j=5, π[10]=5

Final: [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

The Search

Now we use the prefix table to search:

def kmp_search(text, pattern):
    n = len(text)
    m = len(pattern)
    pi = compute_prefix_table(pattern)
    
    j = 0  # Number of characters matched
    matches = []
    
    for i in range(n):
        # Handle mismatch: use prefix table to skip
        while j > 0 and text[i] != pattern[j]:
            j = pi[j - 1]
        
        # If match, advance in pattern
        if text[i] == pattern[j]:
            j += 1
        
        # Found complete match
        if j == m:
            matches.append(i - m + 1)
            j = pi[j - 1]  # Continue searching
    
    return matches

Trace through searching for "AABA" in "AABAACAADAABAABA":

i=0: text[0]='A' == pattern[0]='A', j=1
i=1: text[1]='A' == pattern[1]='A', j=2
i=2: text[2]='B' == pattern[2]='B', j=3
i=3: text[3]='A' == pattern[3]='A', j=4 → Match at position 0!
     j = π[3] = 1 (we know 'A' matches, continue from there)

i=4: text[4]='A' == pattern[1]='A', j=2
i=5: text[5]='C' != pattern[2]='B', j = π[1] = 1
     text[5]='C' != pattern[1]='A', j = π[0] = 0
     text[5]='C' != pattern[0]='A', j=0

i=6: text[6]='A' == pattern[0]='A', j=1
i=7: text[7]='A' == pattern[1]='A', j=2
i=8: text[8]='D' != pattern[2]='B', j = π[1] = 1
     text[8]='D' != pattern[1]='A', j = π[0] = 0
     text[8]='D' != pattern[0]='A', j=0

i=9: text[9]='A' == pattern[0]='A', j=1
i=10: text[10]='A' == pattern[1]='A', j=2
i=11: text[11]='B' == pattern[2]='B', j=3
i=12: text[12]='A' == pattern[3]='A', j=4 → Match at position 9!
      j = π[3] = 1

i=13: text[13]='A' == pattern[1]='A', j=2
i=14: text[14]='B' == pattern[2]='B', j=3
i=15: text[15]='A' == pattern[3]='A', j=4 → Match at position 12!

Matches: [0, 9, 12]

The key: when we mismatch at position i after matching j characters, we don't restart from position 0 in the pattern. We jump to position π[j-1] because we know those characters already match.

Why This is O(n+m)

Building the prefix table visits each pattern character once: O(m).

Searching advances through the text once: the outer loop is O(n). The inner while loop might seem problematic, but j can only decrease as many times as it increased, and it increases at most n times. Total: O(n).

Together: O(n+m).

Compare to naive O((n-m+1) × m). For n=1,000,000 and m=1000, that's 999 billion versus 1 million operations.

The Insight

KMP exploits pattern structure. When we've matched "AAB" and then mismatch, we know the last two characters of what we matched were "AB". If the pattern starts with "AB", we can skip ahead. If it doesn't, we might still skip partially.

The prefix table precomputes this information so we don't waste time during the search.

Think of it like reading a book and recognizing you've seen this paragraph before. You don't re-read it word by word; you skip to where the new content starts.

When KMP Shines

Guaranteed linear time: No worst case degradation like some other algorithms.

No backtracking in text: Critical for streaming data where you can't go backward.

Repeated searches: Build the prefix table once, search many times.

The Downside

For short patterns or small alphabets, the overhead of building the prefix table might not be worth it. Naive search can be faster for m < 10 or so.

Boyer-Moore is often faster in practice for longer patterns because it can skip characters, not just avoid backtracking.

But when you need guaranteed linear time, KMP delivers.