Rabin-Karp Algorithm

In this lesson you will:

  • Understand how a rolling hash computes the next substring hash in O(1)
  • Learn why Rabin-Karp needs a verification step (hash collisions)
  • See when Rabin-Karp shines: multiple patterns, plagiarism detection

KMP avoids backtracking by using a prefix table. Rabin-Karp takes a different approach: use hashing to compare substrings in constant time.

The idea: comparing strings character-by-character is slow. Comparing numbers is fast. So convert each substring to a number (a hash) and compare those.

Hash Functions for Strings

A hash function maps a string to a number. For "abc", we might do:

hash("abc") = 'a' × 31² + 'b' × 31¹ + 'c' × 31⁰
            = 97 × 961 + 98 × 31 + 99 × 1
            = 93,217 + 3,038 + 99
            = 96,354

Why 31? It's prime, it's small enough that hash values don't overflow quickly, and it distributes hash values well. You could use 29, 37, 101: any prime works.
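The polynomial hash above is a few lines of Python using Horner's rule (the function name `poly_hash` is just for illustration):

```python
def poly_hash(s, base=31):
    # Horner's rule: ((c0 * base + c1) * base + c2) ...
    h = 0
    for char in s:
        h = h * base + ord(char)
    return h

print(poly_hash("abc"))  # 96354, matching the worked example above
```
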

Now to search for "bc" in "abcd", we:

  1. Hash the pattern: hash("bc") = 98 × 31 + 99 = 3,137
  2. Hash the first window: hash("ab") = 97 × 31 + 98 = 3,105 ≠ 3,137
  3. Hash the next window: hash("bc") = 98 × 31 + 99 = 3,137 ✓ Match!

But computing hash("bc") from scratch is O(m). We'd be back to O((n-m+1) × m) total time.

Rolling Hash

The trick: compute the next hash from the previous one in O(1).

When we slide the window from "ab" to "bc", we:

  1. Remove 'a' (the leftmost character)
  2. Shift everything left (multiply by base)
  3. Add 'c' (the new rightmost character)
Starting from:

hash("ab") = 'a' × 31¹ + 'b' × 31⁰ = 97×31 + 98 = 3,105

To get hash("bc"):
1. Subtract 'a' × 31¹:      3,105 - (97 × 31) = 3,105 - 3,007 = 98
2. Multiply by 31 (shift):  98 × 31 = 3,038
3. Add 'c':                 3,038 + 99 = 3,137

In one line:

hash("bc") = (hash("ab") - 'a' × 31^(m-1)) × 31 + 'c'

This is O(1) per window, giving O(n) total for n windows.
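The one-line update can be checked directly against the worked numbers (a small sketch; variable names are illustrative):

```python
BASE = 31
m = 2
old = 97 * BASE + 98  # hash("ab") = 3105

# Remove 'a' (weight BASE^(m-1)), shift left, add 'c'
new = (old - ord('a') * BASE ** (m - 1)) * BASE + ord('c')
print(new)  # 3137, which is exactly hash("bc") = 98*31 + 99
```
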

The Algorithm

def rabin_karp(text, pattern):
    BASE = 31
    n, m = len(text), len(pattern)
    
    if m == 0 or m > n:  # empty pattern or pattern longer than text
        return []
    
    # Compute hash of pattern
    pattern_hash = 0
    for char in pattern:
        pattern_hash = pattern_hash * BASE + ord(char)
    
    # Compute hash of first window
    text_hash = 0
    for i in range(m):
        text_hash = text_hash * BASE + ord(text[i])
    
    # Precompute BASE^(m-1) for rolling hash
    h = BASE ** (m - 1)
    
    matches = []
    
    # Check first window
    if text_hash == pattern_hash:
        if text[:m] == pattern:  # Verify to avoid false positives
            matches.append(0)
    
    # Roll through remaining windows
    for i in range(1, n - m + 1):
        # Remove leading character, add trailing character
        text_hash = (text_hash - ord(text[i-1]) * h) * BASE + ord(text[i+m-1])
        
        if text_hash == pattern_hash:
            if text[i:i+m] == pattern:  # Verify
                matches.append(i)
    
    return matches

The Verification Step

Why do we verify with text[i:i+m] == pattern even when hashes match?

Hash collisions: Different strings can hash to the same value.

For example, with BASE=31 and a small modulo (chosen deliberately to force collisions):

MOD = 100
hash("AB") = (65 × 31 + 66) % 100 = 2,081 % 100 = 81
hash("DI") = (68 × 31 + 73) % 100 = 2,181 % 100 = 81

"AB" and "DI" are different strings, yet they hash to the same value: a collision.

Collisions are rare with good hash functions and large modulos, but they can happen. So when hashes match, we verify character-by-character.
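To see a collision concretely, shrink the modulus: with MOD = 100, the two-character strings "AB" and "DI" hash identically (a small sketch; `small_hash` is an illustrative name):

```python
BASE, MOD = 31, 100  # tiny modulus, chosen to force a collision

def small_hash(s):
    h = 0
    for char in s:
        h = (h * BASE + ord(char)) % MOD
    return h

print(small_hash("AB"), small_hash("DI"))  # both 81, yet "AB" != "DI"
```

This is exactly why the verification step exists: a hash match is a candidate, not a confirmed match.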

Complexity

Time: O(n + m) average case.

  • Compute pattern hash: O(m)
  • Compute first window hash: O(m)
  • Roll through windows: O(n) rolling + O(k×m) verification, where k is number of hash matches

If k is small (few collisions), verification is negligible. Average: O(n+m).

Worst case: O((n-m+1) × m) if every window has a hash collision and we verify every time. This is unlikely with a good hash function.

Space: O(1) extra space.

Practical Improvements

Use modulo to prevent overflow:

MOD = 10**9 + 7  # Large prime

pattern_hash = 0
for char in pattern:
    pattern_hash = (pattern_hash * BASE + ord(char)) % MOD
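With a modulus, the rolling update needs two care points: precompute 31^(m-1) mod MOD (Python's three-argument pow does this efficiently), and keep the result non-negative (Python's % already guarantees this for a positive modulus). A sketch, with `roll` as an illustrative helper name:

```python
BASE, MOD = 31, 10**9 + 7

def roll(prev_hash, old_char, new_char, h):
    # h = pow(BASE, m - 1, MOD), precomputed once per search.
    # Python's % returns a non-negative result for MOD > 0.
    return ((prev_hash - ord(old_char) * h) * BASE + ord(new_char)) % MOD

m = 2
h = pow(BASE, m - 1, MOD)
print(roll(3105, 'a', 'c', h))  # 3137: hash("ab") -> hash("bc")
```
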

Use multiple hash functions (double hashing) to reduce collisions:

BASE1, MOD1 = 31, 10**9 + 7
BASE2, MOD2 = 37, 10**9 + 9

def compute_hash(s, base, mod):
    h = 0
    for char in s:
        h = (h * base + ord(char)) % mod
    return h

hash1 = compute_hash(pattern, BASE1, MOD1)
hash2 = compute_hash(pattern, BASE2, MOD2)

# A window is a candidate only if both hashes agree;
# still verify character-by-character before reporting it.
if text_hash1 == hash1 and text_hash2 == hash2:
    ...  # verify text[i:i+m] == pattern

When to Use Rabin-Karp

Multiple pattern search: If you're searching for many patterns, compute all their hashes first, then roll through the text once checking against all hashes. This is more efficient than running KMP multiple times.
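The multi-pattern idea can be sketched as follows: hash every pattern once, then roll a single window over the text and check membership in a dictionary. This is a sketch under the assumption that all patterns share one length m; names like `multi_pattern_search` are illustrative:

```python
def multi_pattern_search(text, patterns):
    """Find all occurrences of equal-length patterns in one pass."""
    BASE = 31
    m = len(patterns[0])  # assumption: every pattern has length m
    n = len(text)
    if m > n:
        return []

    # Hash each pattern once; keep originals to verify on a hash hit
    pattern_hashes = {}
    for p in patterns:
        h = 0
        for char in p:
            h = h * BASE + ord(char)
        pattern_hashes.setdefault(h, []).append(p)

    # Hash the first window, then roll
    text_hash = 0
    for i in range(m):
        text_hash = text_hash * BASE + ord(text[i])
    power = BASE ** (m - 1)

    matches = []
    for i in range(n - m + 1):
        if i > 0:  # slide the window forward by one
            text_hash = (text_hash - ord(text[i-1]) * power) * BASE + ord(text[i+m-1])
        if text_hash in pattern_hashes and text[i:i+m] in pattern_hashes[text_hash]:
            matches.append((i, text[i:i+m]))
    return matches

print(multi_pattern_search("abcabd", ["ab", "bd"]))  # [(0, 'ab'), (3, 'ab'), (4, 'bd')]
```
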

Plagiarism detection: Find common substrings between documents. Hash all k-length substrings of both documents, find matches.

Rolling hash in other problems: The rolling hash technique appears in problems beyond string matching—like finding duplicate files, comparing genomic sequences, etc.

vs KMP

KMP guarantees O(n+m) worst case. Rabin-Karp is O(n+m) average, O(nm) worst.

KMP never has false positives. Rabin-Karp might due to hash collisions.

Rabin-Karp is simpler to implement and understand. No prefix table, just hash math.

Rabin-Karp extends easily to 2D pattern matching, multiple patterns, etc.

The Rolling Hash Trick

The real value of Rabin-Karp isn't just for string matching. The rolling hash technique is a general tool.

Finding the longest duplicate substring? Hash all substrings of length k using rolling hash, check for duplicates in O(n).
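For a fixed k, that check can be sketched with a rolling hash plus a dictionary (Python's unbounded ints mean no modulus is needed here; `has_duplicate_k_substring` is an illustrative name):

```python
def has_duplicate_k_substring(s, k):
    """Return a length-k substring that appears at least twice in s, or None."""
    BASE = 31
    n = len(s)
    if k <= 0 or k > n:
        return None

    # Hash the first window
    h = 0
    for i in range(k):
        h = h * BASE + ord(s[i])
    power = BASE ** (k - 1)

    seen = {h: [0]}  # hash -> start indices, verified on collision
    for i in range(1, n - k + 1):
        h = (h - ord(s[i-1]) * power) * BASE + ord(s[i+k-1])
        if h in seen:
            for j in seen[h]:  # verify to rule out hash collisions
                if s[j:j+k] == s[i:i+k]:
                    return s[i:i+k]
            seen[h].append(i)
        else:
            seen[h] = [i]
    return None

print(has_duplicate_k_substring("banana", 3))  # 'ana' appears twice
```
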

Comparing substrings of two long strings for equality? Precompute prefix hashes once in O(n); after that, the hash of any substring is available in O(1).
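One common way to set this up (a sketch; function names are illustrative): store prefix hashes and powers of the base, then derive any substring's hash from two array lookups.

```python
BASE, MOD = 31, 10**9 + 7

def prefix_hashes(s):
    """pref[i] = hash of s[:i]; pw[i] = BASE^i (both mod MOD)."""
    n = len(s)
    pref = [0] * (n + 1)
    pw = [1] * (n + 1)
    for i, char in enumerate(s):
        pref[i+1] = (pref[i] * BASE + ord(char)) % MOD
        pw[i+1] = (pw[i] * BASE) % MOD
    return pref, pw

def substring_hash(pref, pw, l, r):
    """Hash of s[l:r] in O(1): strip the prefix before l, shifted into place."""
    return (pref[r] - pref[l] * pw[r - l]) % MOD

pref, pw = prefix_hashes("abcab")
# s[0:2] == "ab" == s[3:5], so their hashes agree
print(substring_hash(pref, pw, 0, 2) == substring_hash(pref, pw, 3, 5))  # True
```

Equal hashes are still only strong evidence, not proof; for a guaranteed answer, verify the characters as in the main algorithm.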

It's a powerful primitive that shows up in many algorithms.

© 2026 forEach. All rights reserved.
