- Understand polynomial hash function and why it treats strings as base-p numbers
- Learn prefix hash arrays for O(1) substring hash queries
- See when to use double hashing to reduce collision probability
String Hashing
Comparing two strings character-by-character is slow. For strings of length n, that's O(n). Do it millions of times, and you're in trouble.
String hashing converts strings to numbers. Comparing numbers is O(1). That's the idea.
The Problem
You have a million strings and need to check if two are equal. Naive comparison:
def are_equal(s1, s2):
if len(s1) != len(s2):
return False
for i in range(len(s1)):
if s1[i] != s2[i]:
return False
return True
For a million comparisons of 1000-character strings, that's a billion character comparisons.
Hash them instead:
def are_equal_hash(s1, s2):
return hash(s1) == hash(s2) # O(1) after hashing
Of course, you need to compute the hashes first. But if you're comparing repeatedly, you hash once and compare many times.
Polynomial Hash
The standard way to hash strings uses polynomial evaluation:
hash("abc") = 'a' × p² + 'b' × p¹ + 'c' × p⁰
where p is a base (usually a prime like 31).
For "abc":
hash = 97 × 31² + 98 × 31 + 99
= 97 × 961 + 98 × 31 + 99
= 93,217 + 3,038 + 99
= 96,354
Why polynomial? Because it treats the string as a number in base-p. Just like "123" in base 10 is 1×10² + 2×10 + 3.
Preventing Overflow
For long strings, these numbers get huge. "abcdefghij" (10 chars) gives hash values in the billions.
Use modulo arithmetic:
hash("abc") = (97 × 31² + 98 × 31 + 99) mod M
where M is a large prime like 10⁹+7.
def hash_string(s):
BASE = 31
MOD = 10**9 + 7
hash_val = 0
for char in s:
hash_val = (hash_val * BASE + ord(char)) % MOD
return hash_val
For "abc":
hash = 0
hash = (0 × 31 + 97) % MOD = 97
hash = (97 × 31 + 98) % MOD = 3,105
hash = (3,105 × 31 + 99) % MOD = 96,354
Hash Collisions
Different strings can hash to the same value. That's a collision.
With MOD = 10⁹+7, there are about a billion possible hash values. If you have a million strings, collisions are possible but rare (by the birthday paradox).
When hashes match, verify character-by-character:
def are_equal_safe(s1, s2):
if hash_string(s1) != hash_string(s2):
return False # Definitely different
return s1 == s2 # Verify to handle collisions
Substring Hashing with Prefix Arrays
Here's where it gets useful. You want to hash every substring of a long string quickly.
Precompute prefix hashes:
def build_prefix_hash(s):
BASE = 31
MOD = 10**9 + 7
n = len(s)
prefix = [0] * (n + 1)
power = [1] * (n + 1)
for i in range(n):
prefix[i+1] = (prefix[i] * BASE + ord(s[i])) % MOD
power[i+1] = (power[i] * BASE) % MOD
return prefix, power
Now get the hash of s[L:R] in O(1):
def substring_hash(prefix, power, L, R):
MOD = 10**9 + 7
# Hash of s[L:R] = prefix[R] - prefix[L] × BASE^(R-L)
hash_val = (prefix[R] - prefix[L] * power[R-L]) % MOD
return hash_val if hash_val >= 0 else hash_val + MOD
For s="abcde", to get hash of "bcd" (indices 1-4):
prefix[4] = hash("abcd")
prefix[1] = hash("a")
hash("bcd") = hash("abcd") - hash("a") × BASE³
= (prefix[4] - prefix[1] × power[3]) mod MOD
This works because:
hash("abcd") = a×p³ + b×p² + c×p + d
hash("a") = a
hash("abcd") - hash("a")×p³ = b×p² + c×p + d = hash("bcd")
Double Hashing
To reduce collision probability, use two hash functions with different bases and modulos:
def double_hash(s):
BASE1, MOD1 = 31, 10**9 + 7
BASE2, MOD2 = 37, 10**9 + 9
hash1 = hash2 = 0
for char in s:
hash1 = (hash1 * BASE1 + ord(char)) % MOD1
hash2 = (hash2 * BASE2 + ord(char)) % MOD2
return (hash1, hash2)
Probability of collision with double hashing is roughly 1/(MOD1 × MOD2) ≈ 10⁻¹⁸. Effectively zero.
Applications
Finding duplicate substrings: Hash all k-length substrings, check for duplicates in O(n).
def find_duplicate_substring(s, k):
prefix, power = build_prefix_hash(s)
seen = set()
for i in range(len(s) - k + 1):
h = substring_hash(prefix, power, i, i + k)
if h in seen:
return s[i:i+k] # Found duplicate
seen.add(h)
return None
Longest common substring: Hash all substrings of both strings, find the longest match.
String equality in data structures: Use hashes as keys in hash tables for O(1) lookup instead of O(n) string comparison.
Rolling Hash vs Substring Hash
Rolling hash (Rabin-Karp): Compute next hash from previous in O(1) by removing left character and adding right character.
Substring hash (prefix arrays): Precompute all prefix hashes once, then get any substring hash in O(1).
Rolling hash is better for streaming data. Prefix arrays are better when you need random access to many substrings.
Choosing Parameters
Base (p): Use a prime slightly larger than alphabet size. For lowercase letters (26), use 31 or 29.
Modulo (M): Use a large prime. 10⁹+7 is popular. For critical applications, use double hashing with two large primes.
Overflow: In languages without automatic big integers, be careful with multiplication. Use (a * b) % MOD carefully—sometimes a * b overflows before the modulo.
When Hashing Wins
String hashing shines when you:
- Compare many strings repeatedly
- Need substring equality checks in O(1)
- Search for patterns in text (like Rabin-Karp)
- Find longest repeated substring
It's a power tool that makes string problems tractable. Without it, many algorithms would be too slow to be practical.
