String Hashing

Comparing two strings character-by-character is slow. For strings of length n, that's O(n). Do it millions of times, and you're in trouble.

String hashing converts strings to numbers. Comparing numbers is O(1). That's the idea.

The Problem

You have a million strings and need to check if two are equal. Naive comparison:

def are_equal(s1, s2):
    if len(s1) != len(s2):
        return False
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            return False
    return True

For a million comparisons of 1000-character strings, that's a billion character comparisons.

Hash them instead:

def are_equal_hash(s1, s2):
    return hash(s1) == hash(s2)  # O(1) after hashing

Of course, you need to compute the hashes first. But if you're comparing repeatedly, you hash once and compare many times.

Polynomial Hash

The standard way to hash strings uses polynomial evaluation:

hash("abc") = 'a' × p² + 'b' × p¹ + 'c' × p⁰

where p is a base (usually a prime like 31).

For "abc":

hash = 97 × 31² + 98 × 31 + 99
     = 97 × 961 + 98 × 31 + 99
     = 93,217 + 3,038 + 99
     = 96,354

Why polynomial? Because it treats the string as a number in base-p. Just like "123" in base 10 is 1×10² + 2×10 + 3.

Preventing Overflow

For long strings, these numbers get huge. "abcdefghij" (10 chars) gives hash values in the billions.

Use modulo arithmetic:

hash("abc") = (97 × 31² + 98 × 31 + 99) mod M

where M is a large prime like 10⁹+7.

def hash_string(s):
    BASE = 31
    MOD = 10**9 + 7
    
    hash_val = 0
    for char in s:
        hash_val = (hash_val * BASE + ord(char)) % MOD
    
    return hash_val

For "abc":

hash = 0
hash = (0 × 31 + 97) % MOD = 97
hash = (97 × 31 + 98) % MOD = 3,105
hash = (3,105 × 31 + 99) % MOD = 96,354

Hash Collisions

Different strings can hash to the same value. That's a collision.

With MOD = 10⁹+7, there are about a billion possible hash values. If you have a million strings, collisions are possible but rare (by the birthday paradox).

When hashes match, verify character-by-character:

def are_equal_safe(s1, s2):
    if hash_string(s1) != hash_string(s2):
        return False  # Definitely different
    return s1 == s2  # Verify to handle collisions

Substring Hashing with Prefix Arrays

Here's where it gets useful. You want to hash every substring of a long string quickly.

Precompute prefix hashes:

def build_prefix_hash(s):
    BASE = 31
    MOD = 10**9 + 7
    n = len(s)
    
    prefix = [0] * (n + 1)
    power = [1] * (n + 1)
    
    for i in range(n):
        prefix[i+1] = (prefix[i] * BASE + ord(s[i])) % MOD
        power[i+1] = (power[i] * BASE) % MOD
    
    return prefix, power

Now get the hash of s[L:R] in O(1):

def substring_hash(prefix, power, L, R):
    MOD = 10**9 + 7
    # Hash of s[L:R] = prefix[R] - prefix[L] × BASE^(R-L)
    hash_val = (prefix[R] - prefix[L] * power[R-L]) % MOD
    return hash_val if hash_val >= 0 else hash_val + MOD

For s="abcde", to get hash of "bcd" (indices 1-4):

prefix[4] = hash("abcd")
prefix[1] = hash("a")

hash("bcd") = hash("abcd") - hash("a") × BASE³
            = (prefix[4] - prefix[1] × power[3]) mod MOD

This works because:

hash("abcd") = a×p³ + b×p² + c×p + d
hash("a")    = a

hash("abcd") - hash("a")×p³ = b×p² + c×p + d = hash("bcd")

Double Hashing

To reduce collision probability, use two hash functions with different bases and modulos:

def double_hash(s):
    BASE1, MOD1 = 31, 10**9 + 7
    BASE2, MOD2 = 37, 10**9 + 9
    
    hash1 = hash2 = 0
    for char in s:
        hash1 = (hash1 * BASE1 + ord(char)) % MOD1
        hash2 = (hash2 * BASE2 + ord(char)) % MOD2
    
    return (hash1, hash2)

Probability of collision with double hashing is roughly 1/(MOD1 × MOD2) ≈ 10⁻¹⁸. Effectively zero.

Applications

Finding duplicate substrings: Hash all k-length substrings, check for duplicates in O(n).

def find_duplicate_substring(s, k):
    prefix, power = build_prefix_hash(s)
    seen = set()
    
    for i in range(len(s) - k + 1):
        h = substring_hash(prefix, power, i, i + k)
        if h in seen:
            return s[i:i+k]  # Found duplicate
        seen.add(h)
    
    return None

Longest common substring: Hash all substrings of both strings, find the longest match.

String equality in data structures: Use hashes as keys in hash tables for O(1) lookup instead of O(n) string comparison.

Rolling Hash vs Substring Hash

Rolling hash (Rabin-Karp): Compute next hash from previous in O(1) by removing left character and adding right character.

Substring hash (prefix arrays): Precompute all prefix hashes once, then get any substring hash in O(1).

Rolling hash is better for streaming data. Prefix arrays are better when you need random access to many substrings.

Choosing Parameters

Base (p): Use a prime slightly larger than alphabet size. For lowercase letters (26), use 31 or 29.

Modulo (M): Use a large prime. 10⁹+7 is popular. For critical applications, use double hashing with two large primes.

Overflow: In languages without automatic big integers, be careful with multiplication. Use (a * b) % MOD carefully—sometimes a * b overflows before the modulo.

When Hashing Wins

String hashing shines when you:

Compare many strings repeatedly
Need substring equality checks in O(1)
Search for patterns in text (like Rabin-Karp)
Find longest repeated substring

It's a power tool that makes string problems tractable. Without it, many algorithms would be too slow to be practical.