- Understand requirements: deterministic, fast O(1), uniform distribution
- Learn polynomial rolling hash for strings with prime multiplier
- Implement __hash__() and __eq__() for custom objects as keys
Hash Functions
The hash function is the heart of a hash table. It determines everything: speed, collisions, distribution.
A bad hash function? Your O(1) hash table becomes O(n). All operations slow.
A good hash function? Fast, uniform distribution, minimal collisions.
What Makes a Good Hash Function
1. Deterministic: Same input → same output, always.
hash("Alice") # Always gives same number
If hash("Alice") gave different numbers each time, you couldn't find Alice's value later.
2. Fast: O(1) computation.
No expensive operations. Just arithmetic and bitwise ops.
3. Uniform distribution: Spread keys evenly across array.
Bad: All keys hash to index 0. Everything collides.
Good: Keys spread across [0, 1, 2, ..., size-1] evenly.
4. Minimize collisions: Different keys should rarely hash to same index.
Impossible to avoid completely (pigeonhole principle), but good functions minimize them.
Simple Hash for Integers
For integers, just use modulo:
def hash_int(key, size):
return key % size
Keys [12, 25, 38] with size 10:
- 12 % 10 = 2
- 25 % 10 = 5
- 38 % 10 = 8
No collisions if keys well-distributed.
Problem: If all keys are multiples of 10 (10, 20, 30), all hash to 0. Terrible.
Hash for Strings
Strings are sequences of characters. Convert to number.
Polynomial rolling hash:
def hash_string(s, size):
hash_val = 0
prime = 31 # Common choice
for char in s:
hash_val = (hash_val * prime + ord(char)) % size
return hash_val
For "cat":
- hash = 0
- hash = (0 * 31 + ord('c')) % size = 99 % size
- hash = (99 * 31 + ord('a')) % size = ...
- hash = (... * 31 + ord('t')) % size = final
Why 31? Prime number. Reduces patterns. Experimentally good distribution.
Division Method
def hash_division(key, size):
return key % size
Simple. Fast. Works if size is prime and keys are random.
Problem: If size is power of 2 (e.g., 16), only uses lower bits. Patterns in keys → patterns in hashes.
Multiplication Method
def hash_multiplication(key, size):
A = 0.6180339887 # (sqrt(5) - 1) / 2, golden ratio
return int(size * ((key * A) % 1))
Multiply by constant, take fractional part, multiply by size.
Less sensitive to size choice. Can use power of 2.
Universal Hashing
Choose hash function randomly from a family of functions.
import random
class UniversalHash:
def __init__(self, size):
self.size = size
self.p = 2**31 - 1 # Large prime
self.a = random.randint(1, self.p - 1)
self.b = random.randint(0, self.p - 1)
def hash(self, key):
return ((self.a * key + self.b) % self.p) % self.size
Different instances use different (a, b). Attacker can't predict collisions. Security benefit.
Python's hash()
Python has built-in hash() function.
print(hash("Alice")) # Some large integer
print(hash(42)) # 42 (for small ints, hash is identity)
print(hash((1, 2))) # Hash of tuple
Then modulo to get array index:
index = hash(key) % array_size
Python's hash is well-designed. Fast, good distribution.
Hashing Custom Objects
Want to use your own class as key? Implement __hash__() and __eq__().
class Point:
def __init__(self, x, y):
self.x = x
self.y = y
def __hash__(self):
return hash((self.x, self.y))
def __eq__(self, other):
return self.x == other.x and self.y == other.y
points = {}
p1 = Point(1, 2)
points[p1] = "origin-ish"
p2 = Point(1, 2)
print(points[p2]) # "origin-ish" (p1 and p2 are equal)
Hash based on immutable fields. If object changes, hash shouldn't.
Avalanche Effect
Small change in key → large change in hash.
print(hash("Alice")) # e.g., 12345678
print(hash("Alicd")) # e.g., 98765432 (very different)
One character change, completely different hash. Good for distribution.
Common Mistakes
Using mutable objects as keys: If object changes, hash changes. Can't find it anymore.
# Wrong
my_dict = {}
my_list = [1, 2, 3]
my_dict[my_list] = "value" # TypeError: unhashable type: 'list'
# Right: use tuple (immutable)
my_tuple = (1, 2, 3)
my_dict[my_tuple] = "value" # Works
Ignoring distribution: Bad hash function clusters keys.
# Bad: All even numbers hash to 0, odd to 1
def bad_hash(key, size):
return key % 2
Not handling negative numbers:
# Wrong
hash(-5) % 10 # Might give negative index in some languages
# Right
abs(hash(-5)) % 10 # Ensure positive
The Pattern
Hash functions convert keys to array indices:
- Must be deterministic and fast
- Should distribute uniformly
- Use math tricks (primes, golden ratio)
- Python's built-in hash() is solid
Understanding hash functions means understanding why collisions happen and how to minimize them.
Master hash functions, and you'll design efficient hash tables.
