- Understand how Huffman builds prefix-free codes by merging smallest frequencies
- See why frequent characters get short codes, rare ones get long codes
- Learn why the greedy choice (merge smallest) produces optimal compression
Huffman Coding
You're sending the text "MISSISSIPPI" over a network. Each character in ASCII takes 8 bits. That's 11 characters × 8 bits = 88 bits total.
But look at the frequencies:
- I: 4 times
- S: 4 times
- P: 2 times
- M: 1 time
What if frequent characters used shorter codes? 'I' could be "0", 'S' could be "10", 'P' could be "110", 'M' could be "111".
Encoded: "111 0 10 10 0 10 10 0 110 110 0" = "111010100101001101100" (21 bits—way less than 88!)
That's Huffman coding. Build variable-length codes where frequent characters get short codes, rare ones get long codes.
The Problem: Uniquely Decodable Codes
You can't just assign random variable-length codes. "I"=0, "S"=1, "P"=10 has a problem.
Encode "IP": 0 + 10 = "010"
Decode "010": Could be "I" + "S" + "I" (0, 1, 0) or "I" + "P" (0, 10). Ambiguous!
The solution: prefix-free codes. No code is a prefix of another. If "I"=0, then nothing else can start with 0.
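To make the ambiguity concrete, here is a small sketch (the helper `all_decodings` is illustrative, not from any library) that enumerates every way to parse a bit string under a code table:

```python
def all_decodings(bits, codes):
    """Return every way to parse a bit string under a code table."""
    if bits == "":
        return [[]]
    results = []
    for char, code in codes.items():
        if bits.startswith(code):
            for rest in all_decodings(bits[len(code):], codes):
                results.append([char] + rest)
    return results

# Non-prefix-free: "I"=0 is a prefix of nothing, but "S"=1 is a prefix of "P"=10
ambiguous = {"I": "0", "S": "1", "P": "10"}
print(all_decodings("010", ambiguous))    # two parses: [I, S, I] and [I, P]

# Prefix-free: no code is a prefix of another
prefix_free = {"I": "0", "S": "10", "P": "110", "M": "111"}
print(all_decodings("010", prefix_free))  # exactly one parse: [I, S]
```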
Huffman builds prefix-free codes by using a binary tree. Each character is a leaf. Left edges = 0, right edges = 1. Codes are paths from root to leaves.
Building the Huffman Tree
For "MISSISSIPPI":
Frequencies: M=1, P=2, I=4, S=4
Greedy strategy: repeatedly merge the two smallest frequencies.
Step 1: M(1) and P(2) are smallest. Merge them into a node with frequency 3.
     (3)
    /   \
  M(1)   P(2)
Remaining: I(4), S(4), Node(3)
Step 2: Node(3) and I(4) are smallest. Merge into frequency 7.
       (7)
      /   \
    (3)   I(4)
   /   \
 M(1)   P(2)
Remaining: S(4), Node(7)
Step 3: Merge S(4) and Node(7) into root with frequency 11.
        (11)
       /    \
    S(4)    (7)
           /   \
         (3)   I(4)
        /   \
      M(1)   P(2)
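The three merge steps can be reproduced with Python's heapq (a sketch; the labels printed for internal nodes are made up for readability):

```python
import heapq

freq = {"M": 1, "P": 2, "I": 4, "S": 4}
heap = [(count, name) for name, count in freq.items()]
heapq.heapify(heap)

# Repeatedly merge the two smallest frequencies
while len(heap) > 1:
    f1, n1 = heapq.heappop(heap)  # smallest frequency
    f2, n2 = heapq.heappop(heap)  # second smallest
    print(f"merge {n1}({f1}) + {n2}({f2}) -> ({f1 + f2})")
    heapq.heappush(heap, (f1 + f2, f"({n1}+{n2})"))
```

This prints the same merge order as the steps above: M+P first, then that node with I, then S with the rest, leaving a single root of frequency 11.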
Assign codes by paths:
- S: left from root = 0
- I: right, then right = 11
- P: right, then left, then right = 101
- M: right, then left, then left = 100
Codes: S=0, I=11, M=100, P=101
Encode "MISSISSIPPI":
M I S S I S S I P P I
100 11 0 0 11 0 0 11 101 101 11
= 100110011001110110111
21 bits (vs 88 bits in ASCII).
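You can check the 21-bit total directly from the frequency table: each character contributes frequency × code length.

```python
freq = {"I": 4, "S": 4, "P": 2, "M": 1}
code_len = {"S": 1, "I": 2, "M": 3, "P": 3}

# 4*2 + 4*1 + 2*3 + 1*3 = 8 + 4 + 6 + 3 = 21
total = sum(freq[c] * code_len[c] for c in freq)
print(total)  # 21
```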
Decode: Start at root, follow bits. When you hit a leaf, output that character and restart at root.
100: root→right→left→left = M
11: root→right→right = I
0: root→left = S
0: root→left = S
11: root→right→right = I
... and so on
Prefix-free property ensures unique decoding.
The Algorithm
import heapq
import itertools

def huffman_coding(text):
    # Count frequencies
    freq = {}
    for char in text:
        freq[char] = freq.get(char, 0) + 1

    # Build min heap of (frequency, tiebreaker, node), where node = (char, left, right).
    # The tiebreaker keeps heapq from comparing nodes directly when frequencies tie.
    tiebreak = itertools.count()
    heap = []
    for char, count in freq.items():
        heapq.heappush(heap, (count, next(tiebreak), (char, None, None)))

    # Build Huffman tree: repeatedly merge the two smallest frequencies
    while len(heap) > 1:
        freq1, _, node1 = heapq.heappop(heap)
        freq2, _, node2 = heapq.heappop(heap)
        merged = (None, node1, node2)  # internal node: no character, two children
        heapq.heappush(heap, (freq1 + freq2, next(tiebreak), merged))

    # Root of tree
    _, _, root = heap[0]

    # Generate codes by traversing tree
    codes = {}
    def generate_codes(node, code):
        char, left, right = node
        if char is not None:  # Leaf node
            codes[char] = code or '0'  # a one-character text still needs one bit
        else:
            generate_codes(left, code + '0')
            generate_codes(right, code + '1')
    generate_codes(root, '')
    return codes

# Encode
codes = huffman_coding("MISSISSIPPI")
encoded = ''.join(codes[c] for c in "MISSISSIPPI")

# Decode (given tree/codes)
# Traverse tree following bits until hitting a leaf
Time: O(N + k log k), where N is the length of the text (counting frequencies) and k is the number of unique characters (heap operations for building the tree).
Why This is Optimal
Huffman coding produces the optimal prefix-free code—no other prefix-free code can achieve shorter average length.
Proof sketch: The two least frequent characters should have the longest codes and differ only in the last bit (siblings in the tree). If not, you could swap them to reduce total length. Huffman's greedy choice (merge smallest) ensures this.
By induction, if the greedy choice is optimal for the smallest pair, and the remaining problem has the same structure, Huffman builds the optimal tree.
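A quick numeric check of the exchange argument, using the MISSISSIPPI frequencies: if a rare character sat higher in the tree (shorter code) than a frequent one, swapping their depths could only shrink the total. The two length assignments below are illustrative; both satisfy the prefix-free length constraint.

```python
freq = {"I": 4, "S": 4, "P": 2, "M": 1}

# Suboptimal: rare M gets the short code, frequent I a long one
bad_lengths  = {"M": 1, "S": 2, "P": 3, "I": 3}
# After swapping M's and I's depths, as the exchange argument suggests
good_lengths = {"I": 1, "S": 2, "P": 3, "M": 3}

bad  = sum(freq[c] * bad_lengths[c]  for c in freq)  # 1 + 8 + 6 + 12 = 27
good = sum(freq[c] * good_lengths[c] for c in freq)  # 4 + 8 + 6 + 3  = 21
print(bad, good)  # 27 21
```

The swap saves 6 bits, and 21 matches the Huffman total—exactly what optimality requires.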
Example: Compressing "AAAAAABCCCCCCDDEEEEE"
Frequencies:
A: 6
C: 6
E: 5
D: 2
B: 1
Build tree by merging smallest:
Merge B(1) + D(2) = Node1(3)
Merge Node1(3) + E(5) = Node2(8)
Merge A(6) + C(6) = Node3(12)
Merge Node2(8) + Node3(12) = Root(20)
Tree:
           (20)
          /    \
       (8)      (12)
      /   \     /   \
    (3)  E(5) A(6)  C(6)
   /   \
 B(1)   D(2)
Codes:
- B: 000
- D: 001
- E: 01
- A: 10
- C: 11
Encoded: 10 10 10 10 10 10 000 11 11 11 11 11 11 001 001 01 01 01 01 01
Total bits: 6×2 + 1×3 + 6×2 + 2×3 + 5×2 = 12 + 3 + 12 + 6 + 10 = 43 bits
Fixed-length (3 bits per char since 5 unique chars need ⌈log₂5⌉=3 bits): 20 chars × 3 bits = 60 bits
Savings: 17 bits (28% reduction)
Variations
Canonical Huffman coding: Same lengths as Huffman but codes are assigned in a specific order, making it easier to transmit the tree.
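Canonical assignment keeps the code lengths from the example above (A=2, C=2, E=2, B=3, D=3) but hands out codes in a fixed order—sorted by (length, symbol), each code is the previous one plus one, shifted left when the length grows. A minimal sketch:

```python
def canonical_codes(lengths):
    """Assign canonical codes given each symbol's code length."""
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)        # extend the code when length grows
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes

print(canonical_codes({"A": 2, "C": 2, "E": 2, "B": 3, "D": 3}))
# {'A': '00', 'C': '01', 'E': '10', 'B': '110', 'D': '111'}
```

Because the codes are fully determined by the lengths, only the lengths need to be transmitted—this is the scheme DEFLATE uses.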
Adaptive Huffman coding: Updates the tree as you encode, useful for streaming data where you don't know frequencies upfront.
Deflate (used in ZIP, gzip): Combines Huffman coding with LZ77 (dictionary compression) for better compression.
When Huffman Shines
Text compression: Where certain characters are much more frequent.
Image compression (JPEG): After transforming image data, Huffman encodes the frequency coefficients.
Network protocols: Reduce bandwidth by compressing packet headers or payloads.
Morse code: Not Huffman, but same idea—frequent letters like 'E' get short codes (dot), rare ones get longer codes.
The Greedy Choice
At each step, merge the two smallest frequencies. Why is this greedy choice optimal?
Intuition: The least frequent characters should sit deepest in the tree (longest codes). Merging them first pushes them deep. Delaying their merge would leave them shallower, wasting short codes on rare characters while forcing frequent ones deeper.
Greedy ensures rare characters get long codes, frequent ones get short codes, minimizing total length.
Limitations
Fixed codes: Huffman builds codes based on known frequencies. If the actual data has different frequencies, compression suffers.
Overhead: You must transmit the tree (or code table) along with the encoded data. For very small files, this overhead can negate savings.
Not arithmetic coding: Arithmetic coding can achieve even better compression by encoding entire messages into a single number, but it's more complex.
For most practical purposes, Huffman is simple, fast, and effective.
Decoding Process
Given encoded bits and the Huffman tree:
- Start at root
- Read a bit: 0 → go left, 1 → go right
- If you hit a leaf, output its character and restart at root
- Repeat until all bits consumed
For "100110011001110110111" with our tree:
1: root→right (not a leaf)
0: →left (not a leaf)
0: →left = M. Output M, restart.
1: root→right
1: →right = I. Output I, restart.
0: root→left = S. Output S, restart.
0: root→left = S. Output S, restart.
1: root→right
1: →right = I. Output I, restart.
... continue
Result: MISSISSIPPI
The prefix-free property guarantees this process is unambiguous.
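The steps above can be sketched directly from the code table, inverting it rather than walking an explicit tree—the prefix-free property makes the two equivalent:

```python
def decode(bits, codes):
    """Decode a bit string using a prefix-free code table."""
    inverse = {code: char for char, code in codes.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:  # reached a "leaf"
            out.append(inverse[current])
            current = ""        # restart at the "root"
    return "".join(out)

codes = {"S": "0", "I": "11", "M": "100", "P": "101"}
print(decode("100110011001110110111", codes))  # MISSISSIPPI
```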
Why It's Called Greedy
At each step, Huffman makes the locally optimal choice: merge the two nodes with smallest frequency. This choice is irrevocable—you never reconsider.
Yet, this greedy strategy produces a globally optimal prefix-free code. That's the beauty of Huffman coding: simple, greedy, and optimal.
