- Understand how Huffman builds prefix-free codes by merging smallest frequencies
- See why frequent characters get short codes, rare ones get long codes
- Learn why the greedy choice (merge smallest) produces optimal compression
Huffman Coding
You're sending the text "MISSISSIPPI" over a network. Each character in ASCII takes 8 bits. That's 11 characters × 8 bits = 88 bits total.
But look at the frequencies:
- I: 4 times
- S: 4 times
- P: 2 times
- M: 1 time
What if frequent characters used shorter codes? 'I' could be "0", 'S' could be "10", 'P' could be "110", 'M' could be "111".
Encoded: "111 0 10 10 0 10 10 0 110 110 0" = "111010100101001101100" (21 bits—way less than 88!)
That's Huffman coding. Build variable-length codes where frequent characters get short codes, rare ones get long codes.
The Problem: Uniquely Decodable Codes
You can't just assign random variable-length codes. "I"=0, "S"=1, "P"=10 has a problem.
Encode "IP": 0 + 10 = "010"
Decode "010": Could be "I" + "S" + "I" (0, 1, 0) or "I" + "P" (0, 10). Ambiguous!
The solution: prefix-free codes. No code is a prefix of another. If "I"=0, then nothing else can start with 0.
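To make the ambiguity concrete, here is a small sketch (the helper `all_decodings` is illustrative, not from any library) that enumerates every way to parse a bit string under a code table:

```python
def all_decodings(bits, codes):
    """Return every way to parse a bit string under a code table."""
    if bits == "":
        return [[]]
    results = []
    for char, code in codes.items():
        if bits.startswith(code):
            for rest in all_decodings(bits[len(code):], codes):
                results.append([char] + rest)
    return results

# Non-prefix-free: "I"=0 is a prefix of nothing, but "S"=1 is a prefix of "P"=10
ambiguous = {"I": "0", "S": "1", "P": "10"}
print(all_decodings("010", ambiguous))    # two parses: [I, S, I] and [I, P]

# Prefix-free: no code is a prefix of another
prefix_free = {"I": "0", "S": "10", "P": "110", "M": "111"}
print(all_decodings("010", prefix_free))  # exactly one parse: [I, S]
```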
Huffman builds prefix-free codes by using a binary tree. Each character is a leaf. Left edges = 0, right edges = 1. Codes are paths from root to leaves.
Building the Huffman Tree
For "MISSISSIPPI":
Frequencies: M=1, P=2, I=4, S=4
Greedy strategy: repeatedly merge the two smallest frequencies.
Step 1: M(1) and P(2) are smallest. Merge them into a node with frequency 3.
     (3)
    /   \
  M(1)   P(2)
Remaining: I(4), S(4), Node(3)
Step 2: Node(3) and I(4) are smallest. Merge into frequency 7.
       (7)
      /   \
    (3)   I(4)
   /   \
 M(1)   P(2)
Remaining: S(4), Node(7)
Step 3: Merge S(4) and Node(7) into root with frequency 11.
        (11)
       /    \
    S(4)    (7)
           /   \
         (3)   I(4)
        /   \
      M(1)   P(2)
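The three merge steps can be reproduced with Python's heapq (a sketch; the labels printed for internal nodes are made up for readability):

```python
import heapq

freq = {"M": 1, "P": 2, "I": 4, "S": 4}
heap = [(count, name) for name, count in freq.items()]
heapq.heapify(heap)

# Repeatedly merge the two smallest frequencies
while len(heap) > 1:
    f1, n1 = heapq.heappop(heap)  # smallest frequency
    f2, n2 = heapq.heappop(heap)  # second smallest
    print(f"merge {n1}({f1}) + {n2}({f2}) -> ({f1 + f2})")
    heapq.heappush(heap, (f1 + f2, f"({n1}+{n2})"))
```

This prints the same merge order as the steps above: M+P first, then that node with I, then S with the rest, leaving a single root of frequency 11.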
Assign codes by paths:
- S: left from root = 0
- I: right, then right = 11
- P: right, then left, then right = 101
- M: right, then left, then left = 100
Codes: S=0, I=11, M=100, P=101
Encode "MISSISSIPPI":
M I S S I S S I P P I
100 11 0 0 11 0 0 11 101 101 11
= 100110011001110110111
21 bits (vs 88 bits in ASCII).
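You can check the 21-bit total directly from the frequency table: each character contributes frequency × code length.

```python
freq = {"I": 4, "S": 4, "P": 2, "M": 1}
code_len = {"S": 1, "I": 2, "M": 3, "P": 3}

# 4*2 + 4*1 + 2*3 + 1*3 = 8 + 4 + 6 + 3 = 21
total = sum(freq[c] * code_len[c] for c in freq)
print(total)  # 21
```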
Decode: Start at root, follow bits. When you hit a leaf, output that character and restart at root.
100: root→right→left→left = M
11: root→right→right = I
0: root→left = S
0: root→left = S
11: root→right→right = I
... and so on
Prefix-free property ensures unique decoding.
The Algorithm
import heapq
import itertools

def huffman_coding(text):
    # Count frequencies
    freq = {}
    for char in text:
        freq[char] = freq.get(char, 0) + 1

    # Build min heap of (frequency, tiebreaker, node), where node = (char, left, right).
    # The tiebreaker keeps heapq from comparing nodes directly when frequencies tie.
    tiebreak = itertools.count()
    heap = []
    for char, count in freq.items():
        heapq.heappush(heap, (count, next(tiebreak), (char, None, None)))

    # Build Huffman tree: repeatedly merge the two smallest frequencies
    while len(heap) > 1:
        freq1, _, node1 = heapq.heappop(heap)
        freq2, _, node2 = heapq.heappop(heap)
        merged = (None, node1, node2)  # internal node: no character, two children
        heapq.heappush(heap, (freq1 + freq2, next(tiebreak), merged))

    # Root of tree
    _, _, root = heap[0]

    # Generate codes by traversing tree
    codes = {}
    def generate_codes(node, code):
        char, left, right = node
        if char is not None:  # Leaf node
            codes[char] = code or '0'  # a one-character text still needs one bit
        else:
            generate_codes(left, code + '0')
            generate_codes(right, code + '1')
    generate_codes(root, '')
    return codes

# Encode
codes = huffman_coding("MISSISSIPPI")
encoded = ''.join(codes[c] for c in "MISSISSIPPI")

# Decode (given tree/codes)
# Traverse tree following bits until hitting a leaf
Time: O(N + k log k), where N is the length of the text (counting frequencies) and k is the number of unique characters (heap operations for building the tree).
Why This is Optimal
Huffman coding produces the optimal prefix-free code—no other prefix-free code can achieve shorter average length.
Proof sketch: The two least frequent characters should have the longest codes and differ only in the last bit (siblings in the tree). If not, you could swap them to reduce total length. Huffman's greedy choice (merge smallest) ensures this.
By induction, if the greedy choice is optimal for the smallest pair, and the remaining problem has the same structure, Huffman builds the optimal tree.
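A quick numeric check of the exchange argument, using the MISSISSIPPI frequencies: if a rare character sat higher in the tree (shorter code) than a frequent one, swapping their depths could only shrink the total. The two length assignments below are illustrative; both satisfy the prefix-free length constraint.

```python
freq = {"I": 4, "S": 4, "P": 2, "M": 1}

# Suboptimal: rare M gets the short code, frequent I a long one
bad_lengths  = {"M": 1, "S": 2, "P": 3, "I": 3}
# After swapping M's and I's depths, as the exchange argument suggests
good_lengths = {"I": 1, "S": 2, "P": 3, "M": 3}

bad  = sum(freq[c] * bad_lengths[c]  for c in freq)  # 1 + 8 + 6 + 12 = 27
good = sum(freq[c] * good_lengths[c] for c in freq)  # 4 + 8 + 6 + 3  = 21
print(bad, good)  # 27 21
```

The swap saves 6 bits, and 21 matches the Huffman total—exactly what optimality requires.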
Example: Compressing "AAAAAABCCCCCCDDEEEEE"
Frequencies:
A: 6
C: 6
E: 5
D: 2
B: 1
Build tree by merging smallest:
Merge B(1) + D(2) = Node1(3)
Merge Node1(3) + E(5) = Node2(8)
Merge A(6) + C(6) = Node3(12)
Merge Node2(8) + Node3(12) = Root(20)
Tree:
           (20)
          /    \
       (8)      (12)
      /   \     /   \
    (3)  E(5) A(6)  C(6)
   /   \
 B(1)   D(2)
Codes:
- B: 000
- D: 001
- E: 01
- A: 10
- C: 11
Encoded: 10 10 10 10 10 10 000 11 11 11 11 11 11 001 001 01 01 01 01 01
Total bits: 6×2 + 1×3 + 6×2 + 2×3 + 5×2 = 12 + 3 + 12 + 6 + 10 = 43 bits
Fixed-length (3 bits per char since 5 unique chars need ⌈log₂5⌉=3 bits): 20 chars × 3 bits = 60 bits
Savings: 17 bits (28% reduction)
Variations
Canonical Huffman coding: Same lengths as Huffman but codes are assigned in a specific order, making it easier to transmit the tree.
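Canonical assignment keeps the code lengths from the example above (A=2, C=2, E=2, B=3, D=3) but hands out codes in a fixed order—sorted by (length, symbol), each code is the previous one plus one, shifted left when the length grows. A minimal sketch:

```python
def canonical_codes(lengths):
    """Assign canonical codes given each symbol's code length."""
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)        # extend the code when length grows
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes

print(canonical_codes({"A": 2, "C": 2, "E": 2, "B": 3, "D": 3}))
# {'A': '00', 'C': '01', 'E': '10', 'B': '110', 'D': '111'}
```

Because the codes are fully determined by the lengths, only the lengths need to be transmitted—this is the scheme DEFLATE uses.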
Adaptive Huffman coding: Updates the tree as you encode, useful for streaming data where you don't know frequencies upfront.
Deflate (used in ZIP, gzip): Combines Huffman coding with LZ77 (dictionary compression) for better compression.
When Huffman Shines
Text compression: Where certain characters are much more frequent.
Image compression (JPEG): After transforming image data, Huffman encodes the frequency coefficients.
Network protocols: Reduce bandwidth by compressing packet headers or payloads.
Morse code: Not Huffman, but same idea—frequent letters like 'E' get short codes (dot), rare ones get longer codes.
The Greedy Choice
At each step, merge the two smallest frequencies. Why is this greedy choice optimal?
Intuition: The least frequent characters should sit deepest in the tree (longest codes). Merging them first pushes them deep. Delaying their merge would leave them shallower, wasting short codes on rare characters while forcing frequent ones deeper.
Greedy ensures rare characters get long codes, frequent ones get short codes, minimizing total length.
Limitations
Fixed codes: Huffman builds codes based on known frequencies. If the actual data has different frequencies, compression suffers.
Overhead: You must transmit the tree (or code table) along with the encoded data. For very small files, this overhead can negate savings.
Not arithmetic coding: Arithmetic coding can achieve even better compression by encoding entire messages into a single number, but it's more complex.
For most practical purposes, Huffman is simple, fast, and effective.
Decoding Process
Given encoded bits and the Huffman tree:
- Start at root
- Read a bit: 0 → go left, 1 → go right
- If you hit a leaf, output its character and restart at root
- Repeat until all bits consumed
For "100110011001110110111" with our tree:
1: root→right (not a leaf)
0: →left (not a leaf)
0: →left = M. Output M, restart.
1: root→right
1: →right = I. Output I, restart.
0: root→left = S. Output S, restart.
0: root→left = S. Output S, restart.
1: root→right
1: →right = I. Output I, restart.
... continue
Result: MISSISSIPPI
The prefix-free property guarantees this process is unambiguous.
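The steps above can be sketched directly from the code table, inverting it rather than walking an explicit tree—the prefix-free property makes the two equivalent:

```python
def decode(bits, codes):
    """Decode a bit string using a prefix-free code table."""
    inverse = {code: char for char, code in codes.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:  # reached a "leaf"
            out.append(inverse[current])
            current = ""        # restart at the "root"
    return "".join(out)

codes = {"S": "0", "I": "11", "M": "100", "P": "101"}
print(decode("100110011001110110111", codes))  # MISSISSIPPI
```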
Why It's Called Greedy
At each step, Huffman makes the locally optimal choice: merge the two nodes with smallest frequency. This choice is irrevocable—you never reconsider.
Yet, this greedy strategy produces a globally optimal prefix-free code. That's the beauty of Huffman coding: simple, greedy, and optimal.
