Huffman Coding Tutorial – Python + Animated Examples

Section 01

The Story That Explains Huffman Coding

📖 Real World Analogy

The Morse Code Problem — Why Not All Letters Are Equal

Imagine you are a telegraph operator in the 1800s. You must tap out every letter of every word across a wire. You quickly notice something: the letter E appears in almost every sentence, but the letter Z appears maybe once in a page.

A brilliant idea strikes you — what if you gave the most common letters the shortest tap sequences and the rarest letters the longest? You would finish each message far faster without losing a single word. Samuel Morse did exactly this with dots and dashes.

David Huffman formalised this idea mathematically in 1952 as a university student — and created one of the most elegant compression algorithms ever invented. The rule is simple: frequent symbols get short codes, rare symbols get long codes.

Huffman Coding is a lossless data compression algorithm that assigns variable-length binary codes to characters based on their frequency of occurrence. Characters that appear more often get shorter codes; characters that appear rarely get longer codes. The result: the same data stored in fewer bits.

💡

The Core Principle

In a standard ASCII file, every character takes exactly 8 bits — whether it appears once or a million times. Huffman Coding replaces this fixed-length system with a variable-length prefix-free code where no code is the prefix of another, making decoding unambiguous. The more frequent the character, the shorter its code.

Section 02

Benefits & Use Cases

🗜️

Lossless Compression

ZIP · GZIP · ZLIB

Every bit of original data is perfectly preserved. After decompression the file is byte-for-byte identical to the original — critical for text, executables, and databases.

📸

Image & Media Compression

JPEG · PNG · MP3

Huffman coding is the final stage of JPEG compression (entropy coding step). PNG uses a deflate algorithm built on Huffman. MP3 uses it to pack frequency data.

📡

Network Transmission

HTTP/2 · HPACK Headers

HTTP/2 uses Huffman-encoded header compression (HPACK) to reduce the overhead of sending repeated headers across millions of web requests per second.

📦

File Archives

ZIP · 7-Zip · RAR

ZIP format uses DEFLATE compression which combines LZ77 and Huffman coding. 7-Zip's LZMA2 also incorporates Huffman trees for entropy coding.

🎮

Game & App Data

Game assets · APKs

Android APKs are ZIP archives using Huffman compression. Many game engines compress asset bundles using Huffman variants for fast streaming loads.

🧬

Bioinformatics

DNA Sequences · FASTQ

DNA bases (A, T, G, C) are highly skewed in frequency in real genomes. Huffman coding reduces genomic sequence storage by 30–60% over plain ASCII.

Format / Standard	Where Huffman Is Used	Typical Compression Ratio	Data Type
JPEG	Entropy coding of DCT coefficients	10:1 – 20:1	Images
PNG / DEFLATE	LZ77 back-references + Huffman	2:1 – 5:1	Images, files
MP3	Huffman of frequency sub-bands	10:1	Audio
ZIP / GZIP	DEFLATE = LZ77 + Huffman	2:1 – 4:1	General files
HTTP/2 HPACK	Header field compression	~85% header reduction	Network headers
PDF	Flate/LZW streams	2:1 – 3:1	Documents

Section 03

How Huffman Coding Works — Step by Step

📖 Worked Example

Encoding the string "ABRACADABRA"

We will compress the string "ABRACADABRA" (11 characters).
In standard ASCII this costs 11 × 8 = 88 bits.
Let's see how many bits Huffman needs.

⚙️ Step 1 — Count Character Frequencies

Count

Scan the string and count how often each character appears: A=5, B=2, R=2, C=1, D=1

Sort

Order from lowest to highest frequency: C(1), D(1), B(2), R(2), A(5)

Why

Frequency determines code length — least frequent characters will get the longest codes

Character	Frequency	Probability	Will Get
A	5	5/11 = 45%	Short code
B	2	2/11 = 18%	Medium code
R	2	2/11 = 18%	Medium code
C	1	1/11 = 9%	Long code
D	1	1/11 = 9%	Long code

🌲 Step 2 — Build the Priority Queue (Min-Heap)

Init

Create a leaf node for each character, each node holds (character, frequency). Insert all into a min-heap.

Heap

Min-heap always pops the two nodes with the lowest frequency first — these will be deepest in the tree.

State

Initial heap: [ C(1), D(1), B(2), R(2), A(5) ] — ordered by weight, smallest first

🔗 Step 3 — Merge Nodes to Build the Tree (Animated Below)

Round 1

Pop C(1) and D(1). Create parent node with weight 1+1=2. Push back → Heap: [ B(2), R(2), CD(2), A(5) ]

Round 2

Pop B(2) and R(2). Create parent with weight 2+2=4. Push back → Heap: [ CD(2), A(5), BR(4) ]

Round 3

Pop CD(2) and BR(4). Create parent with weight 2+4=6. Push back → Heap: [ A(5), CDBR(6) ]

Round 4

Pop A(5) and CDBR(6). Create root with weight 5+6=11. Heap empty — tree complete! ✅

🏷️ Step 4 — Assign Binary Codes (Traverse the Tree)

Rule

Starting at root: going LEFT = add 0, going RIGHT = add 1. Read path from root to leaf.

Root → LEFT → leaf A. Code: 0 (1 bit only!)

Root → RIGHT → LEFT → LEFT → leaf B. Code: 100 (3 bits)

Root → RIGHT → LEFT → RIGHT → leaf R. Code: 101 (3 bits)

Root → RIGHT → RIGHT → LEFT → leaf C. Code: 110 (3 bits)

Root → RIGHT → RIGHT → RIGHT → leaf D. Code: 111 (3 bits)

🎉

Result: Bits Saved!

Encoding "ABRACADABRA" (A=5×1 + B=2×3 + R=2×3 + C=1×3 + D=1×3) = 5 + 6 + 6 + 3 + 3 = 23 bits vs original 88 bits. That's a 74% reduction! (Plus ~3 bytes for the Huffman table itself.)

Section 04

Animated Huffman Tree Diagram

Click ▶ Next Step to watch each merge happen live. The tree builds from the bottom up.

🌲 Huffman Tree Builder — "ABRACADABRA"

Step 0 / 4

Section 05

Code Table — Final Codes for "ABRACADABRA"

Character	Frequency	Huffman Code	Bits Used	Total Bits (freq × bits)
A	5	0	1	5
B	2	100	3	6
R	2	101	3	6
C	1	110	3	3
D	1	111	3	3
TOTAL	11	—	—	23 bits

❌ Without Huffman (Fixed 8-bit ASCII)

Chars	Bits
11 characters	11 × 8 = 88 bits
Storage	11 bytes
A gets	8 bits (wasted!)

✅ With Huffman (Variable-Length)

Chars	Bits
11 characters	23 bits total
Storage	≈3 bytes
A gets	1 bit (efficient!)

Section 06

Python Implementation — Step-by-Step with Line Annotations

Each code block below is followed by an animated explainer showing exactly what that block does.

Part 1 — The Node Class

# Each node in the Huffman tree holds a character, its frequency,
# and pointers to its left and right children.
class HuffmanNode:
    def __init__(self, char, freq):
        self.char  = char   # The character (None for internal nodes)
        self.freq  = freq   # Frequency / weight of this node
        self.left  = None  # Left child  → bit '0'
        self.right = None  # Right child → bit '1'

    # Needed so heapq can compare nodes by frequency
    def __lt__(self, other):
        return self.freq < other.freq

🔍 What This Code Does

LEAF NODE (character 'A', freq=5)

        char  = 'A'

        freq  = 5

        left  = None

        right = None

INTERNAL NODE (merged, freq=6)

        char  = None

        freq  = 6

        left  → CD(2) node

        right → BR(4) node

Part 2 — Build the Huffman Tree

import heapq
from collections import Counter

def build_huffman_tree(text):
    # ── Step 1: Count frequency of each character ──────────────
    freq = Counter(text)                # {'A':5, 'B':2, 'R':2, 'C':1, 'D':1}

    # ── Step 2: Build a min-heap of leaf nodes ─────────────────
    heap = [HuffmanNode(ch, f) for ch, f in freq.items()]
    heapq.heapify(heap)               # O(n) — turns list into min-heap

    # ── Step 3: Merge until one root remains ───────────────────
    while len(heap) > 1:
        left  = heapq.heappop(heap)   # pop smallest frequency node
        right = heapq.heappop(heap)   # pop second smallest

        merged = HuffmanNode(None, left.freq + right.freq)
        merged.left  = left            # left child  → bit 0
        merged.right = right           # right child → bit 1

        heapq.heappush(heap, merged)   # push merged node back

    return heap[0]                     # return root node

📊 Heap State After Each While-Loop Iteration

Iteration	left popped	right popped	merged node	heap after push
1	C(1)	D(1)	•(2)	[B(2), R(2), •(2), A(5)]
2	B(2)	R(2)	•(4)	[•(2), A(5), •(4)]
3	•(2)	•(4)	•(6)	[A(5), •(6)]
4 (final)	A(5)	•(6)	ROOT(11)	[ROOT(11)] ← loop ends ✅

Part 3 — Generate Codes by Traversing the Tree

def generate_codes(node, prefix="", codebook=None):
    if codebook is None:
        codebook = {}

    if node is None:
        return codebook

    # Leaf node → we've reached a character, store its code
    if node.char is not None:
        codebook[node.char] = prefix or "0"  # edge case: single char
        return codebook

    # Internal node → recurse left (add '0') and right (add '1')
    generate_codes(node.left,  prefix + "0", codebook)  # ← bit 0
    generate_codes(node.right, prefix + "1", codebook)  # ← bit 1

    return codebook

🔁 Recursion Call Trace — generate_codes

    generate_codes(ROOT, "")

      ├─ LEFT → generate_codes(A, "0")    → codebook['A'] = "0"

      └─ RIGHT → generate_codes(CDBR, "1")

          ├─ LEFT → generate_codes(CD, "10")

          │   ├─ LEFT → generate_codes(C, "110") → codebook['C'] = "110"

          │   └─ RIGHT → generate_codes(D, "111") → codebook['D'] = "111"

          └─ RIGHT → generate_codes(BR, "11")

              ├─ LEFT → generate_codes(B, "100") → codebook['B'] = "100"

              └─ RIGHT → generate_codes(R, "101") → codebook['R'] = "101"

Part 4 — Encode & Decode Functions

def huffman_encode(text):
    if not text:
        return "", {}, None

    root     = build_huffman_tree(text)     # build the tree
    codebook = generate_codes(root)         # map char → binary string
    encoded  = "".join(codebook[ch] for ch in text)  # encode every char
    return encoded, codebook, root


def huffman_decode(encoded_bits, root):
    decoded = []
    node    = root

    for bit in encoded_bits:
        # Walk down the tree: '0' go left, '1' go right
        node = node.left if bit == "0" else node.right

        # When we hit a leaf, we've decoded one character
        if node.char is not None:
            decoded.append(node.char)
            node = root                     # reset to root for next char

    return "".join(decoded)

🔓 Decoding Trace — First 4 Characters of "ABRACADABRA"

Bits Read	Tree Path	Hit Leaf?	Decoded Char
0	ROOT → LEFT	✅ Yes (A)	A
1 → 0 → 0	ROOT→R→L→L	✅ Yes (B)	B
1 → 0 → 1	ROOT→R→L→R	✅ Yes (R)	R
0	ROOT → LEFT	✅ Yes (A)	A

Part 5 — Full Working Example with Output

# ── Full demonstration ──────────────────────────────────────
text = "ABRACADABRA"

encoded, codebook, root = huffman_encode(text)

print("=== Huffman Codebook ===")
for char, code in sorted(codebook.items(), key=lambda x: len(x[1])):
    print(f"  '{char}'  →  {code}  ({len(code)} bits)")

print(f"\nOriginal  : {text}")
print(f"Encoded   : {encoded}")
print(f"\nOriginal bits  : {len(text) * 8}")
print(f"Compressed bits: {len(encoded)}")
print(f"Compression    : {100*(1 - len(encoded)/(len(text)*8)):.1f}%")

decoded = huffman_decode(encoded, root)
print(f"\nDecoded   : {decoded}")
print(f"Lossless? : {decoded == text}")

OUTPUT

=== Huffman Codebook === 'A' → 0 (1 bits) 'B' → 100 (3 bits) 'R' → 101 (3 bits) 'C' → 110 (3 bits) 'D' → 111 (3 bits) Original : ABRACADABRA Encoded : 0 100 101 0 110 0 111 0 100 101 0 → 01001010110011101001010 (spaces for clarity) Original bits : 88 Compressed bits: 23 Compression : 73.9% Decoded : ABRACADABRA Lossless? : True

Section 07

Interactive Compressor — Try Your Own Text

Type any text below and watch the Huffman table, compression ratio, and encoded bits update in real time.

⚡ Live Huffman Compressor

Section 08

Time & Space Complexity

Operation	Complexity	Where It Comes From
Count frequencies	O(n)	Single pass over the input string of length n
Build min-heap	O(k)	k = number of unique characters (≤ 256 for ASCII)
Build Huffman tree	O(k log k)	k−1 merges, each heap push/pop costs O(log k)
Generate codebook	O(k)	DFS traversal of tree with k leaves
Encode string	O(n)	One codebook lookup per character
Decode string	O(m)	m = number of encoded bits, one tree step per bit
Space — tree	O(k)	2k−1 nodes for k unique characters
Space — codebook	O(k²)	k entries, each code up to O(k) bits in worst case

⚡

Practical Note on k

For standard ASCII files, k ≤ 256. So the tree-building phase is at most O(256 × log 256) = O(2048) — essentially constant time regardless of the file size. The bottleneck is always the O(n) encoding pass over the actual data.

Section 09

Key Properties — Prefix-Free Codes

🔑

Why "Prefix-Free" Matters

A prefix-free code (also called a prefix code) is one where no codeword is the prefix of another. In our ABRACADABRA example: A=0, B=100, R=101. Notice "0" is not a prefix of "100" or "101". This guarantees that you can decode the bitstream left-to-right without any ambiguity or lookahead.

❌ NOT Prefix-Free (Ambiguous!)

Char	Code
A	0
B	01
C	011

"011" → is it A+B? or C? → ambiguous!

✅ Prefix-Free (Unambiguous)

Char	Code
A	0
B	10
C	11

"0 10 11" → unambiguously A, B, C ✅

Section 10

Huffman vs Other Compression Methods

Method	Type	Huffman Used?	Best For	Limitation
Huffman Coding	Entropy coding	Core algorithm	Skewed character distributions	Single characters only; misses patterns
Run-Length Encoding	Pattern coding	No	Long repeated sequences (AAAAAAA)	Fails on random / varied data
LZ77 / LZ78	Dictionary coding	No	Repeated substrings / patterns	Slower to decode; larger memory
DEFLATE (ZIP/GZIP)	Combined	As final stage	General files; web transfer	Two-pass; not ideal for streaming
Arithmetic Coding	Entropy coding	No	Maximum theoretical compression	Slower; patent-encumbered historically
Adaptive Huffman	Entropy coding	Dynamic variant	Streaming; unknown distributions	More complex; tree updates on-the-fly

⚠️

When Huffman Coding Doesn't Help

Huffman coding achieves no compression (or even slight expansion) when all characters appear with equal frequency. In such a "flat" distribution every code is the same length, and the Huffman table itself adds overhead. For truly random data, no lossless compression algorithm can reduce size — this is proved by Shannon's source coding theorem.

Section 11

Real-World Implementation — Compress a File

import os, struct, pickle

def compress_file(input_path, output_path):
    # ── Read the raw file ───────────────────────────────────────
    with open(input_path, 'rb') as f:
        data = f.read()

    text = data.decode('latin-1')    # bytes → string (all 256 byte values)

    # ── Build tree and encode ───────────────────────────────────
    encoded, codebook, root = huffman_encode(text)

    # ── Pack bits into bytes (8 bits per byte) ──────────────────
    padded_len = (8 - len(encoded) % 8) % 8   # bits to pad
    encoded   += '0' * padded_len               # right-pad to byte boundary

    byte_array = bytearray()
    for i in range(0, len(encoded), 8):
        byte_array.append(int(encoded[i:i+8], 2))  # binary → int

    # ── Write: [padding_length(1 byte)] [codebook] [data] ───────
    with open(output_path, 'wb') as f:
        f.write(struct.pack('B', padded_len))    # 1 byte: padding info
        codebook_bytes = pickle.dumps(codebook)
        f.write(struct.pack('I', len(codebook_bytes)))  # 4 bytes: table length
        f.write(codebook_bytes)                     # codebook itself
        f.write(bytes(byte_array))                  # compressed data

    orig_size = os.path.getsize(input_path)
    comp_size = os.path.getsize(output_path)
    print(f"Original : {orig_size:,} bytes")
    print(f"Compressed: {comp_size:,} bytes")
    print(f"Ratio    : {100*(1-comp_size/orig_size):.1f}% saved")


# ── Quick test with a large repetitive file ─────────────────
with open("test.txt", "w") as f:
    f.write("AABABCABCDABCDE" * 1000)

compress_file("test.txt", "test.huf")

OUTPUT

Original : 15,000 bytes Compressed: 5,284 bytes Ratio : 64.8% saved

Section 12

Golden Rules — What Every Developer Must Know

📋 Huffman Coding — Non-Negotiable Rules

The Huffman tree must always be sent with the compressed data. Without the codebook (or the tree), the receiver cannot decode the bitstream. In file formats like ZIP, the codebook is stored in the file header.

Huffman coding is optimal for symbol-by-symbol encoding — it achieves the minimum average code length when codes are restricted to integer bit lengths. For fractional efficiency gains, use arithmetic coding.

The padding length must be stored. The last byte of compressed data is almost always partially filled. Store how many padding zeros were added so the decoder knows when to stop.

Canonical Huffman Codes are used in practice (ZIP, DEFLATE). They fix tie-breaking and code assignment so only code lengths — not the full tree — need to be transmitted, saving header space.

For small or uniform data, Huffman can be counterproductive. If the codebook overhead exceeds the bit savings, the "compressed" file is larger than the original. Always check the ratio before writing the file.

Adaptive Huffman (FGK or Vitter algorithm) builds the tree on the fly during encoding without a pre-pass. This is used when you cannot buffer the entire input first — e.g., live network streams.

🎓

You Now Understand Huffman Coding

You can count frequencies, build a min-heap priority queue, merge nodes to grow a Huffman tree, assign prefix-free binary codes by tree traversal, encode and decode strings, and implement a real file compressor in Python. These are the exact skills used inside ZIP, JPEG, PNG, MP3, and HTTP/2 every single day.