Data Structure 📂 Advance Algorithms · 3 of 3 45 min read

Huffman Coding — Lossless Compression from Scratch

A complete guide to Huffman Coding — how it works, why it matters, and how to implement it in Python. Covers the algorithm step-by-step with an interactive animated tree builder, a live compressor playground, full code with line-by-line annotations, worked examples, tables, and real-world use cases in ZIP, JPEG, PNG, MP3, and HTTP/2.

Section 01

The Story That Explains Huffman Coding

The Morse Code Problem — Why Not All Letters Are Equal
Imagine you are a telegraph operator in the 1800s. You must tap out every letter of every word across a wire. You quickly notice something: the letter E appears in almost every sentence, but the letter Z appears maybe once in a page.

A brilliant idea strikes you — what if you gave the most common letters the shortest tap sequences and the rarest letters the longest? You would finish each message far faster without losing a single word. Samuel Morse did exactly this with dots and dashes.

David Huffman formalised this idea mathematically in 1952 as a university student — and created one of the most elegant compression algorithms ever invented. The rule is simple: frequent symbols get short codes, rare symbols get long codes.

Huffman Coding is a lossless data compression algorithm that assigns variable-length binary codes to characters based on their frequency of occurrence. Characters that appear more often get shorter codes; characters that appear rarely get longer codes. The result: the same data stored in fewer bits.

💡
The Core Principle

In a standard ASCII file, every character takes exactly 8 bits — whether it appears once or a million times. Huffman Coding replaces this fixed-length system with a variable-length prefix-free code where no code is the prefix of another, making decoding unambiguous. The more frequent the character, the shorter its code.


Section 02

Benefits & Use Cases

🗜️
Lossless Compression
ZIP · GZIP · ZLIB
Every bit of original data is perfectly preserved. After decompression the file is byte-for-byte identical to the original — critical for text, executables, and databases.
📸
Image & Media Compression
JPEG · PNG · MP3
Huffman coding is the final stage of JPEG compression (entropy coding step). PNG uses a deflate algorithm built on Huffman. MP3 uses it to pack frequency data.
📡
Network Transmission
HTTP/2 · HPACK Headers
HTTP/2 uses Huffman-encoded header compression (HPACK) to reduce the overhead of sending repeated headers across millions of web requests per second.
📦
File Archives
ZIP · 7-Zip · RAR
ZIP format uses DEFLATE compression which combines LZ77 and Huffman coding. 7-Zip's LZMA2 also incorporates Huffman trees for entropy coding.
🎮
Game & App Data
Game assets · APKs
Android APKs are ZIP archives using Huffman compression. Many game engines compress asset bundles using Huffman variants for fast streaming loads.
🧬
Bioinformatics
DNA Sequences · FASTQ
DNA bases (A, T, G, C) are highly skewed in frequency in real genomes. Huffman coding reduces genomic sequence storage by 30–60% over plain ASCII.
Format / Standard Where Huffman Is Used Typical Compression Ratio Data Type
JPEGEntropy coding of DCT coefficients10:1 – 20:1Images
PNG / DEFLATELZ77 back-references + Huffman2:1 – 5:1Images, files
MP3Huffman of frequency sub-bands10:1Audio
ZIP / GZIPDEFLATE = LZ77 + Huffman2:1 – 4:1General files
HTTP/2 HPACKHeader field compression~85% header reductionNetwork headers
PDFFlate/LZW streams2:1 – 3:1Documents

Section 03

How Huffman Coding Works — Step by Step

Encoding the string "ABRACADABRA"
We will compress the string "ABRACADABRA" (11 characters).
In standard ASCII this costs 11 × 8 = 88 bits.
Let's see how many bits Huffman needs.
⚙️ Step 1 — Count Character Frequencies
Count
Scan the string and count how often each character appears: A=5, B=2, R=2, C=1, D=1
Sort
Order from lowest to highest frequency: C(1), D(1), B(2), R(2), A(5)
Why
Frequency determines code length — least frequent characters will get the longest codes
CharacterFrequencyProbabilityWill Get
A55/11 = 45%Short code
B22/11 = 18%Medium code
R22/11 = 18%Medium code
C11/11 = 9%Long code
D11/11 = 9%Long code
🌲 Step 2 — Build the Priority Queue (Min-Heap)
Init
Create a leaf node for each character, each node holds (character, frequency). Insert all into a min-heap.
Heap
Min-heap always pops the two nodes with the lowest frequency first — these will be deepest in the tree.
State
Initial heap: [ C(1), D(1), B(2), R(2), A(5) ] — ordered by weight, smallest first
🔗 Step 3 — Merge Nodes to Build the Tree (Animated Below)
Round 1
Pop C(1) and D(1). Create parent node with weight 1+1=2. Push back → Heap: [ B(2), R(2), CD(2), A(5) ]
Round 2
Pop B(2) and R(2). Create parent with weight 2+2=4. Push back → Heap: [ CD(2), A(5), BR(4) ]
Round 3
Pop CD(2) and BR(4). Create parent with weight 2+4=6. Push back → Heap: [ A(5), CDBR(6) ]
Round 4
Pop A(5) and CDBR(6). Create root with weight 5+6=11. Heap empty — tree complete! ✅
🏷️ Step 4 — Assign Binary Codes (Traverse the Tree)
Rule
Starting at root: going LEFT = add 0, going RIGHT = add 1. Read path from root to leaf.
A
Root → LEFT → leaf A. Code: 0 (1 bit only!)
B
Root → RIGHT → LEFT → LEFT → leaf B. Code: 100 (3 bits)
R
Root → RIGHT → LEFT → RIGHT → leaf R. Code: 101 (3 bits)
C
Root → RIGHT → RIGHT → LEFT → leaf C. Code: 110 (3 bits)
D
Root → RIGHT → RIGHT → RIGHT → leaf D. Code: 111 (3 bits)
🎉
Result: Bits Saved!

Encoding "ABRACADABRA" (A=5×1 + B=2×3 + R=2×3 + C=1×3 + D=1×3) = 5 + 6 + 6 + 3 + 3 = 23 bits vs original 88 bits. That's a 74% reduction! (Plus ~3 bytes for the Huffman table itself.)


Section 04

Animated Huffman Tree Diagram

Click ▶ Next Step to watch each merge happen live. The tree builds from the bottom up.

🌲 Huffman Tree Builder — "ABRACADABRA"
Step 0 / 4

Section 05

Code Table — Final Codes for "ABRACADABRA"

CharacterFrequencyHuffman CodeBits UsedTotal Bits (freq × bits)
A5015
B210036
R210136
C111033
D111133
TOTAL1123 bits
❌ Without Huffman (Fixed 8-bit ASCII)
CharsBits
11 characters11 × 8 = 88 bits
Storage11 bytes
A gets8 bits (wasted!)
✅ With Huffman (Variable-Length)
CharsBits
11 characters23 bits total
Storage≈3 bytes
A gets1 bit (efficient!)

Section 06

Python Implementation — Step-by-Step with Line Annotations

Each code block below is followed by an animated explainer showing exactly what that block does.

Part 1 — The Node Class

# Each node in the Huffman tree holds a character, its frequency,
# and pointers to its left and right children.
class HuffmanNode:
    def __init__(self, char, freq):
        self.char  = char   # The character (None for internal nodes)
        self.freq  = freq   # Frequency / weight of this node
        self.left  = None  # Left child  → bit '0'
        self.right = None  # Right child → bit '1'

    # Needed so heapq can compare nodes by frequency
    def __lt__(self, other):
        return self.freq < other.freq
🔍 What This Code Does
LEAF NODE (character 'A', freq=5)
char = 'A'
freq = 5
left = None
right = None
INTERNAL NODE (merged, freq=6)
char = None
freq = 6
left → CD(2) node
right → BR(4) node

Part 2 — Build the Huffman Tree

import heapq
from collections import Counter

def build_huffman_tree(text):
    # ── Step 1: Count frequency of each character ──────────────
    freq = Counter(text)                # {'A':5, 'B':2, 'R':2, 'C':1, 'D':1}

    # ── Step 2: Build a min-heap of leaf nodes ─────────────────
    heap = [HuffmanNode(ch, f) for ch, f in freq.items()]
    heapq.heapify(heap)               # O(n) — turns list into min-heap

    # ── Step 3: Merge until one root remains ───────────────────
    while len(heap) > 1:
        left  = heapq.heappop(heap)   # pop smallest frequency node
        right = heapq.heappop(heap)   # pop second smallest

        merged = HuffmanNode(None, left.freq + right.freq)
        merged.left  = left            # left child  → bit 0
        merged.right = right           # right child → bit 1

        heapq.heappush(heap, merged)   # push merged node back

    return heap[0]                     # return root node
📊 Heap State After Each While-Loop Iteration
Iteration left popped right popped merged node heap after push
1 C(1) D(1) •(2) [B(2), R(2), •(2), A(5)]
2 B(2) R(2) •(4) [•(2), A(5), •(4)]
3 •(2) •(4) •(6) [A(5), •(6)]
4 (final) A(5) •(6) ROOT(11) [ROOT(11)] ← loop ends ✅

Part 3 — Generate Codes by Traversing the Tree

def generate_codes(node, prefix="", codebook=None):
    if codebook is None:
        codebook = {}

    if node is None:
        return codebook

    # Leaf node → we've reached a character, store its code
    if node.char is not None:
        codebook[node.char] = prefix or "0"  # edge case: single char
        return codebook

    # Internal node → recurse left (add '0') and right (add '1')
    generate_codes(node.left,  prefix + "0", codebook)  # ← bit 0
    generate_codes(node.right, prefix + "1", codebook)  # ← bit 1

    return codebook
🔁 Recursion Call Trace — generate_codes
generate_codes(ROOT, "")
  ├─ LEFT → generate_codes(A, "0")    → codebook['A'] = "0"
  └─ RIGHT → generate_codes(CDBR, "1")
      ├─ LEFT → generate_codes(CD, "10")
      │   ├─ LEFT → generate_codes(C, "110") → codebook['C'] = "110"
      │   └─ RIGHT → generate_codes(D, "111") → codebook['D'] = "111"
      └─ RIGHT → generate_codes(BR, "11")
          ├─ LEFT → generate_codes(B, "100") → codebook['B'] = "100"
          └─ RIGHT → generate_codes(R, "101") → codebook['R'] = "101"

Part 4 — Encode & Decode Functions

def huffman_encode(text):
    if not text:
        return "", {}, None

    root     = build_huffman_tree(text)     # build the tree
    codebook = generate_codes(root)         # map char → binary string
    encoded  = "".join(codebook[ch] for ch in text)  # encode every char
    return encoded, codebook, root


def huffman_decode(encoded_bits, root):
    decoded = []
    node    = root

    for bit in encoded_bits:
        # Walk down the tree: '0' go left, '1' go right
        node = node.left if bit == "0" else node.right

        # When we hit a leaf, we've decoded one character
        if node.char is not None:
            decoded.append(node.char)
            node = root                     # reset to root for next char

    return "".join(decoded)
🔓 Decoding Trace — First 4 Characters of "ABRACADABRA"
Bits Read Tree Path Hit Leaf? Decoded Char
0 ROOT → LEFT ✅ Yes (A) A
1 → 0 → 0 ROOT→R→L→L ✅ Yes (B) B
1 → 0 → 1 ROOT→R→L→R ✅ Yes (R) R
0 ROOT → LEFT ✅ Yes (A) A

Part 5 — Full Working Example with Output

# ── Full demonstration ──────────────────────────────────────
text = "ABRACADABRA"

encoded, codebook, root = huffman_encode(text)

print("=== Huffman Codebook ===")
for char, code in sorted(codebook.items(), key=lambda x: len(x[1])):
    print(f"  '{char}'  →  {code}  ({len(code)} bits)")

print(f"\nOriginal  : {text}")
print(f"Encoded   : {encoded}")
print(f"\nOriginal bits  : {len(text) * 8}")
print(f"Compressed bits: {len(encoded)}")
print(f"Compression    : {100*(1 - len(encoded)/(len(text)*8)):.1f}%")

decoded = huffman_decode(encoded, root)
print(f"\nDecoded   : {decoded}")
print(f"Lossless? : {decoded == text}")
OUTPUT
=== Huffman Codebook === 'A' → 0 (1 bits) 'B' → 100 (3 bits) 'R' → 101 (3 bits) 'C' → 110 (3 bits) 'D' → 111 (3 bits) Original : ABRACADABRA Encoded : 0 100 101 0 110 0 111 0 100 101 0 → 01001010110011101001010 (spaces for clarity) Original bits : 88 Compressed bits: 23 Compression : 73.9% Decoded : ABRACADABRA Lossless? : True

Section 07

Interactive Compressor — Try Your Own Text

Type any text below and watch the Huffman table, compression ratio, and encoded bits update in real time.

⚡ Live Huffman Compressor

Section 08

Time & Space Complexity

OperationComplexityWhere It Comes From
Count frequencies O(n) Single pass over the input string of length n
Build min-heap O(k) k = number of unique characters (≤ 256 for ASCII)
Build Huffman tree O(k log k) k−1 merges, each heap push/pop costs O(log k)
Generate codebook O(k) DFS traversal of tree with k leaves
Encode string O(n) One codebook lookup per character
Decode string O(m) m = number of encoded bits, one tree step per bit
Space — tree O(k) 2k−1 nodes for k unique characters
Space — codebook O(k²) k entries, each code up to O(k) bits in worst case
Practical Note on k

For standard ASCII files, k ≤ 256. So the tree-building phase is at most O(256 × log 256) = O(2048) — essentially constant time regardless of the file size. The bottleneck is always the O(n) encoding pass over the actual data.


Section 09

Key Properties — Prefix-Free Codes

🔑
Why "Prefix-Free" Matters

A prefix-free code (also called a prefix code) is one where no codeword is the prefix of another. In our ABRACADABRA example: A=0, B=100, R=101. Notice "0" is not a prefix of "100" or "101". This guarantees that you can decode the bitstream left-to-right without any ambiguity or lookahead.

❌ NOT Prefix-Free (Ambiguous!)
CharCode
A0
B01
C011
"011" → is it A+B? or C? → ambiguous!
✅ Prefix-Free (Unambiguous)
CharCode
A0
B10
C11
"0 10 11" → unambiguously A, B, C ✅

Section 10

Huffman vs Other Compression Methods

Method Type Huffman Used? Best For Limitation
Huffman Coding Entropy coding Core algorithm Skewed character distributions Single characters only; misses patterns
Run-Length Encoding Pattern coding No Long repeated sequences (AAAAAAA) Fails on random / varied data
LZ77 / LZ78 Dictionary coding No Repeated substrings / patterns Slower to decode; larger memory
DEFLATE (ZIP/GZIP) Combined As final stage General files; web transfer Two-pass; not ideal for streaming
Arithmetic Coding Entropy coding No Maximum theoretical compression Slower; patent-encumbered historically
Adaptive Huffman Entropy coding Dynamic variant Streaming; unknown distributions More complex; tree updates on-the-fly
⚠️
When Huffman Coding Doesn't Help

Huffman coding achieves no compression (or even slight expansion) when all characters appear with equal frequency. In such a "flat" distribution every code is the same length, and the Huffman table itself adds overhead. For truly random data, no lossless compression algorithm can reduce size — this is proved by Shannon's source coding theorem.


Section 11

Real-World Implementation — Compress a File

import os, struct, pickle

def compress_file(input_path, output_path):
    # ── Read the raw file ───────────────────────────────────────
    with open(input_path, 'rb') as f:
        data = f.read()

    text = data.decode('latin-1')    # bytes → string (all 256 byte values)

    # ── Build tree and encode ───────────────────────────────────
    encoded, codebook, root = huffman_encode(text)

    # ── Pack bits into bytes (8 bits per byte) ──────────────────
    padded_len = (8 - len(encoded) % 8) % 8   # bits to pad
    encoded   += '0' * padded_len               # right-pad to byte boundary

    byte_array = bytearray()
    for i in range(0, len(encoded), 8):
        byte_array.append(int(encoded[i:i+8], 2))  # binary → int

    # ── Write: [padding_length(1 byte)] [codebook] [data] ───────
    with open(output_path, 'wb') as f:
        f.write(struct.pack('B', padded_len))    # 1 byte: padding info
        codebook_bytes = pickle.dumps(codebook)
        f.write(struct.pack('I', len(codebook_bytes)))  # 4 bytes: table length
        f.write(codebook_bytes)                     # codebook itself
        f.write(bytes(byte_array))                  # compressed data

    orig_size = os.path.getsize(input_path)
    comp_size = os.path.getsize(output_path)
    print(f"Original : {orig_size:,} bytes")
    print(f"Compressed: {comp_size:,} bytes")
    print(f"Ratio    : {100*(1-comp_size/orig_size):.1f}% saved")


# ── Quick test with a large repetitive file ─────────────────
with open("test.txt", "w") as f:
    f.write("AABABCABCDABCDE" * 1000)

compress_file("test.txt", "test.huf")
OUTPUT
Original : 15,000 bytes Compressed: 5,284 bytes Ratio : 64.8% saved

Section 12

Golden Rules — What Every Developer Must Know

📋 Huffman Coding — Non-Negotiable Rules
1
The Huffman tree must always be sent with the compressed data. Without the codebook (or the tree), the receiver cannot decode the bitstream. In file formats like ZIP, the codebook is stored in the file header.
2
Huffman coding is optimal for symbol-by-symbol encoding — it achieves the minimum average code length when codes are restricted to integer bit lengths. For fractional efficiency gains, use arithmetic coding.
3
The padding length must be stored. The last byte of compressed data is almost always partially filled. Store how many padding zeros were added so the decoder knows when to stop.
4
Canonical Huffman Codes are used in practice (ZIP, DEFLATE). They fix tie-breaking and code assignment so only code lengths — not the full tree — need to be transmitted, saving header space.
5
For small or uniform data, Huffman can be counterproductive. If the codebook overhead exceeds the bit savings, the "compressed" file is larger than the original. Always check the ratio before writing the file.
6
Adaptive Huffman (FGK or Vitter algorithm) builds the tree on the fly during encoding without a pre-pass. This is used when you cannot buffer the entire input first — e.g., live network streams.
🎓
You Now Understand Huffman Coding

You can count frequencies, build a min-heap priority queue, merge nodes to grow a Huffman tree, assign prefix-free binary codes by tree traversal, encode and decode strings, and implement a real file compressor in Python. These are the exact skills used inside ZIP, JPEG, PNG, MP3, and HTTP/2 every single day.

You have completed Advance Algorithms. View all sections →