Huffman Encoding: Algorithm Guide
A step-by-step explanation of the Huffman coding algorithm.
1. Introduction to Huffman Coding
Huffman coding is an entropy encoding algorithm used for lossless data compression, developed by David A. Huffman in 1952. The key insight of Huffman coding is that more frequently occurring symbols are encoded with fewer bits than less frequently occurring symbols.
2. Core Concepts
2.1 Prefix Codes
Huffman codes are prefix codes, meaning no code word is a prefix of any other code word. This allows for unambiguous decoding.
Example of a prefix code:
- ‘A’ → ‘0’
- ‘B’ → ‘10’
- ‘C’ → ‘110’
- ‘D’ → ‘111’
Example of a non-prefix code:
- ‘A’ → ‘0’
- ‘B’ → ‘01’ (problematic because ‘0’ is a prefix of ‘01’: after reading a ‘0’, the decoder cannot immediately tell whether it has seen a complete ‘A’ or the start of a ‘B’)
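To make the prefix property concrete, here is a minimal Python sketch (the helper name is_prefix_free is an illustrative choice, not a library function) that checks whether a code table is prefix-free:

```python
def is_prefix_free(codes):
    """Return True if no code word is a prefix of any other code word."""
    words = list(codes.values())
    return not any(i != j and w.startswith(v)
                   for i, w in enumerate(words)
                   for j, v in enumerate(words))

prefix_code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
non_prefix_code = {'A': '0', 'B': '01'}

print(is_prefix_free(prefix_code))      # True
print(is_prefix_free(non_prefix_code))  # False: '0' is a prefix of '01'
```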
2.2 Variable-Length Codes
Unlike fixed-length encoding (such as ASCII, where each character is typically stored in 8 bits), Huffman coding uses variable-length codes where:
- Common characters get shorter codes
- Rare characters get longer codes
3. The Huffman Algorithm
3.1 Frequency Analysis
The first step is to count how often each character appears in the input:
Example text: "ABBCCCDDDDEEEEE"
Frequencies: 'A':1, 'B':2, 'C':3, 'D':4, 'E':5
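In Python, this step is a one-liner with collections.Counter (a minimal sketch; any frequency-counting method works):

```python
from collections import Counter

text = "ABBCCCDDDDEEEEE"
frequencies = Counter(text)
print(frequencies)  # Counter({'E': 5, 'D': 4, 'C': 3, 'B': 2, 'A': 1})
```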
3.2 Building the Huffman Tree
The Huffman tree is built from the bottom up:
1. Create leaf nodes: create a node for each character, with its frequency as the weight.
2. Build a priority queue: insert all nodes into a min-priority queue ordered by frequency.
3. Merge nodes: remove the two nodes with the lowest frequencies and create a new internal node with these two as children. The frequency of this new node is the sum of its children’s frequencies.
4. Continue merging: repeat step 3 until only one node remains (the root). A runnable sketch of these steps follows this list.
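Here is a minimal Python sketch of these steps using the standard heapq module. The Node class and the name build_huffman_tree are choices made for this illustration, not a fixed API, and ties may be broken differently than in the worked example below (which changes the tree’s shape but not the optimality of the resulting codes):

```python
import heapq
import itertools

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq = freq    # weight: total frequency of this subtree's characters
        self.char = char    # set only on leaf nodes
        self.left = left
        self.right = right

def build_huffman_tree(frequencies):
    """Build a Huffman tree from a {char: frequency} mapping."""
    ticket = itertools.count()  # tie-breaker so the heap never compares Nodes
    heap = [(freq, next(ticket), Node(freq, char))
            for char, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Remove the two lowest-frequency nodes ...
        f1, _, n1 = heapq.heappop(heap)
        f2, _, n2 = heapq.heappop(heap)
        # ... and merge them under a new internal node.
        merged = Node(f1 + f2, left=n1, right=n2)
        heapq.heappush(heap, (f1 + f2, next(ticket), merged))
    return heap[0][2]  # the root

root = build_huffman_tree({'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5})
print(root.freq)  # 15
```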
Visual Example:
Starting with frequencies: ‘A’:1, ‘B’:2, ‘C’:3, ‘D’:4, ‘E’:5
Initial nodes in priority queue: [A:1, B:2, C:3, D:4, E:5]
Step 1: Remove A:1 and B:2
Create internal node AB:3
Queue: [AB:3, C:3, D:4, E:5]
Step 2: Remove AB:3 and C:3 (tie broken arbitrarily)
Create internal node ABC:6
Queue: [D:4, E:5, ABC:6]
Step 3: Remove D:4 and E:5
Create internal node DE:9
Queue: [ABC:6, DE:9]
Step 4: Remove ABC:6 and DE:9
Create internal node ABCDE:15
Queue: [ABCDE:15]
Detailed Tree Construction Process:
Let’s trace how the tree is built throughout the steps above. We’ll visualize each step to show how nodes are merged.
Initial Leaf Nodes:
A:1 B:2 C:3 D:4 E:5
(Each character is a separate leaf node)
After Step 1: We’ve removed A:1 and B:2 and created AB:3

  AB:3    C:3    D:4    E:5
  /  \
A:1  B:2
After Step 2: We’ve removed AB:3 and C:3 and created ABC:6

     ABC:6      D:4    E:5
     /   \
  AB:3   C:3
  /  \
A:1  B:2
After Step 3: We’ve removed D:4 and E:5 and created DE:9

     ABC:6        DE:9
     /   \        /  \
  AB:3   C:3   D:4   E:5
  /  \
A:1  B:2
After Step 4 (Final Tree): We’ve removed ABC:6 and DE:9 and created ABCDE:15

        ABCDE:15
        /      \
     ABC:6     DE:9
     /   \     /  \
  AB:3   C:3  D:4  E:5
  /  \
A:1  B:2
Important Tree Construction Rules:
- When two nodes are merged, the new internal node has the sum of the frequencies of its children.
- The lower frequency node typically becomes the left child, and the higher frequency node becomes the right child (though this convention can vary in implementations).
- Each internal node is labeled with the combined characters of its children and their summed frequency (e.g., AB:3 means this node represents characters A and B with a combined frequency of 3).
- The final tree has all original characters as leaf nodes, and all internal nodes represent merged groups of characters.
3.3 Generating Codes
Once the tree is built, we assign:
- ‘0’ to each left branch
- ‘1’ to each right branch
The path from the root to each leaf node gives the Huffman code for the character at that leaf:
A: left, left, left → 000
B: left, left, right → 001
C: left, right → 01
D: right, left → 10
E: right, right → 11
This matches our tree structure:
       ABCDE
      /     \
    ABC      DE
   /   \    /  \
  AB    C  D    E
 /  \
A    B
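Continuing the build_huffman_tree sketch from Section 3.2, a recursive traversal collects the codes (generate_codes is again an illustrative name; with a different tie-break the exact codes may differ, though the code lengths stay optimal):

```python
def generate_codes(node, prefix="", codes=None):
    """Walk the tree, appending '0' on left branches and '1' on right branches."""
    if codes is None:
        codes = {}
    if node.char is not None:
        # Leaf: the path from the root is this character's code.
        codes[node.char] = prefix or "0"  # "0" covers a one-node tree (see Section 5.2)
    else:
        generate_codes(node.left, prefix + "0", codes)
        generate_codes(node.right, prefix + "1", codes)
    return codes

codes = generate_codes(root)
# With the tree shown above: {'A': '000', 'B': '001', 'C': '01', 'D': '10', 'E': '11'}
```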
3.4 Encoding the Text
To encode, we replace each character with its code:
"ABBCCCDDDDEEEEE" becomes:
A   B   B   C  C  C  D  D  D  D  E  E  E  E  E
000 001 001 01 01 01 10 10 10 10 11 11 11 11 11
Final encoded: 000001001010101101010101111111111
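Encoding is then one table lookup per character. This standalone sketch hardcodes the code table derived above:

```python
codes = {'A': '000', 'B': '001', 'C': '01', 'D': '10', 'E': '11'}
text = "ABBCCCDDDDEEEEE"

encoded = "".join(codes[ch] for ch in text)
print(encoded)       # 000001001010101101010101111111111
print(len(encoded))  # 33
```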
3.5 Calculating Compression
Original size: 8 bits per character × 15 characters = 120 bits
Encoded size: 33 bits
Compression ratio: 120/33 ≈ 3.64:1
Space savings: (1 − 33/120) × 100% ≈ 72.5%
(These figures ignore the space needed to store the tree or code table, which a real compressor must also include in its output.)
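The same arithmetic, continuing the encoding snippet above:

```python
original_bits = 8 * len(text)   # 8 bits per character × 15 characters = 120
encoded_bits = len(encoded)     # 33

print(original_bits / encoded_bits)              # ≈ 3.64 (compression ratio)
print((1 - encoded_bits / original_bits) * 100)  # 72.5 (% space savings)
```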
4. Decoding Process
Decoding is straightforward with the Huffman tree:
- Start at the root of the tree
- Read encoded bits one by one:
- If ‘0’, go to left child
- If ‘1’, go to right child
- When you reach a leaf node, output its character and return to the root
- Continue until all bits are processed
Example decoding the first few bits of “000001001…”:
bit '0' → go left (to ABC)
bit '0' → go left (to AB)
bit '0' → go left (to A) → reach 'A' → output 'A', return to root
bit '0' → go left (to ABC)
bit '0' → go left (to AB)
bit '1' → go right (to B) → reach 'B' → output 'B', return to root
...and so on
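A sketch of this loop, reusing the Node tree from the build_huffman_tree sketch in Section 3.2 (and assuming its tie-breaking produced the same tree as the text; otherwise the bit string and tree change together):

```python
def decode(bits, root):
    """Walk from the root; emit a character at each leaf, then restart."""
    out = []
    node = root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.char is not None:   # reached a leaf
            out.append(node.char)
            node = root             # return to the root
    return "".join(out)

print(decode("000001001010101101010101111111111", root))  # ABBCCCDDDDEEEEE
```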
5. Edge Cases and Implementation Notes
5.1 Empty Input
- Return an empty result or handle as a special case
5.2 Single Character Input
- The tree degenerates to a single leaf, so the path from the root is empty; by convention the lone character is assigned the one-bit code ‘0’
- Example: “AAAAA” → Each ‘A’ could be encoded as ‘0’
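A sketch of how the two special cases might be guarded, reusing the earlier helpers (this is one common convention, not the only one):

```python
def huffman_codes(frequencies):
    if not frequencies:           # empty input: nothing to encode
        return {}
    if len(frequencies) == 1:     # single distinct character: one-bit code '0'
        (char,) = frequencies
        return {char: "0"}
    return generate_codes(build_huffman_tree(frequencies))

print(huffman_codes({}))        # {}
print(huffman_codes({'A': 5}))  # {'A': '0'}
```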
6. Practical Applications
Huffman coding is used in:
- File compression formats (ZIP’s DEFLATE combines LZ77 with Huffman coding)
- Image formats (baseline JPEG uses Huffman coding in its entropy-coding stage)
- MP3 audio encoding
- Data transmission to reduce bandwidth
By understanding Huffman coding, you’re learning a fundamental algorithm with widespread real-world applications!