Huffman Encoding: Algorithm Guide

A step-by-step explanation of the Huffman coding algorithm.

1. Introduction to Huffman Coding

Huffman coding is an entropy encoding algorithm used for lossless data compression, developed by David A. Huffman in 1952. The key insight of Huffman coding is that more frequently occurring symbols are encoded with fewer bits than less frequently occurring symbols.

2. Core Concepts

2.1 Prefix Codes

Huffman codes are prefix codes, meaning no code word is a prefix of any other code word. This allows for unambiguous decoding.

Example of a prefix code:

A = 0, B = 10, C = 11 (no code word begins any other code word)

Example of a non-prefix code:

A = 0, B = 1, C = 01 (A's code is a prefix of C's code, so the bit stream 01 could be read either as C or as A followed by B)
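
The decoding guarantee is easy to check in code. Below is a minimal Python sketch (the decode helper and the code table are illustrative, not from any library):

prefix_code = {'A': '0', 'B': '10', 'C': '11'}

def decode(bits, code):
    # Greedily match code words from left to right. This works only when
    # no code word is a prefix of another; with the non-prefix code above,
    # the very first '0' would already be ambiguous.
    inverse = {v: k for k, v in code.items()}
    out, current = [], ''
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ''
    return ''.join(out)

print(decode('01011', prefix_code))  # -> 'ABC', the only possible reading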

2.2 Variable-Length Codes

Unlike fixed-length encoding (like ASCII, where each character uses 8 bits), Huffman coding uses variable-length codes where:

  • frequently occurring characters receive shorter codes
  • rarely occurring characters receive longer codes

3. The Huffman Algorithm

3.1 Frequency Analysis

The first step is to count how often each character appears in the input:

Example text: "ABBCCCDDDDEEEEE"
Frequencies: 'A':1, 'B':2, 'C':3, 'D':4, 'E':5
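
Counting frequencies takes a single pass over the input. A minimal sketch in Python, using collections.Counter from the standard library:

from collections import Counter

text = "ABBCCCDDDDEEEEE"
frequencies = Counter(text)  # one pass over the input
print(frequencies)  # Counter({'E': 5, 'D': 4, 'C': 3, 'B': 2, 'A': 1})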

3.2 Building the Huffman Tree

The Huffman tree is built from the bottom up:

  1. Create leaf nodes: Create a node for each character with its frequency as the weight.
  2. Build a priority queue: Insert all nodes into a min-priority queue ordered by frequency (lowest first).
  3. Merge nodes: Repeatedly remove the two nodes with the lowest frequencies and create a new internal node with these two as children. The frequency of this new node is the sum of its children’s frequencies.
  4. Continue merging: Repeat step 3 until only one node remains (the root).

Visual Example:

Starting with frequencies: 'A':1, 'B':2, 'C':3, 'D':4, 'E':5

Initial nodes in priority queue: [A:1, B:2, C:3, D:4, E:5]

Step 1: Remove A:1 and B:2
        Create internal node AB:3
        Queue: [AB:3, C:3, D:4, E:5]

Step 2: Remove AB:3 and C:3 (tie broken arbitrarily)
        Create internal node ABC:6
        Queue: [D:4, E:5, ABC:6]

Step 3: Remove D:4 and E:5
        Create internal node DE:9
        Queue: [ABC:6, DE:9]

Step 4: Remove ABC:6 and DE:9
        Create internal node ABCDE:15
        Queue: [ABCDE:15]

Detailed Tree Construction Process:

Let’s trace how the tree is built through the steps above, visualizing each step to show how nodes are merged.

Initial Leaf Nodes:

A:1   B:2   C:3   D:4   E:5
(Each character is a separate leaf node)

After Step 1: We’ve removed A:1 and B:2 and created AB:3

  AB:3   C:3   D:4   E:5
 /    \
A:1    B:2

After Step 2: We’ve removed AB:3 and C:3 and created ABC:6

     ABC:6      D:4   E:5
    /     \
  AB:3     C:3
 /    \
A:1    B:2

After Step 3: We’ve removed D:4 and E:5 and created DE:9

     ABC:6        DE:9
    /     \      /    \
  AB:3     C:3  D:4    E:5
 /    \
A:1    B:2

After Step 4 (Final Tree): We’ve removed ABC:6 and DE:9 and created ABCDE:15

            ABCDE:15
           /        \
        ABC:6      DE:9
       /    \     /    \
    AB:3    C:3  D:4   E:5
   /   \
 A:1   B:2

Important Tree Construction Rules:

  1. When two nodes are merged, the new internal node has the sum of the frequencies of its children.
  2. The lower frequency node typically becomes the left child, and the higher frequency node becomes the right child (though this convention can vary in implementations).
  3. Each internal node is labeled with the combined characters of its children and their summed frequency (e.g., AB:3 means this node represents characters A and B with a combined frequency of 3).
  4. The final tree has all original characters as leaf nodes, and all internal nodes represent merged groups of characters.
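
The merging loop maps naturally onto a binary min-heap. Here is a minimal sketch in Python using heapq, assuming the frequencies from step 3.1; the Node class and the tie-breaking counter are illustrative choices, not part of any library:

import heapq
from collections import Counter

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq = freq   # character count, or the sum of the children's counts
        self.char = char   # set only on leaf nodes
        self.left = left
        self.right = right

def build_tree(frequencies):
    # Heap entries are (frequency, tie-breaker, node); the tie-breaker
    # keeps heapq from ever comparing Node objects directly.
    heap = [(freq, i, Node(freq, char))
            for i, (char, freq) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # lowest frequency
        f2, _, right = heapq.heappop(heap)  # second lowest
        heapq.heappush(heap, (f1 + f2, counter, Node(f1 + f2, left=left, right=right)))
        counter += 1
    return heap[0][2]  # the root

root = build_tree(Counter("ABBCCCDDDDEEEEE"))
print(root.freq)  # 15

Note that frequency ties (such as AB:3 versus C:3 in step 2) may be broken differently here than in the walkthrough. The resulting tree, and therefore the individual code words, can differ, but every Huffman tree is optimal, so the total encoded length is the same.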

3.3 Generating Codes

Once the tree is built, we assign a bit to each branch:

  • '0' to every left branch
  • '1' to every right branch

The path from the root to each leaf node gives the Huffman code for the character at that leaf:

A: left, left, left → 000
B: left, left, right → 001
C: left, right → 01
D: right, left → 10
E: right, right → 11

This matches our tree structure:

            ABCDE
           /      \
         ABC      DE
        /   \    /  \
      AB     C  D    E
     /  \
    A    B
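
Generating the table is a depth-first walk of the tree that appends '0' for each left branch and '1' for each right branch. A minimal sketch, assuming the Node class and the root returned by build_tree in section 3.2:

def build_codes(node, prefix="", table=None):
    if table is None:
        table = {}
    if node.char is not None:
        # Leaf: the path taken so far is this character's code. The "0"
        # fallback covers a single-leaf tree (see section 5.2).
        table[node.char] = prefix or "0"
    else:
        build_codes(node.left, prefix + "0", table)
        build_codes(node.right, prefix + "1", table)
    return table

codes = build_codes(root)
print(codes)
# With the tie-breaking in section 3.2's sketch, this prints
# {'C': '00', 'A': '010', 'B': '011', 'D': '10', 'E': '11'}: different
# code words than the walkthrough, but the same code lengths.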

3.4 Encoding the Text

To encode, we replace each character with its code:

"ABBCCCDDDDEEEEE" becomes:
A      B      B      C     C     C     D     D     D     D     E     E     E     E     E
000    001    001    01    01    01    10    10    10    10    11    11    11    11    11

Final encoded: 000001001010101101010101111111111
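
Encoding is then one table lookup per character. A minimal sketch, with the table written out literally to match the walkthrough (a real encoder would pack the bits into bytes rather than keep a string of '0'/'1' characters):

codes = {'A': '000', 'B': '001', 'C': '01', 'D': '10', 'E': '11'}
text = "ABBCCCDDDDEEEEE"

encoded = ''.join(codes[ch] for ch in text)
print(encoded)       # 000001001010101101010101111111111
print(len(encoded))  # 33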

3.5 Calculating Compression

Original size: 8 bits per character × 15 characters = 120 bits
Encoded size: 33 bits
Compression ratio: 120/33 ≈ 3.64:1
Space savings: (1 - 33/120) × 100% ≈ 72.5%

4. Decoding Process

Decoding is straightforward with the Huffman tree:

  1. Start at the root of the tree
  2. Read encoded bits one by one:
    • If ‘0’, go to left child
    • If ‘1’, go to right child
  3. When you reach a leaf node, output its character and return to the root
  4. Continue until all bits are processed

Example decoding the first few bits of “000001001…”:

bit '0' → go left (to ABC)
bit '0' → go left (to AB)
bit '0' → go left (to A) → reach 'A' → output 'A', return to root

bit '0' → go left (to ABC)
bit '0' → go left (to AB)
bit '1' → go right (to B) → reach 'B' → output 'B', return to root

...and so on
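
The same walk takes only a few lines of code. A minimal sketch, assuming the Node class from section 3.2; the tree is written out by hand so that it matches the walkthrough exactly:

def decode(bits, root):
    out = []
    node = root
    for bit in bits:
        node = node.left if bit == '0' else node.right
        if node.char is not None:  # reached a leaf
            out.append(node.char)
            node = root            # return to the root
    return ''.join(out)

# The walkthrough's tree from section 3.2:
root = Node(15,
            left=Node(6,
                      left=Node(3, left=Node(1, 'A'), right=Node(2, 'B')),
                      right=Node(3, 'C')),
            right=Node(9, left=Node(4, 'D'), right=Node(5, 'E')))

print(decode("000001001010101101010101111111111", root))  # ABBCCCDDDDEEEEE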

5. Edge Cases and Implementation Notes

5.1 Empty Input

An empty input has no characters to count, so there is no tree to build. An implementation should check for this case explicitly and produce empty output instead of trying to construct a tree from zero nodes.

5.2 Single Character Input

If the input contains only one distinct character, the tree is a single leaf, the root-to-leaf path is empty, and the naive code would be zero bits long. A common convention is to assign that character a one-bit code such as '0', so the encoded output still contains one bit per occurrence (the build_codes sketch in section 3.3 handles this case).

6. Practical Applications

Huffman coding is used in:

  • DEFLATE compression (ZIP, gzip, PNG), where it is combined with LZ77
  • JPEG image compression, as the entropy-coding stage
  • MP3 audio compression
  • HTTP/2 header compression (HPACK)

By understanding Huffman coding, you’re learning a fundamental algorithm with widespread real-world applications!