Huffman Encoding: Algorithm Guide
A step-by-step explanation of the Huffman coding algorithm.
1. Introduction to Huffman Coding
Huffman coding is an entropy encoding algorithm used for lossless data compression, developed by David A. Huffman in 1952. The key insight of Huffman coding is that more frequently occurring symbols are encoded with fewer bits than less frequently occurring symbols.
2. Core Concepts
2.1 Prefix Codes
Huffman codes are prefix codes, meaning no code word is a prefix of any other code word. This allows for unambiguous decoding.
Example of a prefix code:
- ‘A’ → ‘0’
- ‘B’ → ‘10’
- ‘C’ → ‘110’
- ‘D’ → ‘111’
Example of a non-prefix code:
- ‘A’ → ‘0’
- ‘B’ → ‘01’ (problematic because ‘0’ is a prefix of ‘01’: after reading a ‘0’, the decoder cannot immediately tell whether it has seen a complete ‘A’ or the start of a ‘B’)
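To make the prefix property concrete, here is a minimal Python sketch (the helper name is_prefix_free is an illustrative choice, not a library function) that checks whether a code table is prefix-free:

```python
def is_prefix_free(codes):
    """Return True if no code word is a prefix of any other code word."""
    words = list(codes.values())
    return not any(i != j and w.startswith(v)
                   for i, w in enumerate(words)
                   for j, v in enumerate(words))

prefix_code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
non_prefix_code = {'A': '0', 'B': '01'}

print(is_prefix_free(prefix_code))      # True
print(is_prefix_free(non_prefix_code))  # False: '0' is a prefix of '01'
```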
2.2 Variable-Length Codes
Unlike fixed-length encoding (such as ASCII, where each character is typically stored in 8 bits), Huffman coding uses variable-length codes where:
- Common characters get shorter codes
- Rare characters get longer codes
3. The Huffman Algorithm
3.1 Frequency Analysis
The first step is to count how often each character appears in the input:
Example text: "ABBCCCDDDDEEEEE"
Frequencies: 'A':1, 'B':2, 'C':3, 'D':4, 'E':5
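In Python, this step is a one-liner with collections.Counter (a minimal sketch; any frequency-counting method works):

```python
from collections import Counter

text = "ABBCCCDDDDEEEEE"
frequencies = Counter(text)
print(frequencies)  # Counter({'E': 5, 'D': 4, 'C': 3, 'B': 2, 'A': 1})
```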
3.2 Building the Huffman Tree
The Huffman tree is built from the bottom up:
1. Create leaf nodes: create a node for each character, with its frequency as the weight.
2. Build a priority queue: insert all nodes into a min-priority queue ordered by frequency.
3. Merge nodes: remove the two nodes with the lowest frequencies and create a new internal node with these two as children. The frequency of this new node is the sum of its children’s frequencies.
4. Continue merging: repeat step 3 until only one node remains (the root). A runnable sketch of these steps follows this list.
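Here is a minimal Python sketch of these steps using the standard heapq module. The Node class and the name build_huffman_tree are choices made for this illustration, not a fixed API, and ties may be broken differently than in the worked example below (which changes the tree’s shape but not the optimality of the resulting codes):

```python
import heapq
import itertools

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq = freq    # weight: total frequency of this subtree's characters
        self.char = char    # set only on leaf nodes
        self.left = left
        self.right = right

def build_huffman_tree(frequencies):
    """Build a Huffman tree from a {char: frequency} mapping."""
    ticket = itertools.count()  # tie-breaker so the heap never compares Nodes
    heap = [(freq, next(ticket), Node(freq, char))
            for char, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Remove the two lowest-frequency nodes ...
        f1, _, n1 = heapq.heappop(heap)
        f2, _, n2 = heapq.heappop(heap)
        # ... and merge them under a new internal node.
        merged = Node(f1 + f2, left=n1, right=n2)
        heapq.heappush(heap, (f1 + f2, next(ticket), merged))
    return heap[0][2]  # the root

root = build_huffman_tree({'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5})
print(root.freq)  # 15
```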
Visual Example:
Starting with frequencies: ‘A’:1, ‘B’:2, ‘C’:3, ‘D’:4, ‘E’:5
Initial nodes in priority queue: [A:1, B:2, C:3, D:4, E:5]
Step 1: Remove A:1 and B:2
Create internal node AB:3
Queue: [AB:3, C:3, D:4, E:5]
Step 2: Remove AB:3 and C:3 (tie broken arbitrarily)
Create internal node ABC:6
Queue: [D:4, E:5, ABC:6]
Step 3: Remove D:4 and E:5
Create internal node DE:9
Queue: [ABC:6, DE:9]
Step 4: Remove ABC:6 and DE:9
Create internal node ABCDE:15
Queue: [ABCDE:15]
Detailed Tree Construction Process:
Let’s trace how the tree is built throughout the steps above. We’ll visualize each step to show how nodes are merged.
Initial Leaf Nodes:
A:1 B:2 C:3 D:4 E:5
(Each character is a separate leaf node)
After Step 1: We’ve removed A:1 and B:2 and created AB:3

  AB:3    C:3    D:4    E:5
  /  \
A:1  B:2
After Step 2: We’ve removed AB:3 and C:3 and created ABC:6

     ABC:6      D:4    E:5
     /   \
  AB:3   C:3
  /  \
A:1  B:2
After Step 3: We’ve removed D:4 and E:5 and created DE:9

     ABC:6        DE:9
     /   \        /  \
  AB:3   C:3   D:4   E:5
  /  \
A:1  B:2
After Step 4 (Final Tree): We’ve removed ABC:6 and DE:9 and created ABCDE:15

        ABCDE:15
        /      \
     ABC:6     DE:9
     /   \     /  \
  AB:3   C:3  D:4  E:5
  /  \
A:1  B:2
Important Tree Construction Rules:
- When two nodes are merged, the new internal node has the sum of the frequencies of its children.
- The lower frequency node typically becomes the left child, and the higher frequency node becomes the right child (though this convention can vary in implementations).
- Each internal node is labeled with the combined characters of its children and their summed frequency (e.g., AB:3 means this node represents characters A and B with a combined frequency of 3).
- The final tree has all original characters as leaf nodes, and all internal nodes represent merged groups of characters.
3.3 Generating Codes
Once the tree is built, we assign:
- ‘0’ to each left branch
- ‘1’ to each right branch
The path from the root to each leaf node gives the Huffman code for the character at that leaf:
A: left, left, left → 000
B: left, left, right → 001
C: left, right → 01
D: right, left → 10
E: right, right → 11
This matches our tree structure:
       ABCDE
      /     \
    ABC      DE
   /   \    /  \
  AB    C  D    E
 /  \
A    B
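Continuing the build_huffman_tree sketch from Section 3.2, a recursive traversal collects the codes (generate_codes is again an illustrative name; with a different tie-break the exact codes may differ, though the code lengths stay optimal):

```python
def generate_codes(node, prefix="", codes=None):
    """Walk the tree, appending '0' on left branches and '1' on right branches."""
    if codes is None:
        codes = {}
    if node.char is not None:
        # Leaf: the path from the root is this character's code.
        codes[node.char] = prefix or "0"  # "0" covers a one-node tree (see Section 5.2)
    else:
        generate_codes(node.left, prefix + "0", codes)
        generate_codes(node.right, prefix + "1", codes)
    return codes

codes = generate_codes(root)
# With the tree shown above: {'A': '000', 'B': '001', 'C': '01', 'D': '10', 'E': '11'}
```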
3.4 Encoding the Text
To encode, we replace each character with its code:
"ABBCCCDDDDEEEEE" becomes:
A   B   B   C  C  C  D  D  D  D  E  E  E  E  E
000 001 001 01 01 01 10 10 10 10 11 11 11 11 11
Final encoded: 000001001010101101010101111111111
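Encoding is then one table lookup per character. This standalone sketch hardcodes the code table derived above:

```python
codes = {'A': '000', 'B': '001', 'C': '01', 'D': '10', 'E': '11'}
text = "ABBCCCDDDDEEEEE"

encoded = "".join(codes[ch] for ch in text)
print(encoded)       # 000001001010101101010101111111111
print(len(encoded))  # 33
```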
3.5 Calculating Compression
Original size: 8 bits per character × 15 characters = 120 bits
Encoded size: 33 bits
Compression ratio: 120/33 ≈ 3.64:1
Space savings: (1 − 33/120) × 100% ≈ 72.5%
(These figures ignore the space needed to store the tree or code table, which a real compressor must also include in its output.)
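The same arithmetic, continuing the encoding snippet above:

```python
original_bits = 8 * len(text)   # 8 bits per character × 15 characters = 120
encoded_bits = len(encoded)     # 33

print(original_bits / encoded_bits)              # ≈ 3.64 (compression ratio)
print((1 - encoded_bits / original_bits) * 100)  # 72.5 (% space savings)
```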
4. Decoding Process
Decoding is straightforward with the Huffman tree:
- Start at the root of the tree
- Read encoded bits one by one:
- If ‘0’, go to left child
- If ‘1’, go to right child
- When you reach a leaf node, output its character and return to the root
- Continue until all bits are processed
Example decoding the first few bits of “000001001…”:
bit '0' → go left (to ABC)
bit '0' → go left (to AB)
bit '0' → go left (to A) → reach 'A' → output 'A', return to root
bit '0' → go left (to ABC)
bit '0' → go left (to AB)
bit '1' → go right (to B) → reach 'B' → output 'B', return to root
...and so on
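A sketch of this loop, reusing the Node tree from the build_huffman_tree sketch in Section 3.2 (and assuming its tie-breaking produced the same tree as the text; otherwise the bit string and tree change together):

```python
def decode(bits, root):
    """Walk from the root; emit a character at each leaf, then restart."""
    out = []
    node = root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.char is not None:   # reached a leaf
            out.append(node.char)
            node = root             # return to the root
    return "".join(out)

print(decode("000001001010101101010101111111111", root))  # ABBCCCDDDDEEEEE
```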
5. Edge Cases and Implementation Notes
5.1 Empty Input
- Return an empty result or handle as a special case
5.2 Single Character Input
- The tree degenerates to a single leaf, so the path from the root is empty; by convention the lone character is assigned the one-bit code ‘0’
- Example: “AAAAA” → Each ‘A’ could be encoded as ‘0’
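A sketch of how the two special cases might be guarded, reusing the earlier helpers (this is one common convention, not the only one):

```python
def huffman_codes(frequencies):
    if not frequencies:           # empty input: nothing to encode
        return {}
    if len(frequencies) == 1:     # single distinct character: one-bit code '0'
        (char,) = frequencies
        return {char: "0"}
    return generate_codes(build_huffman_tree(frequencies))

print(huffman_codes({}))        # {}
print(huffman_codes({'A': 5}))  # {'A': '0'}
```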
6. Practical Applications
Huffman coding is used in:
- File compression formats (ZIP’s DEFLATE combines LZ77 with Huffman coding)
- Image formats (baseline JPEG uses Huffman coding in its entropy-coding stage)
- MP3 audio encoding
- Data transmission to reduce bandwidth
By understanding Huffman coding, you’re learning a fundamental algorithm with widespread real-world applications!