Huffman Encoding: Assignment
In this assignment, you will implement Huffman coding, a lossless data compression algorithm that assigns variable-length codes to input characters based on their frequencies.
Learning Objectives
In this assignment you will:
- Understand the principles of lossless compression
- Implement a Huffman tree data structure
- Apply frequency analysis to text
- Generate optimal prefix codes
- Encode and decode text using Huffman coding
Huffman Algorithm: Step-by-Step Description
See the detailed explanation here:
Huffman Coding: Algorithm Guide
Assignment Requirements
Part 1: Huffman Tree Construction
- Use the provided
HuffmanNode
class - Complete the
build_huffman_tree
function that:- Creates a leaf node for each character
- Uses a priority queue to build the tree
- Returns the root of the Huffman tree
Part 2: Code Generation
- Complete the
generate_codes
function to traverse the Huffman tree and create codes - Store the codes in a dictionary mapping characters to their binary codes
Part 4: Encoding and Decoding
- Implement the
encode
function to convert text to Huffman codes - Implement the
decode
function to convert encoded bits back to text
Testing
The provided code will display the following to help you troubleshoot:
- Character frequency count
- Assigned Huffman codes
- Encoded text
- Decoded text (and whether they match)
- A visualization of your Huffman tree
- Some statistics about the efficiency of the compression
Sample Texts for Testing
Test your code with the following:
Simple test:
hello world
English text:
The quick brown fox jumps over the lazy dog. The five boxing wizards jump quickly. How vexingly quick daft zebras jump!
Repeated patterns:
aaaaaabbbbbccccddeeeeeeeeeeeee
Code snippet:
def hello_world():
print("Hello, World!")
return True
Hints and Tips
Using the heapq
Module
The Python heapq
module provides an efficient implementation of the min-heap data structure, perfect for building Huffman trees:
-
Importing the module:
import heapq
-
Creating and using a heap:
# Create nodes nodes = [HuffmanNode('a', 10), HuffmanNode('b', 5), HuffmanNode('c', 15)] # Convert list to a heap heapq.heapify(nodes) # Optional step when starting with a list # Add an element to the heap heapq.heappush(nodes, HuffmanNode('d', 7)) # Get and remove the smallest element smallest_node = heapq.heappop(nodes)
-
Important note: The
heapq
module requires that elements can be compared. For yourHuffmanNode
class, you need a__lt__
method (less than) to define how nodes should be compared:def __lt__(self, other): # Compare nodes based on frequency return self.freq < other.freq
Other Helpful Tips
- Think recursively when generating codes and traversing the tree
- Test with simple examples first before moving to more complex texts
- Remember to handle edge cases (empty strings, single character texts)
- For debugging, print the tree structure at each step
Bonus Challenges (Optional)
- Add support for compressing/decompressing actual text files
- Compare your implementation with Python’s built-in compression libraries (like
zlib
)