The Huffman Coding algorithm is used to implement lossless compression. For the purpose of this blog post, we will investigate how this algorithm can be implemented to encode/compress textual information.
The principle of this algorithm is to replace each character (symbols) of a piece of text with a unique binary code. However the codes generated may have different lengths. In order to optimise the compression process, the idea behind the Huffman Coding approach is to associate shorter codes to the most frequently used symbols and longer codes to the less frequently used symbols.
So let’s consider the following message which consists of 20 characters:
Note that the Huffman coding algorithm is normally used to compress larger amount of data and would not really be used to compress such a small message!
Step 1: Frequency Analysis
The first step of the Huffman Coding algorithm is to complete a frequency analysis of this message by:
- Identifying all the symbols used in the message,
- Identifying their weights (or frequencies) by counting their occurrences (the number of times they appear) within the message,
- Sorting this list of symbols in ascending order of their weights.
Step 2: Organising Symbols as a Binary Tree
To do so each symbol becomes a node storing the symbol itself and its weight.
Then the tree is constructed through the following iterative process:
Step 3: Generating the Huffman Codes
Using the above tree, we can now identify the Huffman code for each symbol (leaves of the tree. This is done by highlighting the path from the Root node of the tree to the required leave/symbol. The labels of each branch are concatenated to form the Huffman code for the symbol.
As you can see, most frequent symbols are closer to the root node of the tree, resulting in shorter Huffman codes.
The resulting Huffman Codes for our message are:
Encoding the message
We can now encode the message by replacing each symbol with its matching Huffman code.
Here is a Python implementation of the Huffman Coding algorithm:
To complete the above code, we need an extra method to our HuffmanEncoder class called decode().
This method should take two parameters:
- An encoded message (Binary string)
- the Huffman codes used to encode the message (Using a python dictionary)
Using the above information, the decode() method would then decode the message one character at a time.