Lossless Compression: Huffman Coding Algorithm

The Huffman Coding algorithm is used to implement lossless compression. For the purpose of this blog post, we will investigate how this algorithm can be implemented to encode/compress textual information.

The principle of this algorithm is to replace each character (symbols) of a piece of text with a unique binary code. However the codes generated may have different lengths. In order to optimise the compression process, the idea behind the Huffman Coding approach is to associate shorter codes to the most frequently used symbols and longer codes to the less frequently used symbols.

So let’s consider the following message which consists of 20 characters:
huffman-coding-message

Note that the Huffman coding algorithm is normally used to compress larger amount of data and would not really be used to compress such a small message!

Step 1: Frequency Analysis

The first step of the Huffman Coding algorithm is to complete a frequency analysis of this message by:

  • Identifying all the symbols used in the message,
  • Identifying their weights (or frequencies) by counting their occurrences (the number of times they appear) within the message,
  • Sorting this list of symbols in ascending order of their weights.

huffman-coding-frequency-analysis

Step 2: Organising Symbols as a Binary Tree

To do so each symbol becomes a node storing the symbol itself and its weight.
Then the tree is constructed through the following iterative process:
huffman-coding-tree

Step 3: Generating the Huffman Codes

Using the above tree, we can now identify the Huffman code for each symbol (leaves of the tree. This is done by highlighting the path from the Root node of the tree to the required leave/symbol. The labels of each branch are concatenated to form the Huffman code for the symbol.
huffman-coding-path

As you can see, most frequent symbols are closer to the root node of the tree, resulting in shorter Huffman codes.

The resulting Huffman Codes for our message are:
huffman-coding-codes

Encoding the message

We can now encode the message by replacing each symbol with its matching Huffman code.
huffman-coding-encoded-message-codes

Encoded/compressed message:

010100111101000011111011010111100001101111010000111100101000110

Python Implementation

Here is a Python implementation of the Huffman Coding algorithm:

Python Task

To complete the above code, we need an extra method to our HuffmanEncoder class called decode().

This method should take two parameters:

  • An encoded message (Binary string)
  • the Huffman codes used to encode the message (Using a python dictionary)

Using the above information, the decode() method would then decode the message one character at a time.

Share Button
Tagged with: