Data compression is a highly important topic in computer
science and information theory. Many situations require the use of coding
schemes for data compression, including:
 Storage of a large file on disk.
 Storage of a large array or other dataset in memory.
 Transmission of a large piece of information over a bandwidth-limited channel such as a phone line.
 JPEG images.
 MPEG movies.
 Data compression algorithms in a modem.
In this lecture we limit our focus to Huffman
coding, a variable-length prefix encoding algorithm for compression
of character streams. Although more sophisticated algorithms are now available
and more widely used, Huffman's algorithm has many desirable qualities
and is an interesting application of binary trees.
Bits and bytes
Digital computers store data in binary, or base-2, format.
A binary digit may have one of two values:

Value | Alternate representations
0     | off, false, 0 volts (TTL)
1     | on, true, 5 volts (TTL)
A bit is a binary
digit and is the atomic unit in a digital computer.
Most modern computers do not store 1-bit numbers, however.
For example, a machine with a 32-bit operating system stores addresses
as 32-bit words.
A byte is an 8-bit number and is typically the smallest
word size on a computer. An 8-bit binary number is a base-2 representation
with 8 digits. For example, 01101101 in base 2 equals
64 + 32 + 8 + 4 + 1 = 109 in base 10.
Longer words (such as 16-bit, 32-bit, and 64-bit words)
are constructed from 8-bit bytes. Other commonly used numbering systems
in computer science are octal (base 8) and hexadecimal (base 16).
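These base relationships can be checked directly in Java, which supports binary (`0b`), octal (leading `0`), and hexadecimal (`0x`) integer literals. The sketch below (class name is illustrative) writes the same value, 109, in all four bases:

```java
public class Bases {
    public static void main(String[] args) {
        int dec = 109;        // base 10
        int bin = 0b01101101; // base 2:  64 + 32 + 8 + 4 + 1
        int oct = 0155;       // base 8:  1*64 + 5*8 + 5
        int hex = 0x6D;       // base 16: 6*16 + 13
        // All four literals denote the same number
        System.out.println(dec == bin && dec == oct && dec == hex);
    }
}
```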
Unicode and ASCII
Java stores characters as 16-bit unsigned integers according
to the Unicode character-set standard.
The Unicode set stores 2^16 = 65,536 characters and is an international
standard that includes Asian character sets.
All of the letters in the English language, as well as
numbers, punctuation, and control characters are represented in the lowest
7 bits of the Unicode set, with the upper 9 bits set to zero. This subset
of the Unicode set is known as the ASCII standard
and includes 2^7 = 128 characters. All of the characters encountered
in this course will be part of the ASCII standard (see the table below).
The following fragment of Java code

char[] charVals = {'m','n','o'};
for (int i = 0; i < charVals.length; i++)
{
    System.out.println("Character '" + charVals[i]
        + "' is stored as " + (int)charVals[i]
        + " in base-10.");
}

yields the output

Character 'm' is stored as 109 in base-10.
Character 'n' is stored as 110 in base-10.
Character 'o' is stored as 111 in base-10.
ASCII Table
|   0 NUL |   1 SOH |   2 STX |   3 ETX |   4 EOT |   5 ENQ |   6 ACK |   7 BEL |
|   8 BS  |   9 HT  |  10 NL  |  11 VT  |  12 NP  |  13 CR  |  14 SO  |  15 SI  |
|  16 DLE |  17 DC1 |  18 DC2 |  19 DC3 |  20 DC4 |  21 NAK |  22 SYN |  23 ETB |
|  24 CAN |  25 EM  |  26 SUB |  27 ESC |  28 FS  |  29 GS  |  30 RS  |  31 US  |
|  32 SP  |  33 !   |  34 "   |  35 #   |  36 $   |  37 %   |  38 &   |  39 '   |
|  40 (   |  41 )   |  42 *   |  43 +   |  44 ,   |  45 -   |  46 .   |  47 /   |
|  48 0   |  49 1   |  50 2   |  51 3   |  52 4   |  53 5   |  54 6   |  55 7   |
|  56 8   |  57 9   |  58 :   |  59 ;   |  60 <   |  61 =   |  62 >   |  63 ?   |
|  64 @   |  65 A   |  66 B   |  67 C   |  68 D   |  69 E   |  70 F   |  71 G   |
|  72 H   |  73 I   |  74 J   |  75 K   |  76 L   |  77 M   |  78 N   |  79 O   |
|  80 P   |  81 Q   |  82 R   |  83 S   |  84 T   |  85 U   |  86 V   |  87 W   |
|  88 X   |  89 Y   |  90 Z   |  91 [   |  92 \   |  93 ]   |  94 ^   |  95 _   |
|  96 `   |  97 a   |  98 b   |  99 c   | 100 d   | 101 e   | 102 f   | 103 g   |
| 104 h   | 105 i   | 106 j   | 107 k   | 108 l   | 109 m   | 110 n   | 111 o   |
| 112 p   | 113 q   | 114 r   | 115 s   | 116 t   | 117 u   | 118 v   | 119 w   |
| 120 x   | 121 y   | 122 z   | 123 {   | 124 |   | 125 }   | 126 ~   | 127 DEL |
Unicode and ASCII are examples of fixed-length
encoding schemes, since all characters require the same amount
of storage (16 and 7 bits, respectively, though ASCII characters
are typically stored in 8-bit bytes).
In contrast, Huffman coding is a
variable-length encoding scheme. The number of bits required
to store a coded character varies according to the relative
frequency or weight of that
character. Although little if any space is saved on infrequent characters,
frequently used characters require very little space (say one, two, or
three bits).
The encoding process is based on a Huffman coding tree,
which is built from the observed frequencies of characters in the document.
The document is scanned, and the number of occurrences of each character
in the document is recorded. Next, a binary tree is built in which the
external nodes store the characters and their frequency in the document.
When the binary tree is built from the document being coded, Huffman's
algorithm optimizes the length of the encoded document.
Prescanning a document and generating a customized Huffman
coding tree may be impractical, however. Often, typical frequencies of
characters in a given context are used instead of the specific frequencies
in a particular document.
Building a Huffman coding tree
Suppose that the following character frequencies are observed
in a string that we wish to encode:
character  | C  | D  | E   | F  | K | L  | U  | Z
frequency  | 32 | 42 | 120 | 24 | 7 | 42 | 37 | 2
The first step is to construct a priority queue and insert
the character-frequency (element-key) pairs into it. Assuming that we are
using a sorted sequence-based priority queue, the above table could be
schematically represented by the illustration below.
Step 1:
Building a Huffman coding tree (2)
In step 2, the two items with the lowest keys are removed
from the priority queue. A new binary tree is created with the lowest-key
item as the left external node and the second-lowest-key item as the right
external node. The new tree is then inserted back into the priority queue.
Step 2:
This process continues until only one node (tree) is left
in the priority queue.
Step 3:
Building a Huffman coding tree (3)
Step 4:
Step 5:
Building a Huffman coding tree (4)
Final tree (after n=8
steps)
Building a Huffman coding tree (5)
Algorithm Huffman(X):
   Input: String X of length n
   Output: Coding tree for X
   Compute the frequency f(c) of each character c in X.
   Initialize a priority queue Q.
   for each character c in X do
      Create a single-node tree T storing c.
      Insert T into Q with key f(c).
   while Q.size() > 1 do
      f_1 ← Q.minKey()
      T_1 ← Q.removeMinElement()
      f_2 ← Q.minKey()
      T_2 ← Q.removeMinElement()
      Create a new tree T with left subtree T_1 and right subtree T_2.
      Insert T into Q with key f_1 + f_2.
   return Q.removeMinElement()
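The algorithm above can be sketched in Java using the standard java.util.PriorityQueue in place of the sorted-sequence priority queue. The Node class and its field names here are illustrative, not from the text:

```java
import java.util.PriorityQueue;

public class HuffmanBuild {
    // Illustrative tree node: external nodes store a character,
    // internal nodes store only a combined frequency
    static class Node {
        int freq; char ch; Node left, right;
        Node(int freq, char ch) { this.freq = freq; this.ch = ch; }
        Node(int freq, Node l, Node r) { this.freq = freq; left = l; right = r; }
    }

    static Node buildTree(String x) {
        int[] count = new int[128];                  // frequencies, ASCII only
        for (char c : x.toCharArray()) count[c]++;
        PriorityQueue<Node> q = new PriorityQueue<>((a, b) -> a.freq - b.freq);
        for (char c = 0; c < 128; c++)
            if (count[c] > 0) q.add(new Node(count[c], c));
        while (q.size() > 1) {                       // merge the two lowest-key trees
            Node t1 = q.poll(), t2 = q.poll();
            q.add(new Node(t1.freq + t2.freq, t1, t2));
        }
        return q.poll();                             // the single remaining tree
    }

    public static void main(String[] args) {
        Node root = buildTree("DEEDFREE");
        // The root's key is the sum of all frequencies = length of the string
        System.out.println(root.freq);
    }
}
```

Note that java.util.PriorityQueue breaks ties arbitrarily, so the resulting tree (and hence the codes) may differ from the worked example while still being optimal.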
Decoding
A Huffman code is a prefix code, meaning that no code
is the prefix of another. To decode a bit stream from the leftmost bit,
start at the root node of the tree and move to the left child if the bit
is "0" or the right child if the bit is "1". When an external node is reached,
the character it stores is appended to the decoded string, and decoding
of the next character restarts at the root of the tree.
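The decoding walk can be sketched as follows. The three-character tree built in main (E = "0", D = "10", L = "11") is a toy example chosen for brevity, not the tree from the text:

```java
public class HuffmanDecode {
    static class Node {
        char ch; Node left, right;
        Node(char ch) { this.ch = ch; }               // external node
        Node(Node l, Node r) { left = l; right = r; } // internal node
    }

    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char b : bits.toCharArray()) {
            cur = (b == '0') ? cur.left : cur.right;  // 0 = left, 1 = right
            if (cur.left == null && cur.right == null) { // external node reached
                out.append(cur.ch);
                cur = root;                           // restart at the root
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Toy prefix code: E = "0", D = "10", L = "11"
        Node root = new Node(new Node('E'), new Node(new Node('D'), new Node('L')));
        System.out.println(decode(root, "100010")); // 10,0,0,10 = D,E,E,D
    }
}
```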
Encoding
Create a lookup table storing the binary code corresponding
to the path to each letter. This can be done via a preorder traversal.
If the Huffman tree is used to encode ASCII text, a 128-element array can
be used for this purpose. For example, the following code fragment
stores the string of bits used to encode the character 'C':

String[] encoder = new String[128];
encoder['C'] = "1110";





character | frequency | code
C         | 32        | 1110
D         | 42        | 110
E         | 120       | 0
F         | 24        | 11111
K         | 7         | 111101
L         | 42        | 101
U         | 37        | 100
Z         | 2         | 111100
Encode the string DEED: D → 110, E → 0, E → 0, D → 110,
giving 11000110 (8 bits, versus 32 bits in a fixed 8-bit encoding).
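Encoding with the lookup table is then a simple concatenation. The entries below are taken from the code table above:

```java
public class HuffmanEncode {
    public static void main(String[] args) {
        // Lookup table indexed by character, filled from the code table
        String[] encoder = new String[128];
        encoder['C'] = "1110";  encoder['D'] = "110";    encoder['E'] = "0";
        encoder['F'] = "11111"; encoder['K'] = "111101"; encoder['L'] = "101";
        encoder['U'] = "100";   encoder['Z'] = "111100";

        StringBuilder bits = new StringBuilder();
        for (char c : "DEED".toCharArray())
            bits.append(encoder[c]);                  // look up each character's code
        System.out.println(bits);                     // 110 + 0 + 0 + 110
    }
}
```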
Analysis
Define f_i = frequency of character c_i, i = 1, …, n,
and l_i = path length = length of the code for c_i, i.e., number of bits.
Weighted path length = Σ_{i=1}^{n} f_i · l_i.
It can be shown that a Huffman code is the prefix code that
minimizes the weighted path length.
Average expected cost per character = (Σ f_i · l_i) / (Σ f_i).
Example from the code table above:
expected cost per character = 785 / 306 ≈ 2.57 bits.
A fixed-length encoding of 8 characters would require
3 bits per character.
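The expected-cost figure can be checked numerically. The frequencies and code lengths below are those of the code table; the class name is illustrative:

```java
public class HuffmanCost {
    public static void main(String[] args) {
        int[] f = {32, 42, 120, 24, 7, 42, 37, 2};   // frequencies of C,D,E,F,K,L,U,Z
        int[] l = {4, 3, 1, 5, 6, 3, 3, 6};          // code lengths from the table
        int weighted = 0, total = 0;
        for (int i = 0; i < f.length; i++) {
            weighted += f[i] * l[i];                 // weighted path length
            total += f[i];                           // total character count
        }
        System.out.printf("%d / %d = %.2f%n", weighted, total,
                          (double) weighted / total);
    }
}
```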
 The Dictionary ADT
 Implementation with Sorted and Unsorted Sequences
 Binary Search
Adapted from the Goodrich and Tamassia lecture on Searching
The Dictionary ADT
 A dictionary is an abstract model of a database.
 Like a priority queue, a dictionary supports key-element pairs.
 The main operation supported by a dictionary is searching by key.
 Methods inherited from the SimpleContainer ADT
 size()
 isEmpty()
 Query methods:
 Object findElement(Object k)
 Enumeration findAllElements(Object k)
 Update methods:
 void insertItem(Object k, Object e)
 Object remove(Object k)
 Enumeration removeAll(Object k)
 Special sentinel object:
 NO_SUCH_KEY, returned by an unsuccessful search.
Example Operations on a Dictionary
Operation          | Output      | Unordered Dictionary
insertItem(5,A)    | -           | {(5,A)}
insertItem(7,B)    | -           | {(5,A),(7,B)}
insertItem(2,C)    | -           | {(5,A),(7,B),(2,C)}
insertItem(8,D)    | -           | {(5,A),(7,B),(2,C),(8,D)}
findElement(7)     | B           | {(5,A),(7,B),(2,C),(8,D)}
findElement(4)     | NO_SUCH_KEY | {(5,A),(7,B),(2,C),(8,D)}
insertItem(2,E)    | -           | {(5,A),(7,B),(2,C),(8,D),(2,E)}
findAllElements(2) | C,E         | {(5,A),(7,B),(2,C),(8,D),(2,E)}
removeAll(2)       | C,E         | {(5,A),(7,B),(8,D)}
Implementing a Dictionary with a Sequence
 Unordered sequence:
 Searching takes O(n) time.
 Inserting takes O(1) time with a doubly linked list, O(n) time with an array.
 Ordered sequence:
 Searching takes O(log n) time with an array, O(n) time with a linked list.
 Inserting takes O(n) time.
 In the ordered sequence implementation, we can search faster if the sequence is array-based.
Method        | Unordered (Array) | Unordered (Doubly Linked List) | Ordered (Array) | Ordered (Doubly Linked List)
size, isEmpty | O(1)              | O(1)                           | O(1)            | O(1)
findElement   | O(n)              | O(n)                           | O(log n)        | O(n)
insertItem    | O(n)              | O(1)                           | O(n)            | O(n)
remove        | O(n)              | O(n)                           | O(n)            | O(n)
removeAll     | O(n)              | O(n)                           | O(n)            | O(n)
space         | O(n)              | O(n)                           | O(n)            | O(n)
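The unordered-sequence implementation can be sketched as below. The class and helper names follow the ADT's method names, but this is a minimal illustration (using an ArrayList as the sequence), not the textbook's implementation:

```java
import java.util.ArrayList;

public class SeqDictionary {
    static final Object NO_SUCH_KEY = "NO_SUCH_KEY"; // sentinel for failed searches

    // A key-element pair, as in the ADT
    static class Item {
        int key; Object elem;
        Item(int k, Object e) { key = k; elem = e; }
    }

    private final ArrayList<Item> items = new ArrayList<>();

    public void insertItem(int k, Object e) {        // append: O(1) amortized
        items.add(new Item(k, e));
    }

    public Object findElement(int k) {               // linear scan: O(n)
        for (Item it : items)
            if (it.key == k) return it.elem;
        return NO_SUCH_KEY;
    }

    public static void main(String[] args) {
        SeqDictionary d = new SeqDictionary();
        d.insertItem(5, "A"); d.insertItem(7, "B"); d.insertItem(2, "C");
        System.out.println(d.findElement(7));        // successful search
        System.out.println(d.findElement(4));        // unsuccessful search
    }
}
```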
 Narrow down the search range in stages.
 Instead of scanning the sequence we do bisections.
 All candidate items have key(low) ≤ k ≤ key(high).
 Example: findElement(22)
Pseudocode for Binary Search

Algorithm BinarySearch(S, k, low, high):
   if low > high then
      return NO_SUCH_KEY
   else
      mid ← ⌊(low + high) / 2⌋
      if k = key(mid) then
         return elem(mid)
      else if k < key(mid) then
         return BinarySearch(S, k, low, mid - 1)
      else
         return BinarySearch(S, k, mid + 1, high)
 The Dictionary ADT method findElement(k) on a sorted array can be implemented by calling BinarySearch(S, k, 0, S.size() - 1).
 A constant number of primitive operations is involved at each recursive level. Thus the running time is proportional to the number of recursive calls.
 The size of the candidate list is cut by at least half at each step (i.e., at each recursive level).
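The pseudocode translates directly to Java. This sketch searches a sorted int array of keys only (no elements), so a successful search returns the index and an unsuccessful one returns a sentinel:

```java
public class BinarySearchDemo {
    static final int NO_SUCH_KEY = -1;               // sentinel for an unsuccessful search

    static int binarySearch(int[] s, int k, int low, int high) {
        if (low > high) return NO_SUCH_KEY;
        int mid = (low + high) / 2;                  // floor((low + high) / 2)
        if (k == s[mid]) return mid;                 // found: return the index
        else if (k < s[mid]) return binarySearch(s, k, low, mid - 1);
        else return binarySearch(s, k, mid + 1, high);
    }

    public static void main(String[] args) {
        int[] s = {2, 5, 7, 8, 14, 22, 31};          // sorted keys
        System.out.println(binarySearch(s, 22, 0, s.length - 1)); // found at index 5
        System.out.println(binarySearch(s, 4, 0, s.length - 1));  // not found
    }
}
```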
The recursive calls stop when the size of the candidate
list is ≤ 1.
The number of recursive calls is m such that n / 2^m ≤ 1,
i.e., m ≥ log_2 n.
Therefore the cost of binary search is O(log n).