Real Computer Science begins where we almost stop reading ...: Huffman Coding & Dictonaries

Tuesday, 2 July 2013

Huffman Coding & Dictonaries

Motivation: Compression and data coding

Data compression is a highly important topic in computer science and information theory. Many situations require the use of coding schemes for data compression, including:

Storage of a large file on disk.
Storage of a large array or other dataset in memory.
Transmission of a large piece of information over a bandwidth-limited channel such as a phone line.

Many coding schemes are in common use in today's computer systems:

JPEG images
MPEG movies
data compression algorithms in a modem

Data compression is an example of the space-time tradeoff. In order to save storage costs, we expend additional CPU time in running the compression algorithm.

In this lecture we limit our focus to Huffman coding, a variable-length prefix encoding algorithm for compression of character streams. Although more sophisticated algorithms are now available and more widely used, Huffman's algorithm has many desirable qualities and is an interesting application of the use of binary trees.

Bits and bytes

Digital computers store data in binary or base-2 format. A binary digit may have one of two values:

Value	Alternate representations
0	off, false, 0 volts (TTL)
1	on, true, 5 volts (TTL)

A bit is a binary digit and is the atomic unit in a digital computer.

Most modern computers do not store 1-bit numbers, however. For example, a machine with a 32-bit operating system stores addresses as 32-bit words.

A byte is an 8-bit number and is typically the smallest word size on a computer. An 8-bit binary number is a base-2 representation of 8 digits. For example,

Longer words (such as 16-bit, 32-bit, and 64-bit words) are constructed from 8-bit bytes. Other commonly used numbering systems used in computer science are octal (base 8) and hexadecimal (base 16).

Unicode and ASCII

Java stores characters as 16-bit unsigned integers according to the Unicode character-set standard. The Unicode set stores 2¹⁶=65,536 characters and is an international standard that includes Asian character sets.

All of the letters in the English language, as well as numbers, punctuation, and control characters are represented in the lowest 7 bits of the Unicode set, with the upper 9 bits set to zero. This subset of the Unicode set is known as the ASCII standard and includes 2⁷=128 characters. All of the characters encountered in this course will be part of the ASCII standard (see next page).

The following fragment of Java code

char[] charVals = {'m','n','o'};
for(int i = 0; i < charVals.length; i++)
{
    System.out.println("Character '" +charVals[i]
        + "' is stored as " + +(int)charVals[i]
        + " in base-10.");
}

yields the output

Character 'm' is stored as 109 in base-10.
Character 'n' is stored as 110 in base-10.
Character 'o' is stored as 111 in base-10.

ASCII Table

+---------------------------------------------------------------+
|  0 NUL|  1 SOH|  2 STX|  3 ETX|  4 EOT|  5 ENQ|  6 ACK|  7 BEL|
|  8 BS |  9 HT | 10 NL | 11 VT | 12 NP | 13 CR | 14 SO | 15 SI |
| 16 DLE| 17 DC1| 18 DC2| 19 DC3| 20 DC4| 21 NAK| 22 SYN| 23 ETB|
| 24 CAN| 25 EM | 26 SUB| 27 ESC| 28 FS | 29 GS | 30 RS | 31 US |
| 32 SP | 33  ! | 34  " | 35  # | 36  $ | 37  % | 38  & | 39  ' |
| 40  ( | 41  ) | 42  * | 43  + | 44  , | 45  - | 46  . | 47  / |
| 48  0 | 49  1 | 50  2 | 51  3 | 52  4 | 53  5 | 54  6 | 55  7 |
| 56  8 | 57  9 | 58  : | 59  ; | 60  < | 61  = | 62    | 63  ? |
| 64  @ | 65  A | 66  B | 67  C | 68  D | 69  E | 70  F | 71  G |
| 72  H | 73  I | 74  J | 75  K | 76  L | 77  M | 78  N | 79  O |
| 80  P | 81  Q | 82  R | 83  S | 84  T | 85  U | 86  V | 87  W |
| 88  X | 89  Y | 90  Z | 91  [ | 92  \ | 93  ] | 94  ^ | 95  _ |
| 96  ` | 97  a | 98  b | 99  c |100  d |101  e |102  f |103  g |
|104  h |105  i |106  j |107  k |108  l |109  m |110  n |111  o |
|112  p |113  q |114  r |115  s |116  t |117  u |118  v |119  w |
|120  x |121  y |122  z |123  { |124  | |125  } |126  ~ |127 DEL|
+---------------------------------------------------------------+

Variable-length encoding

Unicode and ASCII are examples of fixed-length encoding schemes, since all characters require the same amount of storage (16 bits and 8 bits, respectively).

In contrast, Huffman coding is a variable-length encoding scheme. The number of bits required to store a coded character varies according to the relative frequency or weight of that character. Although little if any space is saved on infrequent characters, frequently used characters require very little space (say one, two, or three bits).

The encoding process is based on a Huffman coding tree, which is built from the observed frequencies of characters in the document. The document is scanned, and the number of occurrences of each character in the document is recorded. Next, a binary tree is built in which the external nodes store the characters and their frequency in the document. When the binary tree is built from the document being coded, Huffman's algorithm optimizes the length of the encoded document.

Pre-scanning a document and generating a customized Huffman coding tree may be impractical, however. Often, typical frequencies of characters in a given context are used instead of the specific frequencies in a particular document.

Building a Huffman coding tree

Suppose that the following character frequencies are observed in a string that we wish to encode:

character	C	D	E	F	K	L	U	Z
frequency	32	42	120	24	7	42	37	2

The first step is to construct a priority queue and insert the character-frequency (element-key) pairs into it. Assuming that we are using a sorted sequence-based priority queue, the above table could be schematically represented by the illustration below.

Step 1:

Building a Huffman coding tree (2)

In step 2, the two items with the lowest keys are removed from the priority queue. A new binary tree is created with the lowest-key item as the left external node and the second-lowest-key item as the right external node. The new tree is then inserted back into the priority queue

Step 2:

This process continues until only one node (tree) is left in the priority queue.

Step 3:

Building a Huffman coding tree (3)

Step 4:

Step 5:

Building a Huffman coding tree (4)

Final tree (after n=8 steps)

Building a Huffman coding tree (5)

Algorithm Huffman(X):
Input: String X of length n Output: Coding tree for X

    Compute the frequency f(c) of each character c in X.
    Initialize a priority queue Q.
    for each character c in X do
        Create a single-node tree T storing c.
        insert T into Q with key f(c).
    whileQ.size() > 1 do
        f₁ ¬ Q.minKey()
        T₁ ¬ Q.removeMinElement()
        f₂ ¬ Q.minKey()
        T₂ ¬ Q.removeMinElement()
        Create a new tree T with left subtree T₁ and right subtree T₂.
        Insert T into Q with key f₁ + f₂.
    return tree Q.removeMinElement()

Decoding

A Huffman code is a prefix code, meaning that no code is the prefix of another. To decode a bit stream from the leftmost bit, start at the root node of the tree and move to the left child if the bit is "0" or the right child if the bit is "1". When an external node is reached, the character it stores is sent to the decoded string. The next bit is then decoded from the root of the tree.

Decode 1011001110111101

Encoding

Create a lookup table storing the binary code corresponding to the path to each letter. This can be done via a preorder traversal. If the Huffman tree is used to encode ASCII text, a 128-element array can be used for this purpose. For example, the following code fragment can be used to store the string representing the set of bit-wise operations required to encode the character 'C':

String[] encoder = new String[128];
encoder['C'] = "1110";

Character	frequency	code	# bits
C	32	1110	4
D	42	110	3
E	120	0	1
F	24	11111	5
K	7	111101	6
L	42	101	3
U	37	100	3
Z	2	111100	6

Encode the string DEED:

Analysis

Define f_i = frequency of character c_i, i=1,…,n.

l_i = path length = length of code for c_i, i.e., number of bits.

Weighted path length

Can be shown that a Huffman code is the prefix code that minimizes the weighted path length.

Average expected cost per character =

Example from table on previous page:

expected cost per character =

» 2.57.

A fixed-length encoding on 8 characters would require 3 bits per character.

Dictionaries

The Dictionary ADT
Implementation with Sorted and Unsorted Sequences
Binary Search

Adapted from the Goodrich and Tamassia lecture on Searching

The Dictionary ADT

A dictionary is an abstract model of a database.

Like a priority queue, a dictionary supports key-element pairs.

The main operation supported by a dictionary is searching by key.

Methods inherited from the SimpleContainer ADT

size()
isEmpty()

Query methods:

Object findElement(Object k)
Enumeration findAllElements(Object k)

Update methods:

void insertItem(Object k, Object e)
Object remove(Object k)
Enumeration removeAll(Object k)

Special sentinel object:

NO_SUCH_KEY, returned by an unsuccessful search.

Interface SimpleDictionary

Example Operations on a Dictionary

Operation	Output	Unordered Dictionary
insertItem(5,A)		{(5,A)}
insertItem(7,B)		{(5,A),(7,B)}
insertItem(2,C)		{(5,A),(7,B),(2,C)}
insertItem(8,D)		{(5,A),(7,B),(2,C),(8,D)}
findElement(7)	B	{(5,A),(7,B),(2,C),(8,D)}
findElement(4)	NO_SUCH_KEY	{(5,A),(7,B),(2,C),(8,D)}
insertElement(2,E)		{(5,A),(7,B),(2,C),(8,D),(2,E)}
findAllElements(2)	C,E	{(5,A),(7,B),(2,C),(8,D),(2,E)}
removeAll(2)	C,E	{(5,A),(7,B),(8,D)}

Implementing a Dictionary with a Sequence

Unordered sequence:

Searching takes O(n) time.
Inserting takes O(1) time with a doubly linked-list, O(n) time with an array.

Ordered sequence

Searching takes O(log n) time with an array, O(n) time with a linked list.
Inserting takes O(n) time.

In the ordered sequence implementation, we can search faster if the sequence is array-based.

Cost of Sequence Dictionary Operations

Method	Unsorted Sequence		Sorted Sequence
	Array	Doubly Linked List	Array	Doubly Linked List
size, isEmpty	O(1)	O(1)	O(1)	O(1)
findElement	O(n)	O(n)	O(log n)	O(n)
findAllElements	Q(n)	Q(n)	O(log n + s)	O(n)
insertItem	O(1)	O(1)	O(n)	O(n)
remove	O(n)	O(n)	O(n)	O(n)
removeAll	Q(n)	Q(n)	O(n)	O(n)
-- space --	O(N)	O(n)	O(N)	O(n)

Abstract Class java.util.Dictionary

Binary Search

Narrow down the search range in stages.

Instead of scanning the sequence we do bisections.

All candidate items have key(low) £k£ key(high)

Example: findElement(22)

Pseudo-Code for Binary Search

Algorithm BinarySearch(S, k, low, high)     if low > high then         returnNO_SUCH_KEY     else
        mid ¬ë low + highû / 2
        ifk = key(mid) then             return k         else ifk < key(mid)then             return BinarySearch(S, k, low, mid-1)         else
            return BinarySearch(S, k, mid+1, high)

The Dictionary ADT method findElement(k) on a sorted array can be implemented by calling BinarySearch(S, k, 0, S.size()-1).

Analysis of Binary Search

A constant number of primitive operations is involved at earch recursive level. Thus the running time is proportional the number of recursive calls.
The size of the candidate list is cut by at least half at each step (i.e., at each recursive level).

comparison	size of candidate list
0	n
1	£ n/2
2	£ n/4
…	…
i	£ n/2ⁱ
…	…
log₂n	£ 1

The recursive calls stop when the size of the candidate list £ 1. The number of recursive calls is m such that

--> Cost of binary search = O(log n).

No comments:

Inspiring Quotes

An inspiring quote may be just what you need to turn your day around. Here are some of the most inspiring quotes ever spoken or written.

I hated every minute of training, but I said, “Don’t quit. Suffer now and live the rest of your life as a champion.”

–Muhammad Ali

“You can have anything you want if you are willing to give up the belief that you can’t have it.”
–Robert Anthony

“There is no man living that can not do more than he thinks he can.”

–Henry Ford

“The best way to predict the future is to create it.”

–Dr. Forrest C. Shaklee

“It’s not about time, it’s about choices. How are you spending your choices?”

–Beverly Adamo

“Success…seems to be connected with action. Successful people keep moving. They make mistakes, but they don’t quit.”
–Conrad Hilton

“Destiny is not a matter of chance; it’s a matter of choice.”

–Anonymous

“The future belongs to those who believe in the beauty of their dreams.”
–Eleanor Roosevelt

“The quality of a person’s life is in direct proportion to their commitment to excellence, regardless of their chosen field of endeavor.”
–Vince Lombardi

“It is never too late to be what you might have been.”
–George Eliot

“Do not let what you can not do; interfere with what you can do.”
–John Wooden

“One man with courage makes a majority.”
–Andrew Jackson

“Failure is the opportunity to begin again more intelligently.”
–Henry Ford

“Try not to become a man of success but rather try to become a man of value.”
–Albert Einstein

“The mind is its own place, and in itself can make a heaven of Hell, a hell of Heaven.”

–John Milton

"If u are student, working and preparing give a little extra effort after regular work. A small sacrifice of TV time, fun time, or facebook time can bring a lot of better things to life than you ever imagined."

-- Naam likhna jaroori nai samajhta.

Thank you for reading, be sure to pass this along!

Tuesday, 2 July 2013

Huffman Coding & Dictonaries

No comments:

Post a Comment