Tuesday, 11 March 2014

Variable length encoding & compression

Problem : Variable length encoding & compression
A.   Huffman and other coding schemes tend to devote more bits to the coding of
(A) symbols carrying the most information
(B) symbols carrying the least information
(C) symbols that are likely to be repeated consecutively
(D) symbols containing redundant information
(A) symbols carrying the most information, i.e., the symbols that are less likely to occur. This makes sense: to keep messages as short as possible, frequently occuring symbols should be encoded with fewer bits and infrequent symbols with more bits.
B.    Consider the following two Huffman decoding tress for a variable-length code involving 5 symbols: A, B, C, D and E.
Using Tree #1, decode the following encoded message: "01000111101".
To decode the message, start at the root of the tree and consume digits as you traverse down the tree, stopping when you reach a leaf node. Repeat until all the digits have been processed. Processing the encoded message from left-to-right:
"0" => A
"100" => B
"0" => A
"111" => E
"101" => C
C.    Suppose we were encoding messages that the following probabilities for each of the 5 symbols:
p(A) = 0.5
p(B) = p(C) = p(D) = p(E) = 0.125
Which of the two encodings above (Tree #1 or Tree #2) would yield the shortest encoded messages averaged over many messages?
Using Tree #1, the expected length of the encoding for one symbol is:
1*p(A) + 3*p(B) + 3*p(C) + 3*p(D) + 3*p(E) = 2.0
Using Tree #2, the expected length of the encoding for one symbol is:
2*p(A) + 2*p(B) + 2*p(C) + 3*p(D) + 3*p(E) = 2.25
So using the encoding represented by Tree #1 would yield shorter messages on the average.
D.   Using the probabilities for A, B, C, D and E given above, construct a variable-length binary decoding tree using a simple greedy algorithm as follows:
1.     Begin with the set S of symbols to be encoded as binary strings, together with the probability P(x) for each symbol x. The probabilities sum to 1, and measure the frequencies with which each symbol appears in the input stream. In the example from lecture, the initial set S contains the four symbols and associated probabilities in the above table.
2.     Repeat the following steps until there is only 1 symbol left in S:
A.   Choose the two members of S having lowest probabilities. Choose arbitrarily to resolve ties. In the example aove, D and E might be the first nodes chosen.
B.    Remove the selected symbols from S, and create a new node of the decoding tree whose children (sub-nodes) are the symbols you've removed. Label the left branch with a "0", and the right branch with a "1". In the first iteration of the example above, the bottom-most internal node (leading to D and E) would be created.
C.    Add to S a new symbol (e.g., "DE" in our example) that represents this new node. Assign this new symbol a probability equal to the sum of the probabilities of the two nodes it replaces.
S = {A/0.5 B/0.125 C/0.125 D/0.125 E/0.125}
arbitrarily choose D & E
encoding: "0" => D, "1" => E
S = {A/0.5 B/0.125 C/0.125 DE/0.25}
choose B & C
encoding: "0" => B, "1" => C
S = {A/0.5 BC/0.25 DE/0.25}
choose BC & DE
encoding: "00" => B, "01" => C, "10" => D, "11" => E
S = {A/0.5 BCDE/0.5}
choose A & BCDE
encoding: "0" => A, "100" => B, "101" => C, "110" => D, "111" => E
S = {ABCDE/1.0}
This Tree #1 shown in the diagram above. The choice of D & E as the first symbols to combine was arbitrary -- we could have chosen any two symbols from B, C, D and E. So there are many equally plausible encodings that might emerge from this algorithm, corresponding the interchanging B, C, D and E at the leaves of the tree.
E.    Huffman coding is used to compactly encode the species of fish tagged by a game warden. If 50% of the fish are bass and the rest are evenly distributed among 15 other species, how many bits would be used to encode the species of a bass?
1 bit, using the algorithm described above.
F.     Consider the sum of two six-sided dice. Even when the dice are "fair" the amount information conveyed by a single sum depends on what the sum is since some sums are more likely than others, as shown in the following figure:
What is the average number of bits of information provided by the sum of 2 dice? Suppose we want to transmit the sums resulting from rolling the dice 1000 times. How many bits should we expect that transmission to take?