Tuesday 2 July 2013

Sets, Hashing

What Is a Set?
  • A set is a data structure modeled after the mathematical notion of a set. The fundamental set operations are union, intersection, and subtraction.
  • A brief aside on mathematical set notation: elements of a set may not be repeated.
  • The operations of a Set ADT do not require a total order relationship among its elements. One of the simplest and most versatile ways to implement a set of elements, however, is via an ordered sequence. The elements of a set will almost always lend themselves to a total order.

Set ADT
  • Methods of the Set ADT (invoked on Set A):
    • size():
    • Return the number of elements in set A.
      Input: none; Output: int
    • isEmpty():
            • Return true if the set is empty, false if not.
              Input: none; Output: boolean
    • insertElement(e):
            • Insert the element e into the set A, unless e is already in A.
              Input: Object; Output: none
    • elements():
            • Return an iterator of the elements in set A.
              Input: None; Output: Iterator
    • isMember(e):
            • Return true if e is in set A, false if not.
              Input: Object; Output: Boolean

The Set ADT (Cont.)
    • union(B):
            • Return A ∪ B. Nondestructive.
              Input: Set; Output: Set.
    • intersect(B):
            • Return A ∩ B. Nondestructive.
              Input: Set; Output: Set
    • subtract(B):
            • Return A − B. Nondestructive.
              Input: Set; Output: Set
    • isEqual(B):
            • Return true iff set A = set B.
              Input: Set; Output: boolean.
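
Collected into one declaration, the ADT above might be sketched as the following Java interface (a sketch in the lecture's pre-generics, Object-based style; the interface name Set is an assumption, not code from the lecture):

import java.util.Iterator;

public interface Set
{
    int size();                   // number of elements in the set
    boolean isEmpty();            // true if the set is empty
    void insertElement(Object e); // insert e, unless e is already in the set
    Iterator elements();          // iterator over the elements
    boolean isMember(Object e);   // true if e is in the set
    Set union(Set B);             // A union B, nondestructive
    Set intersect(Set B);         // A intersect B, nondestructive
    Set subtract(Set B);          // A minus B, nondestructive
    boolean isEqual(Set B);       // true iff A = B
}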



Template Method Pattern
  • We can implement set operations such as union, intersection, and subtraction by specializing the generic merge algorithm.
  • Recall that in the template method pattern, an abstract class defines a principal method that calls a number of auxiliary methods. In the abstract class, the auxiliary methods are abstract (i.e., empty). The concrete subclasses override the auxiliary methods but not the principal method.
  • In this application of the template method pattern, the abstract class is Merger and the concrete subclasses are UnionMerger, IntersectMerger, and SubtractMerger.
  • The methods union, intersect, and subtract of the Set ADT can be implemented via the corresponding concrete subclass of Merger.
  • For example, the union(B) method would instantiate an object of class UnionMerger, call merge(this, B, C, comp) on that object, and return C.
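
A sketch of such a union(B) method (assuming a hypothetical SortedSequenceSet class with a sorted Sequence field named sequence and a Comparator field named comp; Sequence and NodeSequence refer to the sequence ADT assumed by the Merger code below):

    public Set union(SortedSequenceSet B)
    {
        Sequence C = new NodeSequence();  // result sequence, initially empty
        Merger m = new UnionMerger();     // concrete subclass selects the behavior
        m.merge(this.sequence, B.sequence, C, comp); // the principal (template) method
        return new SortedSequenceSet(C);  // nondestructive: A and B are untouched
    }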

Abstract Class Merger
public abstract class Merger
{
    // abstract methods to be overridden
    protected abstract void firstIsLess(Object a, Sequence C);
    protected abstract void bothAreEqual(Object a, Object b, Sequence C);
    protected abstract void firstIsGreater(Object a, Sequence C);
    private Object a, b; // store current elements for comparison
    public void merge(Sequence A, Sequence B, Sequence C, Comparator comp)
    {
        // use iterators to meet the non-destructive criterion
        Iterator iterA = A.elements(), iterB = B.elements();
        boolean aExists = advanceA(iterA),
                bExists = advanceB(iterB);
        while (aExists && bExists)
        {
            if(comp.isLessThan(a,b))
            {
                firstIsLess(a,C);
                aExists = advanceA(iterA);
            }
            else if (comp.isEqualTo(a,b))
            {
                bothAreEqual(a,b,C);
                aExists = advanceA(iterA);
                bExists = advanceB(iterB);
            }
            else
            {
                firstIsGreater(b,C);
                bExists = advanceB(iterB);
            }
        }
        while(aExists)
        {
            firstIsLess(a,C);
            aExists = advanceA(iterA);
        }
        while(bExists)
        {
            firstIsGreater(b,C);
            bExists = advanceB(iterB);
        }
    }

Abstract Class Merger (cont.)
    private boolean advanceA(Iterator iterA)
    {
        if(iterA.hasNext())
        {
            a = iterA.next();
            return true;
        }
        else
        {
            a = null; return false;
        }
    }
    private boolean advanceB(Iterator iterB)
    {
        if(iterB.hasNext())
        {
            b = iterB.next();
            return true;
        }
        else
        {
            b = null; return false;
        }
    }
}

Class UnionMerger
public class UnionMerger extends Merger
{
    protected void firstIsLess(Object a, Sequence C)
    {
        C.insertLast(a);
    }
    protected void bothAreEqual(Object a, Object b, Sequence C)
    {
        C.insertLast(a);
    }
    protected void firstIsGreater(Object b, Sequence C)
    {
        C.insertLast(b);
    }
}

Class IntersectMerger
public class IntersectMerger extends Merger
{
    protected void firstIsLess(Object a, Sequence C)
    {} // empty method
    protected void bothAreEqual(Object a, Object b, Sequence C)
    {
        C.insertLast(a);
    }
    protected void firstIsGreater(Object b, Sequence C)
    {} // empty method
}

Class SubtractMerger
public class SubtractMerger extends Merger
{
    protected void firstIsLess(Object a, Sequence C)
    {
        C.insertLast(a);
    }
    protected void bothAreEqual(Object a, Object b, Sequence C)
    {}   // empty method
    protected void firstIsGreater(Object b, Sequence C)
    {}    // empty method
}

Example of Set Operations
A={1,3,4,6,8}
B={2,3,4,5,6}

The current elements a and b, and the resultant set C for each operation:

a    b      union                intersection    subtraction
1    2      {1}                  {}              {1}
3    2      {1,2}                {}              {1}
3    3      {1,2,3}              {3}             {1}
4    4      {1,2,3,4}            {3,4}           {1}
6    5      {1,2,3,4,5}          {3,4}           {1}
6    6      {1,2,3,4,5,6}        {3,4,6}         {1}
8    N/A    {1,2,3,4,5,6,8}      {3,4,6}         {1,8}
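
The whole trace can be reproduced by recasting the generic merge with java.util types (a sketch only: List stands in for the lecture's Sequence, and a single functional interface stands in for the three abstract methods of Merger):

import java.util.*;

public class MergeDemo
{
    // Plays the role of firstIsLess / bothAreEqual / firstIsGreater.
    interface Handler { void apply(Integer x, List<Integer> C); }

    // Generic merge of two sorted lists, as in the Merger template method.
    static List<Integer> merge(List<Integer> A, List<Integer> B,
                               Handler less, Handler equal, Handler greater)
    {
        List<Integer> C = new ArrayList<>();
        int i = 0, j = 0;
        while (i < A.size() && j < B.size()) {
            int cmp = A.get(i).compareTo(B.get(j));
            if (cmp < 0)       less.apply(A.get(i++), C);
            else if (cmp == 0) { equal.apply(A.get(i++), C); j++; }
            else               greater.apply(B.get(j++), C);
        }
        while (i < A.size()) less.apply(A.get(i++), C);    // drain the rest of A
        while (j < B.size()) greater.apply(B.get(j++), C); // drain the rest of B
        return C;
    }

    public static void main(String[] args)
    {
        List<Integer> A = Arrays.asList(1, 3, 4, 6, 8);
        List<Integer> B = Arrays.asList(2, 3, 4, 5, 6);
        Handler keep = (x, C) -> C.add(x);
        Handler skip = (x, C) -> {};
        System.out.println(merge(A, B, keep, keep, keep)); // union: [1, 2, 3, 4, 5, 6, 8]
        System.out.println(merge(A, B, skip, keep, skip)); // intersection: [3, 4, 6]
        System.out.println(merge(A, B, keep, skip, skip)); // subtraction: [1, 8]
    }
}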

Complexity of Set Methods

methods                         time
size, isEmpty                   O(1)
insertElement                   O(n)
elements, isMember              O(n)
union, intersect, subtract      O(n)
isEqual                         O(n)


Lecture 12B
Adapted from the Goodrich and Tamassia lecture on Radix-Sort

Radix Sort
  • Unlike comparison-based sorting algorithms, radix-sort considers the structure of the keys themselves.
  • Assume that keys are represented in a base-N numbering system, where N is known as the radix. For example, if N = 2, the keys are represented by binary numbers.
  • Let b be the number of digits in a key; for the primitive int type, b = 32.
  • Sorting is done by looking at bits in the same position. No comparisons between keys are needed.
  • This idea can be extended to decimal integers with a fixed number of digits or to keys that are ASCII or Unicode strings.

Radix Exchange Sort
Examine the bits from left to right:
  • Sort the array with respect to the leftmost bit.
  • Partition the array into a top subarray (leftmost bit 0) and a bottom subarray (leftmost bit 1).
  • Recurse:
    • recursively sort the top subarray, ignoring the leftmost bit;
    • recursively sort the bottom subarray, ignoring the leftmost bit.
Time: O(bn)
Radix Exchange Sort (cont.)
How do we do the sort on a particular bit position? Same idea as the partition step in quick-sort:
    repeat
        scan top-down to find the next key starting with 1
        scan bottom-up to find the next key starting with 0
        exchange the keys
    until the scan indices cross
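
Putting the partition step and the recursion together, a sketch in Java (assuming non-negative int keys, so bit 30 is the leftmost usable bit; this follows the outline above rather than any code from the lecture):

public class RadixExchange
{
    public static void sort(int[] a) { sort(a, 0, a.length - 1, 30); }

    private static void sort(int[] a, int lo, int hi, int bit)
    {
        if (lo >= hi || bit < 0) return;
        int i = lo, j = hi;
        while (i <= j) {
            while (i <= j && ((a[i] >> bit) & 1) == 0) i++; // scan top-down for a 1
            while (i <= j && ((a[j] >> bit) & 1) == 1) j--; // scan bottom-up for a 0
            if (i < j) { int t = a[i]; a[i] = a[j]; a[j] = t; } // exchange keys
        }
        sort(a, lo, j, bit - 1); // top subarray: keys whose current bit is 0
        sort(a, i, hi, bit - 1); // bottom subarray: keys whose current bit is 1
    }
}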

Radix Exchange Sort (cont.)

[Figure: the array before the sort, after the sort on the leftmost bit, and after the recursive sort on the second-from-leftmost bit]

Radix-Exchange Sort vs. Quick-Sort
  • Similarities:
    • Both partition the array
    • Both recursively sort on sub-arrays.
  • Differences:
    • Method of partitioning
      • Radix-exchange divides the array based on whether a key is greater than or less than 2^(i−1), i.e., on the value of a single bit. No comparisons between keys are performed.
      • Quick-sort partitions based on greater than or less than some element of the array. Requires comparisons.
    • Time complexity
      • Radix-exchange: O(bn)
      • Quick-sort, average case: O(n log n)
      • Quick-sort, worst case: O(n²)

Straight Radix Sort
Examines bits from right to left
for k ← 0 to b−1 do
    sort the array in a stable way, looking only at bit k
[Figure: the array after each pass; note the order of the affected bits after sorting.]

What does it mean to "sort in a stable way" ??!!
In a stable sort, the initial relative order of equal keys is unchanged.
For example, observe the first step of the sort from the previous page (i.e., the sort on the rightmost bit):
Note that the relative order of those keys ending with 0 is unchanged, and the same is true for elements ending in 1.

The Algorithm is Correct (right?)
  • We show that any two keys are in the correct relative order at the end of the algorithm.
  • Given two keys, let k be the leftmost bit-position where they differ.
  • At step k, the two keys are put in the correct relative order.
  • Because of stability, successive steps do not change the relative order of the two keys.

Stability (cont.)
Consider a sort on an array with these two keys:

[Figure: two keys that agree on every bit to the left of bit k and differ at bit k]
It makes no difference what order the two keys are in when the sort begins. When the sort visits bit k, the keys are put in the correct relative order. Because the sort is stable, the order of the two keys will not be changed when bits greater than k are compared.


Decimal Numbers
  • Radix sorting can be applied to decimal numbers that have a fixed or maximum number of digits.
    [Figure: the array after each pass; note the order of the affected digits after sorting.]

Analysis of Straight Radix Sort
for k ← 0 to b−1 do: sort the array in a stable way, looking only at digit k.
  • Suppose that we can perform the stable sort on a single digit in O(n) time. Then the total time complexity would be O(bn).
  • Indeed, we can perform a stable sort on the key’s kth digit in O(n) time.
  • The embedded algorithm is known as bucket sort.

    Bucket Sort
      • n numbers
      • Each number is in the range 0 to N−1 (here, the kth digit of each key, base N)
      • stable
      • time complexity: O(n + N)
  • As an example, consider the case where N = 3 and let our array be (kth digit shown only): [figure omitted]
  • (Note that there are two "0"s and two "1"s.)
    • First, we create N "buckets".


      Bucket Sort (cont.)
    • Each element of the array is put into one of the N "buckets": [figure omitted]
Bucket Sort (cont.)
    • Now, pull the elements from the buckets into the array.
At last, the sorted array (sorted in a stable way):
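
One such stable pass might be sketched in Java as follows (the class and helper names are assumptions; straight radix sort is then just a loop calling bucketSortOnDigit for k = 0 to b−1):

import java.util.*;

public class BucketPass
{
    // The kth digit of x, base N (k = 0 is the rightmost digit).
    static int digit(int x, int k, int N)
    {
        for (int i = 0; i < k; i++) x /= N;
        return x % N;
    }

    // One stable bucket-sort pass on digit k: O(n + N) time.
    static void bucketSortOnDigit(int[] a, int k, int N)
    {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int b = 0; b < N; b++) buckets.add(new LinkedList<>());
        for (int x : a)                          // scatter: appending in array order
            buckets.get(digit(x, k, N)).add(x);  // is what makes the pass stable
        int i = 0;
        for (List<Integer> bucket : buckets)     // gather: bucket 0 first
            for (int x : bucket) a[i++] = x;
    }
}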


Cost of Straight Radix Sort
O(b(n + N))
where N = number of buckets in the bucket sort.

representation    N
binary            2
decimal           10
hexadecimal       16
ASCII             128


Hash: what is it?
A review of familiar material?
A side order for your eggs?
A combination of the two?




Adapted from the Goodrich and Tamassia lecture on Hashing

Dictionary Problem
  • A phone company wants to provide caller ID capability:
    • Given a phone number, return the caller’s name.
    • Phone numbers are in the range 0 ≤ r ≤ R, where here R = 10⁷ − 1.
    • They want to do this as efficiently as possible.
  • A few suboptimal ways to design this dictionary:
    • an array indexed by key: O(1) time, but O(n + R) space – a huge amount of wasted space.
    • a linked list: Takes O(n) time, O(n) space
    • A balanced binary tree: O(log n) time, O(n) space.

A Better Solution
  • We can do better with a hashtable – O(1) expected time, O(n + N) space, where N is the size of the hash table.
  • We need a function to map the large range of keys into a smaller range of table indices.
    • e.g., h(K) = K mod N
  • Insert {402-3045, "CAJ (w)"} into a hashed array with, say, N = 5 slots.
    • 4023045 mod 5 = 0, so {402-3045, "CAJ (w)"} goes in slot 0 of the hash table.
  • A lookup uses the same process: hash the query key, then check the array at that slot.
  • Insert {428-7971, "CAJ (h)"}: 4287971 mod 5 = 1, so it goes in slot 1.

Collision Resolution
  • How do we deal with two keys that hash to the same index in the array?
  • One policy is to use separate chaining
    • Set up an array (i.e., the table) of links, indexed by hash values, to sequences of items whose keys hash to the same index.
    • Separate chaining is the most time-efficient collision-resolution policy; other policies are more space-efficient.
Lafore Separate Chaining Applet

Pseudo-code for Dictionary ADT Methods
  • Constructor
    • Create a table of N sequences.
  • insertItem(k, e)
      index ← h(k)
      insert the item (k, e) into the sequence at table[index]
  • findElement(k)
      index ← h(k)
      step through the sequence at table[index], looking for a match
      if the search was successful, return the found element
      else, return NO_SUCH_KEY
  • remove(k)
      index ← h(k)
      step through the sequence at table[index], looking for a match
      if the search was successful,
          remove the found item from the sequence
          return the found element
      else, return NO_SUCH_KEY
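
A compact Java sketch of this dictionary (java.util.LinkedList stands in for the lecture's Sequence ADT; the class name ChainedHashTable and the NO_SUCH_KEY sentinel object are assumptions):

import java.util.LinkedList;

public class ChainedHashTable
{
    public static final Object NO_SUCH_KEY = new Object(); // sentinel return value

    static class Item
    {
        int key; Object elem;
        Item(int k, Object e) { key = k; elem = e; }
    }

    private final LinkedList<Item>[] table;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int N)
    {
        table = new LinkedList[N];
        for (int i = 0; i < N; i++) table[i] = new LinkedList<>();
    }

    private int h(int k) { return Math.floorMod(k, table.length); }

    public void insertItem(int k, Object e) { table[h(k)].add(new Item(k, e)); }

    public Object findElement(int k)
    {
        for (Item it : table[h(k)])            // step through the sequence
            if (it.key == k) return it.elem;
        return NO_SUCH_KEY;
    }

    public Object remove(int k)
    {
        LinkedList<Item> seq = table[h(k)];
        for (Item it : seq)
            if (it.key == k) { seq.remove(it); return it.elem; }
        return NO_SUCH_KEY;
    }
}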

Hash Functions
  • A good hash function:
    • is quick to compute, and
    • distributes the keys uniformly throughout the table.
  • Dealing with non-integer keys:
    • find some way of converting the keys into integers.
      • If the key is a string, we could add the ASCII values of the characters.
    • Then use a standard hash function for integers.
  • A common choice for the hash function is
h(k) = k mod N.
  • Choosing N, the size of the table:
    • N = b^e (bad)
      • If N is a power of 2, for example, h(k) will give the e least-significant bits of k.
      • All keys with the same ending will go to the same place.
    • Prime number N (good)
      • helps ensure uniform distribution of keys.
    • Mid-square
      • h(k) = the middle digits of k².
      • Sometimes it is impractical to use a table whose size is a prime number.
      • This hash function can be used to reduce the probability of collisions when N = b^e.
      • e.g., table size is a power of 10 (the middle digits of the square, set off by spaces, become the hash value):
          4150130² = 17223 5790 16900, so h(4150130) = 5790
          415013034² = 172235 818389 885156, so h(415013034) = 818389
          1150130² = 13227 9901 6900, so h(1150130) = 9901
      • e.g., table size is a power of 2 (middle bits set off by spaces):
          1001² = 10 100 01, so h(1001) = 100
          1011² = 11 110 01, so h(1011) = 110
          1101² = 101 010 01, so h(1101) = 010
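
Two of these ideas as small Java methods (sketches only; for mid-square, the choice to take the middle e bits of the 64-bit square is an illustrative assumption):

public class HashFns
{
    // Convert a string key to an integer by summing its character codes,
    // then compress with mod N (N ideally prime).
    static int hashString(String s, int N)
    {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) sum += s.charAt(i);
        return sum % N;
    }

    // Mid-square for N = 2^e: square the key and keep e bits from the
    // middle of the 64-bit square.
    static int midSquare(int k, int e)
    {
        long sq = (long) k * k;                       // k squared, without overflow
        int shift = (64 - e) / 2;                     // drop low bits so the middle remains
        return (int) ((sq >>> shift) & ((1L << e) - 1));
    }
}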

Open Addressing Collision Resolution
  • Open addressing uses only an array. The array slots store the items directly.
  • Uses less memory than separate chaining.
    • Don’t need to store links, sentinels, etc.
    • Fewer empty array slots. Clusters spread through the array.
  • To look up a key, start at the hashed index, then probe forward through the table until either the key is found or an empty slot is found.
  • Slower than separate chaining. The lookup may have to step through many table indices before the key is found.
  • Removing an item poses difficulties. Either mark the deleted slot with a special "deleted" marker (a search must treat it as occupied, an insert as free) or fill in the slot by shifting some elements downward.
  • Specific open-addressing resolution policies include:
    • Linear probing
    • Quadratic probing
    • Double hashing

Linear Probing
  • Linear probing is the simplest of the open addressing policies.
  • If the current slot is already being used, just try the next slot.
Algorithm linearProbingInsert(k, e)
Input: Key k, element e
    if (table is full)
        error
    probe ← h(k)
    while table[probe] is occupied
        probe ← (probe + 1) mod N
    table[probe] ← (k, e)
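
The same algorithm in Java might be sketched as follows (keys are assumed to be non-negative ints, and a boolean occupied[] array marks used slots; a sketch, not the lecture's code):

public class LinearProbing
{
    private final int[] keys;
    private final Object[] elems;
    private final boolean[] occupied;
    private int n; // number of stored items

    public LinearProbing(int N)
    {
        keys = new int[N]; elems = new Object[N]; occupied = new boolean[N];
    }

    public void insert(int k, Object e)
    {
        if (n == keys.length) throw new IllegalStateException("table is full");
        int probe = k % keys.length;        // probe <- h(k)
        while (occupied[probe])             // slot taken: just try the next slot
            probe = (probe + 1) % keys.length;
        keys[probe] = k; elems[probe] = e; occupied[probe] = true; n++;
    }
}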

Linear Probing Example
  • Table size N = 13.
  • h(k) = k mod 13
  • Insert the keys
18  41  22  44  59  32  31  73

After all the insertions (44 collides with 18 at slot 5 and probes to 6; 32 probes from 6 to 8; 31 probes from 5 to 10; 73 probes from 8 to 11):

index:   0    1    2    3    4    5    6    7    8    9   10   11   12
key:              41              18   44   59   32   22   31   73

  Lafore Linear Probing Applet


  Double Hashing
  • Uses two hash functions, h1(k) and h2(k).
    • Typically h1(k) = k mod N.
    • Typically, h2(k) = q – (k mod q),
where q is a prime number and q < N.
  • If N is prime, all slots in the table will eventually be examined.
  • Many of the same (dis)advantages as linear probing. Space-efficient but slow compared with separate chaining.
  • Distributes keys more uniformly than linear probing.
Algorithm doubleHashInsert(k,e)
Input: key k, element e
    if (table is full)
        error
    probe ← h1(k)
    offset ← h2(k)
    while table[probe] is occupied
        probe ← (probe + offset) mod N
    table[probe] ← (k, e)
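
Only the probe step changes from linear probing; below is a method that could be added to the LinearProbing sketch above (q is passed in; q = 7 in the example that follows):

    // Double-hashing insert: like linear probing, except that the step
    // size comes from h2(k) = q - (k mod q), which is never zero.
    public void insertDoubleHash(int k, Object e, int q)
    {
        if (n == keys.length) throw new IllegalStateException("table is full");
        int probe  = k % keys.length;       // probe  <- h1(k)
        int offset = q - (k % q);           // offset <- h2(k)
        while (occupied[probe])
            probe = (probe + offset) % keys.length;
        keys[probe] = k; elems[probe] = e; occupied[probe] = true; n++;
    }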

Double Hashing Example
  • Table size N = 13
  • h1(k) = k mod 13
  • h2(k) = 7 – k mod 7
  • Keys to be inserted:
18  41  22  44  59  32  31  73

After all the insertions (44 collides at slot 5 and jumps by h2(44) = 5 to slot 10; 31 collides at 5, jumps by h2(31) = 4 to slot 9, which is occupied, then to (9 + 4) mod 13 = 0):

index:   0    1    2    3    4    5    6    7    8    9   10   11   12
key:    31        41              18   32   59   73   22   44



Lafore Double/Quadratic Probing Applet

Theoretical Results
  • The load factor α is the average number of keys per array index: α = n/N.
  • The analysis is probabilistic rather than worst-case.
Expected number of probes in a search (the classical estimates, as functions of α):

                        not found                  found
chaining                1 + α                      1 + α/2
linear probing          (1/2)(1 + 1/(1 − α)²)      (1/2)(1 + 1/(1 − α))
double hashing          1/(1 − α)                  (1/α) ln(1/(1 − α))


[Figure: expected number of probes in a search vs. load factor α]
