Tuesday 2 July 2013

Sets, Hashing

What Is a Set?
  • A set is a data structure modeled after the mathematical notion of a set. The fundamental set operations are union, intersection, and subtraction.
  • A brief aside on mathematical set notation: elements of a set may not be repeated.
  • The operations of a Set ADT do not require a total order relationship among its elements. One of the simplest and most versatile ways to implement a set of elements, however, is via an ordered sequence. The elements of a set will almost always lend themselves to a total order.

Set ADT
  • Methods of the Set ADT (invoked on Set A):
    • size():
    • Return the number of elements in set A.
      Input: none; Output: int
    • isEmpty():
            • Return true if the set is empty, false if not.
              Input: none; Output: boolean
    • insertElement(e):
            • Insert the element e into the set A, unless e is already in A.
              Input: Object; Output: none
    • elements():
            • Return an iterator of the elements in set A.
              Input: None; Output: Iterator
    • isMember(e):
            • Return true if e is in set A, false if not.
              Input: Object; Output: Boolean

The Set ADT (Cont.)
    • union(B):
            • Return A ∪ B. Nondestructive.
              Input: Set; Output: Set.
    • intersect(B):
            • Return A ∩ B. Nondestructive.
              Input: Set; Output: Set
    • subtract(B):
            • Return A − B. Nondestructive.
              Input: Set; Output: Set
    • isEqual(B):
            • Return true iff set A = set B.
              Input: Set; Output: boolean.
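
Collected into one declaration, the ADT above might be sketched as the following Java interface (a sketch in the lecture's pre-generics, Object-based style; the interface name Set is an assumption, not code from the lecture):

import java.util.Iterator;

public interface Set
{
    int size();                   // number of elements in the set
    boolean isEmpty();            // true if the set is empty
    void insertElement(Object e); // insert e, unless e is already in the set
    Iterator elements();          // iterator over the elements
    boolean isMember(Object e);   // true if e is in the set
    Set union(Set B);             // A union B, nondestructive
    Set intersect(Set B);         // A intersect B, nondestructive
    Set subtract(Set B);          // A minus B, nondestructive
    boolean isEqual(Set B);       // true iff A = B
}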



Template Method Pattern
  • We can implement set operations such as union, intersection, and subtraction by specializing the generic merge algorithm.
  • Recall that in the template method pattern, an abstract class defines a principal method that calls a number of auxiliary methods. In the abstract class, the auxiliary methods are abstract (i.e., empty). The concrete subclasses override the auxiliary methods but not the principal method.
  • In this application of the template method pattern, the abstract class is Merger and the concrete subclasses are UnionMerger, IntersectMerger, and SubtractMerger.
  • The methods union, intersect, and subtract of the Set ADT can be implemented via the corresponding concrete subclass of Merger.
  • For example, the union(B) method would instantiate an object of class UnionMerger, call merge(this, B, C, comp) on that object, and return C.
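
A sketch of such a union(B) method (assuming a hypothetical SortedSequenceSet class with a sorted Sequence field named sequence and a Comparator field named comp; Sequence and NodeSequence refer to the sequence ADT assumed by the Merger code below):

    public Set union(SortedSequenceSet B)
    {
        Sequence C = new NodeSequence();  // result sequence, initially empty
        Merger m = new UnionMerger();     // concrete subclass selects the behavior
        m.merge(this.sequence, B.sequence, C, comp); // the principal (template) method
        return new SortedSequenceSet(C);  // nondestructive: A and B are untouched
    }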

Abstract Class Merger
public abstract class Merger
{
    // abstract methods to be overridden
    protected abstract void firstIsLess(Object a, Sequence C);
    protected abstract void bothAreEqual(Object a, Object b, Sequence C);
    protected abstract void firstIsGreater(Object a, Sequence C);
    private Object a, b; // store current elements for comparison
    public void merge(Sequence A, Sequence B, Sequence C, Comparator comp)
    {
        // use iterators to meet the non-destructive criterion
        Iterator iterA = A.elements(), iterB = B.elements();
        boolean aExists = advanceA(iterA),
                bExists = advanceB(iterB);
        while (aExists && bExists)
        {
            if(comp.isLessThan(a,b))
            {
                firstIsLess(a,C);
                aExists = advanceA(iterA);
            }
            else if (comp.isEqualTo(a,b))
            {
                bothAreEqual(a,b,C);
                aExists = advanceA(iterA);
                bExists = advanceB(iterB);
            }
            else
            {
                firstIsGreater(b,C);
                bExists = advanceB(iterB);
            }
        }
        while(aExists)
        {
            firstIsLess(a,C);
            aExists = advanceA(iterA);
        }
        while(bExists)
        {
            firstIsGreater(b,C);
            bExists = advanceB(iterB);
        }
    }

Abstract Class Merger (cont.)
    private boolean advanceA(Iterator iterA)
    {
        if(iterA.hasNext())
        {
            a = iterA.next();
            return true;
        }
        else
        {
            a = null; return false;
        }
    }
    private boolean advanceB(Iterator iterB)
    {
        if(iterB.hasNext())
        {
            b = iterB.next();
            return true;
        }
        else
        {
            b = null; return false;
        }
    }
}

Class UnionMerger
public class UnionMerger extends Merger
{
    protected void firstIsLess(Object a, Sequence C)
    {
        C.insertLast(a);
    }
    protected void bothAreEqual(Object a, Object b, Sequence C)
    {
        C.insertLast(a);
    }
    protected void firstIsGreater(Object b, Sequence C)
    {
        C.insertLast(b);
    }
}

Class IntersectMerger
public class IntersectMerger extends Merger
{
    protected void firstIsLess(Object a, Sequence C)
    {} // empty method
    protected void bothAreEqual(Object a, Object b, Sequence C)
    {
        C.insertLast(a);
    }
    protected void firstIsGreater(Object b, Sequence C)
    {} // empty method
}

Class SubtractMerger
public class SubtractMerger extends Merger
{
    protected void firstIsLess(Object a, Sequence C)
    {
        C.insertLast(a);
    }
    protected void bothAreEqual(Object a, Object b, Sequence C)
    {}   // empty method
    protected void firstIsGreater(Object b, Sequence C)
    {}    // empty method
}

Example of Set Operations
A={1,3,4,6,8}
B={2,3,4,5,6}

The current elements a and b, and the resultant set C for each operation:

a    b      union                intersection    subtraction
1    2      {1}                  {}              {1}
3    2      {1,2}                {}              {1}
3    3      {1,2,3}              {3}             {1}
4    4      {1,2,3,4}            {3,4}           {1}
6    5      {1,2,3,4,5}          {3,4}           {1}
6    6      {1,2,3,4,5,6}        {3,4,6}         {1}
8    N/A    {1,2,3,4,5,6,8}      {3,4,6}         {1,8}
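
The whole trace can be reproduced by recasting the generic merge with java.util types (a sketch only: List stands in for the lecture's Sequence, and a single functional interface stands in for the three abstract methods of Merger):

import java.util.*;

public class MergeDemo
{
    // Plays the role of firstIsLess / bothAreEqual / firstIsGreater.
    interface Handler { void apply(Integer x, List<Integer> C); }

    // Generic merge of two sorted lists, as in the Merger template method.
    static List<Integer> merge(List<Integer> A, List<Integer> B,
                               Handler less, Handler equal, Handler greater)
    {
        List<Integer> C = new ArrayList<>();
        int i = 0, j = 0;
        while (i < A.size() && j < B.size()) {
            int cmp = A.get(i).compareTo(B.get(j));
            if (cmp < 0)       less.apply(A.get(i++), C);
            else if (cmp == 0) { equal.apply(A.get(i++), C); j++; }
            else               greater.apply(B.get(j++), C);
        }
        while (i < A.size()) less.apply(A.get(i++), C);    // drain the rest of A
        while (j < B.size()) greater.apply(B.get(j++), C); // drain the rest of B
        return C;
    }

    public static void main(String[] args)
    {
        List<Integer> A = Arrays.asList(1, 3, 4, 6, 8);
        List<Integer> B = Arrays.asList(2, 3, 4, 5, 6);
        Handler keep = (x, C) -> C.add(x);
        Handler skip = (x, C) -> {};
        System.out.println(merge(A, B, keep, keep, keep)); // union: [1, 2, 3, 4, 5, 6, 8]
        System.out.println(merge(A, B, skip, keep, skip)); // intersection: [3, 4, 6]
        System.out.println(merge(A, B, keep, skip, skip)); // subtraction: [1, 8]
    }
}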

Complexity of Set Methods

methods                         time
size, isEmpty                   O(1)
insertElement                   O(n)
elements, isMember              O(n)
union, intersect, subtract      O(n)
isEqual                         O(n)


Lecture 12B
Adapted from the Goodrich and Tamassia lecture on Radix-Sort

Radix Sort
  • Unlike comparison-based sorting algorithms, radix-sort considers the structure of the keys themselves.
  • Assume that keys are represented in a base-N numbering system, where N is known as the radix. For example, if N = 2, the keys are represented by binary numbers.
  • Let b be the number of digits in a key; for the primitive int type, b = 32.
  • Sorting is done by looking at bits in the same position. No comparisons between keys are needed.
  • This idea can be extended to decimal integers with a fixed number of digits or to keys that are ASCII or Unicode strings.

Radix Exchange Sort
Examine the bits from left to right:
  • Sort the array with respect to the leftmost bit.
  • Partition the array into a top subarray (leftmost bit 0) and a bottom subarray (leftmost bit 1).
  • Recurse:
    • recursively sort the top subarray, ignoring the leftmost bit;
    • recursively sort the bottom subarray, ignoring the leftmost bit.
Time: O(bn)
Radix Exchange Sort (cont.)
How do we do the sort on a particular bit position? Same idea as the partition step in quick-sort:
    repeat
        scan top-down to find the next key starting with 1
        scan bottom-up to find the next key starting with 0
        exchange the keys
    until the scan indices cross
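
Putting the partition step and the recursion together, a sketch in Java (assuming non-negative int keys, so bit 30 is the leftmost usable bit; this follows the outline above rather than any code from the lecture):

public class RadixExchange
{
    public static void sort(int[] a) { sort(a, 0, a.length - 1, 30); }

    private static void sort(int[] a, int lo, int hi, int bit)
    {
        if (lo >= hi || bit < 0) return;
        int i = lo, j = hi;
        while (i <= j) {
            while (i <= j && ((a[i] >> bit) & 1) == 0) i++; // scan top-down for a 1
            while (i <= j && ((a[j] >> bit) & 1) == 1) j--; // scan bottom-up for a 0
            if (i < j) { int t = a[i]; a[i] = a[j]; a[j] = t; } // exchange keys
        }
        sort(a, lo, j, bit - 1); // top subarray: keys whose current bit is 0
        sort(a, i, hi, bit - 1); // bottom subarray: keys whose current bit is 1
    }
}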

Radix Exchange Sort (cont.)

[Figure: the array before the sort, after the sort on the leftmost bit, and after the recursive sort on the second-from-leftmost bit]

Radix-Exchange Sort vs. Quick-Sort
  • Similarities:
    • Both partition the array
    • Both recursively sort on sub-arrays.
  • Differences:
    • Method of partitioning
      • Radix-exchange divides the array based on whether a key is greater than or less than 2^(i−1), i.e., on the value of a single bit. No comparisons between keys are performed.
      • Quick-sort partitions based on greater than or less than some element of the array. Requires comparisons.
    • Time complexity
      • Radix-exchange: O(bn)
      • Quick-sort, average case: O(n log n)
      • Quick-sort, worst case: O(n²)

Straight Radix Sort
Examines bits from right to left
for k ← 0 to b−1 do
    sort the array in a stable way, looking only at bit k
[Figure: the array after each pass; note the order of the affected bits after sorting.]

What does it mean to "sort in a stable way" ??!!
In a stable sort, the initial relative order of equal keys is unchanged.
For example, observe the first step of the sort from the previous page (i.e., the sort on the rightmost bit):
Note that the relative order of those keys ending with 0 is unchanged, and the same is true for elements ending in 1.

The Algorithm is Correct (right?)
  • We show that any two keys are in the correct relative order at the end of the algorithm.
  • Given two keys, let k be the leftmost bit-position where they differ.
  • At step k, the two keys are put in the correct relative order.
  • Because of stability, successive steps do not change the relative order of the two keys.

Stability (cont.)
Consider a sort on an array with these two keys:

[Figure: two keys that agree on every bit to the left of bit k and differ at bit k]
It makes no difference what order the two keys are in when the sort begins. When the sort visits bit k, the keys are put in the correct relative order. Because the sort is stable, the order of the two keys will not be changed when bits greater than k are compared.


Decimal Numbers
  • Radix sorting can be applied to decimal numbers that have a fixed or maximum number of digits.
    [Figure: the array after each pass; note the order of the affected digits after sorting.]

Analysis of Straight Radix Sort
for k ← 0 to b−1 do: sort the array in a stable way, looking only at digit k.
  • Suppose that we can perform the stable sort on a single digit in O(n) time. Then the total time complexity would be O(bn).
  • Indeed, we can perform a stable sort on the key’s kth digit in O(n) time.
  • The embedded algorithm is known as bucket sort.

    Bucket Sort
      • n numbers
      • Each number is in the range 0 to N−1 (here, the kth digit of each key, base N)
      • stable
      • time complexity: O(n + N)
  • As an example, consider the case where N = 3 and let our array be (kth digit shown only): [figure omitted]
  • (Note that there are two "0"s and two "1"s.)
    • First, we create N "buckets".


      Bucket Sort (cont.)
    • Each element of the array is put into one of the N "buckets": [figure omitted]
Bucket Sort (cont.)
    • Now, pull the elements from the buckets into the array.
At last, the sorted array (sorted in a stable way):
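
One such stable pass might be sketched in Java as follows (the class and helper names are assumptions; straight radix sort is then just a loop calling bucketSortOnDigit for k = 0 to b−1):

import java.util.*;

public class BucketPass
{
    // The kth digit of x, base N (k = 0 is the rightmost digit).
    static int digit(int x, int k, int N)
    {
        for (int i = 0; i < k; i++) x /= N;
        return x % N;
    }

    // One stable bucket-sort pass on digit k: O(n + N) time.
    static void bucketSortOnDigit(int[] a, int k, int N)
    {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int b = 0; b < N; b++) buckets.add(new LinkedList<>());
        for (int x : a)                          // scatter: appending in array order
            buckets.get(digit(x, k, N)).add(x);  // is what makes the pass stable
        int i = 0;
        for (List<Integer> bucket : buckets)     // gather: bucket 0 first
            for (int x : bucket) a[i++] = x;
    }
}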


Cost of Straight Radix Sort
O(b(n + N))
where N = number of buckets in the bucket sort.

representation    N
binary            2
decimal           10
hexadecimal       16
ASCII             128


Hash: what is it?
A review of familiar material?
A side order for your eggs?
A combination of the two?




Adapted from the Goodrich and Tamassia lecture on Hashing

Dictionary Problem
  • A phone company wants to provide caller ID capability:
    • Given a phone number, return the caller’s name.
    • Phone numbers are in the range 0 ≤ r ≤ R, where here R = 10⁷ − 1.
    • They want to do this as efficiently as possible.
  • A few suboptimal ways to design this dictionary:
    • an array indexed by key: O(1) time, but O(n + R) space – a huge amount of wasted space.
    • a linked list: Takes O(n) time, O(n) space
    • A balanced binary tree: O(log n) time, O(n) space.

A Better Solution
  • We can do better with a hashtable – O(1) expected time, O(n + N) space, where N is the size of the hash table.
  • We need a function to map the large range of keys into a smaller range of table indices.
    • e.g., h(K) = K mod N
  • Insert {402-3045, "CAJ (w)"} into a hashed array with, say, N = 5 slots.
    • 4023045 mod 5 = 0, so {402-3045, "CAJ (w)"} goes in slot 0 of the hash table.
  • A lookup uses the same process: hash the query key, then check the array at that slot.
  • Insert {428-7971, "CAJ (h)"}: 4287971 mod 5 = 1, so it goes in slot 1.

Collision Resolution
  • How do we deal with two keys that hash to the same index in the array?
  • One policy is to use separate chaining
    • Set up an array (i.e., the table) of links, indexed by hash values, to sequences of items whose keys hash to the same index.
    • Separate chaining is the most time-efficient collision-resolution policy; other policies are more space-efficient.
Lafore Separate Chaining Applet

Pseudo-code for Dictionary ADT Methods
  • Constructor
    • Create a table of N sequences.
  • insertItem(k, e)
      index ← h(k)
      insert the item (k, e) into the sequence at table[index]
  • findElement(k)
      index ← h(k)
      step through the sequence at table[index], looking for a match
      if the search was successful, return the found element
      else, return NO_SUCH_KEY
  • remove(k)
      index ← h(k)
      step through the sequence at table[index], looking for a match
      if the search was successful,
          remove the found item from the sequence
          return the found element
      else, return NO_SUCH_KEY
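
A compact Java sketch of this dictionary (java.util.LinkedList stands in for the lecture's Sequence ADT; the class name ChainedHashTable and the NO_SUCH_KEY sentinel object are assumptions):

import java.util.LinkedList;

public class ChainedHashTable
{
    public static final Object NO_SUCH_KEY = new Object(); // sentinel return value

    static class Item
    {
        int key; Object elem;
        Item(int k, Object e) { key = k; elem = e; }
    }

    private final LinkedList<Item>[] table;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int N)
    {
        table = new LinkedList[N];
        for (int i = 0; i < N; i++) table[i] = new LinkedList<>();
    }

    private int h(int k) { return Math.floorMod(k, table.length); }

    public void insertItem(int k, Object e) { table[h(k)].add(new Item(k, e)); }

    public Object findElement(int k)
    {
        for (Item it : table[h(k)])            // step through the sequence
            if (it.key == k) return it.elem;
        return NO_SUCH_KEY;
    }

    public Object remove(int k)
    {
        LinkedList<Item> seq = table[h(k)];
        for (Item it : seq)
            if (it.key == k) { seq.remove(it); return it.elem; }
        return NO_SUCH_KEY;
    }
}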

Hash Functions
  • A good hash function:
    • is quick to compute, and
    • distributes the keys uniformly throughout the table.
  • Dealing with non-integer keys:
    • find some way of converting the keys into integers.
      • If the key is a string, we could add the ASCII values of the characters.
    • Then use a standard hash function for integers.
  • A common choice for the hash function is
h(k) = k mod N.
  • Choosing N, the size of the table:
    • N = b^e (bad)
      • If N is a power of 2, for example, h(k) will give the e least-significant bits of k.
      • All keys with the same ending will go to the same place.
    • Prime number N (good)
      • helps ensure uniform distribution of keys.
    • Mid-square
      • h(k) = the middle digits of k².
      • Sometimes it is impractical to use a table whose size is a prime number.
      • This hash function can be used to reduce the probability of collisions when N = b^e.
      • e.g., table size is a power of 10 (the middle digits of the square, set off by spaces, become the hash value):
          4150130² = 17223 5790 16900, so h(4150130) = 5790
          415013034² = 172235 818389 885156, so h(415013034) = 818389
          1150130² = 13227 9901 6900, so h(1150130) = 9901
      • e.g., table size is a power of 2 (middle bits set off by spaces):
          1001² = 10 100 01, so h(1001) = 100
          1011² = 11 110 01, so h(1011) = 110
          1101² = 101 010 01, so h(1101) = 010
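
Two of these ideas as small Java methods (sketches only; for mid-square, the choice to take the middle e bits of the 64-bit square is an illustrative assumption):

public class HashFns
{
    // Convert a string key to an integer by summing its character codes,
    // then compress with mod N (N ideally prime).
    static int hashString(String s, int N)
    {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) sum += s.charAt(i);
        return sum % N;
    }

    // Mid-square for N = 2^e: square the key and keep e bits from the
    // middle of the 64-bit square.
    static int midSquare(int k, int e)
    {
        long sq = (long) k * k;                       // k squared, without overflow
        int shift = (64 - e) / 2;                     // drop low bits so the middle remains
        return (int) ((sq >>> shift) & ((1L << e) - 1));
    }
}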

Open Addressing Collision Resolution
  • Open addressing uses only an array. The array slots store the items directly.
  • Uses less memory than separate chaining.
    • Don’t need to store links, sentinels, etc.
    • Fewer empty array slots. Clusters spread through the array.
  • To look up a key, start at the hashed index, then probe forward through the table until either the key is found or an empty slot is found.
  • Slower than separate chaining. The lookup may have to step through many table indices before the key is found.
  • Removing an item poses difficulties. Either mark the deleted slot with a special "deleted" marker (a search must treat it as occupied, an insert as free) or fill in the slot by shifting some elements downward.
  • Specific open-addressing resolution policies include:
    • Linear probing
    • Quadratic probing
    • Double hashing

Linear Probing
  • Linear probing is the simplest of the open addressing policies.
  • If the current slot is already being used, just try the next slot.
Algorithm linearProbingInsert(k, e)
Input: Key k, element e
    if (table is full)
        error
    probe ← h(k)
    while table[probe] is occupied
        probe ← (probe + 1) mod N
    table[probe] ← (k, e)
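
The same algorithm in Java might be sketched as follows (keys are assumed to be non-negative ints, and a boolean occupied[] array marks used slots; a sketch, not the lecture's code):

public class LinearProbing
{
    private final int[] keys;
    private final Object[] elems;
    private final boolean[] occupied;
    private int n; // number of stored items

    public LinearProbing(int N)
    {
        keys = new int[N]; elems = new Object[N]; occupied = new boolean[N];
    }

    public void insert(int k, Object e)
    {
        if (n == keys.length) throw new IllegalStateException("table is full");
        int probe = k % keys.length;        // probe <- h(k)
        while (occupied[probe])             // slot taken: just try the next slot
            probe = (probe + 1) % keys.length;
        keys[probe] = k; elems[probe] = e; occupied[probe] = true; n++;
    }
}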

Linear Probing Example
  • Table size N = 13.
  • h(k) = k mod 13
  • Insert the keys
18  41  22  44  59  32  31  73

After all the insertions (44 collides with 18 at slot 5 and probes to 6; 32 probes from 6 to 8; 31 probes from 5 to 10; 73 probes from 8 to 11):

index:   0    1    2    3    4    5    6    7    8    9   10   11   12
key:              41              18   44   59   32   22   31   73

  Lafore Linear Probing Applet


  Double Hashing
  • Uses two hash functions, h1(k) and h2(k).
    • Typically h1(k) = k mod N.
    • Typically, h2(k) = q – (k mod q),
where q is a prime number and q < N.
  • If N is prime, all slots in the table will eventually be examined.
  • Many of the same (dis)advantages as linear probing. Space-efficient but slow compared with separate chaining.
  • Distributes keys more uniformly than linear probing.
Algorithm doubleHashInsert(k,e)
Input: key k, element e
    if (table is full)
        error
    probe ← h1(k)
    offset ← h2(k)
    while table[probe] is occupied
        probe ← (probe + offset) mod N
    table[probe] ← (k, e)
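
Only the probe step changes from linear probing; below is a method that could be added to the LinearProbing sketch above (q is passed in; q = 7 in the example that follows):

    // Double-hashing insert: like linear probing, except that the step
    // size comes from h2(k) = q - (k mod q), which is never zero.
    public void insertDoubleHash(int k, Object e, int q)
    {
        if (n == keys.length) throw new IllegalStateException("table is full");
        int probe  = k % keys.length;       // probe  <- h1(k)
        int offset = q - (k % q);           // offset <- h2(k)
        while (occupied[probe])
            probe = (probe + offset) % keys.length;
        keys[probe] = k; elems[probe] = e; occupied[probe] = true; n++;
    }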

Double Hashing Example
  • Table size N = 13
  • h1(k) = k mod 13
  • h2(k) = 7 – k mod 7
  • Keys to be inserted:
18  41  22  44  59  32  31  73

After all the insertions (44 collides at slot 5 and jumps by h2(44) = 5 to slot 10; 31 collides at 5, jumps by h2(31) = 4 to slot 9, which is occupied, then to (9 + 4) mod 13 = 0):

index:   0    1    2    3    4    5    6    7    8    9   10   11   12
key:    31        41              18   32   59   73   22   44



Lafore Double/Quadratic Probing Applet

Theoretical Results
  • The load factor α is the average number of keys per array index: α = n/N.
  • The analysis is probabilistic rather than worst-case.
Expected number of probes in a search (the classical estimates, as functions of α):

                        not found                  found
chaining                1 + α                      1 + α/2
linear probing          (1/2)(1 + 1/(1 − α)²)      (1/2)(1 + 1/(1 − α))
double hashing          1/(1 − α)                  (1/α) ln(1/(1 − α))


[Figure: expected number of probes in a search vs. load factor α]
