DSA - Chapter 4
Chapter 4. Hashing
The Dictionary ADT
A dictionary is a set data type that supports 3 operations: INSERT, DELETE, and SEARCH.
A linked list or an array can be used to implement a dictionary, but searching is not very efficient.
Hashing
Hashing consists of 2 parts:
- A table (an array) of size m.
- A hash function h which maps a key (of an element) into an index (index) between 0 and m-1 inclusive.
An element with key k is mapped to slot h(k) of the array.
Problem with Hashing
2 keys may have the same hash value - a collision.
To handle collision:
- A good hash function, such that collision does not occur often
- A collision resolution strategy
Chaining (Open hashing)
Keeping a linked list of all elements that have the same hash value
Implementation
typedef struct nod *node_ptr;
struct node {
element ele;
node_ptr next;
}
typedef node_ptr LIST;
LIST T[m];
To initialise the hash table:
Chained_Hash_Init(T){
int i;
for (i=0; i<m; i++)
T[i] = NULL;
}
To search an element with key = k:
Chained_Hash_search(T,k){
node_ptr p;
p = T[h(k)];
while ((p != NULL) && (p -> ele.key != k))
p = p -> next;
return (p);
}
To insert an element x:
Chained_Hash_Insert(T,x){
node_ptr p;
p = T[h(x.key)];
insert x into the linked list pointed to by p;
}
For delete, if we only have the key of the element to be deleted, we have to search for it first.
Analysis
Given a hash table T with m slots that stores n elements, we define the load factor a for T as n/m.
Intuitively, a is the average number of nodes (elements) stored in a chain.
Insert: takes constant time $O(1)$
Delete: constant time if we have the pointer to the node and the list is a doubly linked list. Otherwise its running time is similar to searching
Searching: the worst case happens when all the keys stored are mapped to the same slot. $O(n)$
Simple uniform hashing
The performance of hashing depends on the hash function.
Good hash function should map the keys in the universe as uniformly as possible over m slots.
Which is ideal hash function called simple uniform hashing.
Search Complexity
Time required to evaluate the hash function + the time to traverse the list
Thus, average complexity of unsuccessful search is $O(1+a).$
In a successful search, the average number of elements examined is 1 + a/2.
Hash functions
A good hash function should satisfy the assumption of simple uniform hashing and be easy to compute (as efficient as possible).
If the universe of keys is not the set of natural numbers, we need to convert the keys to natural numbers first.
The division method
For a key k and a table size m: h(k) = k mod m
Choice of m is very important.
Some general rules of thumb for choosing the value of m:
- m should not be a power of 2
- m should not be a power of 10 if the application deals with decimal numbers
- Good values of m are prime numbers
- If (3) is too difficult to meet, pick m such that it has no prime factors less than 20.
The multiplication method
Procedure:
- Pick a constant 0 < A < 1
- Take the fractional part of k x A
- Multiply the result by m
- Take the integer part of the result
- h(k) = int(frac(k x A) x m)
Choice of A is important when one uses the multiplication method.
As a rule of thumb, A should be an irrational number.
The golden ratio would be the good choice.
Open addressing (Closed Hashing)
In open addressing, all elements are stored in the hash table itself.
Each table entry contains either an element of the dictionary or a special value (called “NIL”) which signifies that no element is stored in the slot.
2 advantages compared with chaining (By avoiding pointers):
- Saves time in not doing memory allocation and de-allocation
- Saves space
- Given the same amount of memory, enables to afford a larger number of slots, leading to fewer collisions.
Collision resolution with open addressing
When inserting an element, if a collision occurs, alternative slots are tried until an empty slot is found (called probing)
Insert:
Open_Address_Hash_Insert(T,x){
i = 0; //collision count
k = x.key;
do {
j = h(k,i);
if (T[j].key == NIL){
copy x to T[j];
return;
}
else
i++;
}
while (i < m)
return ("hash table overflow!");
}
Search:
To search for an element, follow the same probe sequence until the element is found or when an empty slot is encountered.
Open_Address_Hash_Search(T,k){
i = 0;
do {
j = h(k,i);
if (T[j].key == k)
return (j);
i++;
}
while ((T[j].key != NIL) && (i < m))
return ("Not Found!")
}
Delete:
To delete an element, follow the same probe sequence until the element is found, or NIL is encountered.
Open_Address_Hash_Delete(T,k){
i = 0;
do {
j = h(k,i);
if (T[j].key == k){
T[j].key = DELETED;
return;
}
i++;
}
while ((T[j].key != NIL) && (i < m))
return ("Not Found!");
}
Open_Address_Hash_Insert(T,x){
i = 0; //collision count
k = x.key;
do {
j = h(k,i);
//Insert can be changed so that it is looking for
//an empty slot or a deleted slot.
if ((T[j].key == NIL) || (T[j].key == DELETED)){
copy x to T[j];
return;
}
else
i++;
}
while (i < m)
return ("hash table overflow!");
}
Load Factor
Because all the data goes inside the table, a bigger table is needed for open addressing than for chaining.
In general, the load factor for open addressing should be below 0.5.
Probe sequence
Depending on the increment function $f(i)$ as in $h(k,i) = [h_1(k)+f(i)]$ $mod$ $m$.
Linear Probing
For linear probing, $f()$ is a linear function of $i$, typically, $f(i) = i.$
This amounts to trying the slots sequentially (with wrap around) in search for an empty slot.
Primary clustering
Problem with linear probing: blocks of consecutively occupied slots occur. These are called cluster.
For any key that “hashes” into a cluster, it will require several attempts to resolve collisions, and then the key will add to the cluster.
It makes searching inefficient.
Key problem: There is only 1 probe sequence for all the keys (although with different starting points).
Quadratic Probing
Try to eliminate primary clustering by allowing more probe sequences.
For quadratic probing, $f()$ is a quadratic function of $i$, e.g., $h(k,i) = [h_1(k)+i^2]$ $mod$ $m$.
Key problems:
- The probe sequence may not span the whole table
- There are only m probe sequences. This is called secondary clustering.
Double Hashing
Double hashing uses 2 hash functions to define a probe sequence:
\[h(k,i) = [h_1(k)+i(h_2(k))] mod m\]$h_1$ to determine the initial slot and $h_2$ to determine the “increment”, or the next slot to probe.
Two keys will have the same probe sequence if they have the same hash values for both $h_1()$ and $h_2()$.
Clustering therefore does not occur often.
Some constraints on $h_2(k)$:
- It should never evaluate to 0.
- It should be relatively prime to $m$.
Performance analysis
For an open address hash table with load factor a < 1:
With linear probing, the average number of probes for
- Unsuccessful search = $\frac{1}{2}(1+1/(1-a)^2)$
- Insertion = $\frac{1}{2}(1+1/(1-a)^2)$
- Successful search = $\frac{1}{2}(1+1/(1-a))$
With double hashing, the average number of probes for
- Unsuccessful search = $1/(1-a)$
- Insertion = $1/(1-a)$
- Successful search = $(1/a)(ln(1/(1-a)))$
Rehashing
With open addressing, if the table gets too full (a > 0.5), the run time of the operations will start taking too long.
Solution: Build a larger hash table and rehash the elements to the larger table.