Hashing

last updated 4/15/21

Performance Summary of searching algorithms

+ advantage and - disadvantage

Linear search

  • - O(n) 
  • + doesn't require a particular arrangement of the data 
  • + easy to implement 
  • + works for arrays and linked lists

Binary search

  • + O(log n) 
  • - requires sorted data
  • - limited to arrays (needs random access by index)

Binary search tree

  • + O(log n) if the tree is reasonably balanced (O(n) worst case)
 

Hashing

  • + O(1), i.e., no dependence on the size of the data (for some implementations) 
  • - data unordered, not even chronological 
  • - requires an array as a starting point

Unordered insertion/add

  • + O(1) for front of linked list or end of array

Ordered insertion/add 

  • - O(n) for array or linked list
  • + O(log n) for BST

Hash insertion/add

  • + O(1)


Simple hashing

When using hashing, the data elements are simply stored and retrieved.  There is NO order, so defining compareTo() is irrelevant.

Ideas and observations

Example 1:


Example 2:

This requires a hash function (the text calls it a compression function) so that a method call such as hash(key) returns a valid index that can be used as the array subscript.

// a simple in-line hash "function" -- another use of mod!
static public int hash(int studentId){
    int hashKey = studentId % 2000;
    return hashKey;
}
// Assumes a perfect hash function

public void add(Hashable newElement){
    // Adds element to this list at position element.hash().
    int location = hash(newElement);
    list[location] = newElement;
    numElements++;
}
public Hashable get(Hashable element){
    // Returns an element e from this list such
    // that e.equals(element).
    int location = hash(element);
    return (Hashable)list[location];
}


Problems with hashing

For hashing to work well you need:

  1. A hash function that statistically distributes the hash indexes uniformly over the range of the array
  2. A method to effectively handle collisions


Designing a hash function

Issues:

Options for numeric keys

  1. Division: divide the key by the size of the array and use the remainder as the index (modulus %)
  2. Digit extraction: isolate digits out of the key (modulus and integer division, e.g. (key%10000)/100 extracts thousands and hundreds digits)
  3. Midsquare: square the key and extract some digits
  4. Pseudo-random: use the key as a seed to a random number generator (for RNGs that produce a number between 0 and 1, multiply by the table size to span the array; for RNGs that produce an integer, use "division" to bring the result into range)

Options for non-numeric or string type keys

  1. Folding or digitizing: add up the digits or the ASCII/Unicode codes of the characters of the string
  2. Weights may be applied to the character positions so that permutations of the same characters are not automatically synonyms

Finally, reduce the hash value (e.g., with %) to ensure it is a valid index within the size of the array.

Also, statistical testing is recommended: use a representative key set, generate the corresponding hash values, and analyze their distribution for uniformity.
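
For illustration, here are minimal sketches of the first three numeric-key options (the table size SIZE is an assumption, and keys are assumed to be non-negative):

static final int SIZE = 127;                        // assumed table size

// 1. Division: remainder of the key divided by the table size
static int divisionHash(int key) {
    return key % SIZE;
}

// 2. Digit extraction: (key % 10000) / 100 pulls out the thousands and
//    hundreds digits (e.g., 987654 -> 76); then reduce into range
static int digitExtractionHash(int key) {
    return ((key % 10000) / 100) % SIZE;
}

// 3. Midsquare: square the key and extract some middle digits
static int midsquareHash(int key) {
    long squared = (long) key * key;                // long avoids int overflow
    return (int) ((squared / 1000) % 1000) % SIZE;  // roughly the middle digits
}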


Collision resolution

The simplest method of collision resolution is called linear probing.

For the add() operation to insert:

 

final static int SIZE = 127;   // a prime number is a good choice
static Object[] list = new Object[SIZE];
static int numElements = 0;

public static void add(Hashable newElement)
// Adds element to this list at position hash(newElement),
// or in the next free array slot.
// (Assumes the table never becomes completely full.)
{
    int location = hash(newElement);
    while (list[location] != null)
        location = (location + 1) % list.length;
    list[location] = newElement;
    numElements++;
}


The get() operation needs to replicate the insertion  process. 

public static Hashable get(Hashable element)
// Returns an element e from this list such that e.equals(element).
{
    int location = hash(element);
    while (list[location] != null && !list[location].equals(element))
        location = (location + 1) % list.length;
    return (Hashable) list[location];   // may return null
}

Removal issues

A naive approach could be

void remove (element){
     list [ hash( element) ] = null;
} 

You should not remove items from the hash table in the case of linear probing. 
Why?

Collisions, however, complicate the matter. We cannot be sure that our element is in location element.hash(). 

Linear probing is problematic for removal.

A boolean flag could be added to each slot to indicate that it was once part of a synonym grouping; a search then stops only when it reaches a slot whose flag is false.
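
A rough sketch of that idea, reusing list, SIZE, hash(), and numElements from the linear-probing code above (and assuming the table never fills completely with live elements and "tombstoned" slots):

static boolean[] wasOccupied = new boolean[SIZE];   // true if a slot has ever held an element

public static void remove(Hashable element)
// Probes for the element the same way get() does; if found, empties the slot
// but leaves the wasOccupied flag set as a "tombstone".
{
    int location = hash(element);
    while ((list[location] != null || wasOccupied[location])
            && (list[location] == null || !list[location].equals(element)))
        location = (location + 1) % list.length;
    if (list[location] != null) {                   // found it
        list[location] = null;
        wasOccupied[location] = true;               // tombstone: later searches keep probing past here
        numElements--;
    }
}

public static Hashable get(Hashable element)
// Stops only at a slot that was never occupied; probes past removed elements.
{
    int location = hash(element);
    while ((list[location] != null || wasOccupied[location])
            && (list[location] == null || !list[location].equals(element)))
        location = (location + 1) % list.length;
    return (Hashable) list[location];               // may return null
}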

Disallow removal?

There are better ways to handle collisions and removals than linear probing. See chaining below.

 


Clustering

Clustering or clumping of data in a local area is a phenomenon of hashing, due to the not-so-random nature of the data.

Primary clustering results directly from synonyms (different keys that hash to the same home location).

Secondary clustering occurs when collisions make clusters grow into neighboring locations.

To reduce secondary clustering, one of the following can be used:

  1. Quadratic probing: search H(k)+1, H(k)+4, H(k)+9, ... H(k)+i², where i is the number of collision-resolution iterations executed so far. (A sketch follows this list.)
  2. Rehashing: create several hash functions H1, H2, H3, ... and use them in succession; the final one would have to resort to one of the other techniques. Rehashing resolves a collision by computing a new hash location from a hash function that manipulates the original location rather than the element's key.
  3. Random probing: search H(k)+r1, H(k)+r2, ... H(k)+ri, i.e., each r is a call to the RNG (without reseeding)
  4. With hashing by pseudo-random numbers, you can just get the next random number without resetting the seed; an RNG will generate the same sequence from a given seed.
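
As a rough sketch, here is quadratic probing for add(), reusing the list array, SIZE, hash(), and numElements from the linear-probing code above (with a prime table size and a load factor below about 1/2, a free slot will be found):

public static void add(Hashable newElement)
// Quadratic probing: try H(k), H(k)+1, H(k)+4, H(k)+9, ... (mod table size).
{
    int home = hash(newElement);
    int i = 0;                                     // collision-resolution iterations so far
    int location = home;
    while (list[location] != null) {
        i++;
        location = (home + i * i) % list.length;   // quadratic step
    }
    list[location] = newElement;
    numElements++;
}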


Java hashCode( ) method

Every object created in Java lives in the object heap, and the Object class defines a public hashCode function whose default result is derived from the object's identity (effectively its address).  Therefore all Java objects have an associated hash code.

You can take advantage of this built-in hashing by calling the hashCode method for the object and applying % size to the result for an easy hash.

The Java Object class exports a hashCode method that returns an int hash code. The standard Java hash code for an object is a function of the object’s memory location.

For most applications, hash codes based on memory locations are not usable. Therefore, many of the Java classes that define commonly used objects (such as String and Integer), override the Object class’s hashCode method with one that is based on the contents of the object.

If you plan to use hash tables in your programs, you should do likewise.
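
For example (a minimal sketch; the Student class and its fields are made up for illustration):

// Turn any object's built-in hash code into a valid table index.
// The & 0x7fffffff clears the sign bit, since hashCode() may return a negative int.
static int indexFor(Object key, int size) {
    return (key.hashCode() & 0x7fffffff) % size;
}

// Overriding hashCode so that it depends on the object's contents,
// as String and Integer do, keeping it consistent with equals().
class Student {
    private final String name;
    private final int id;

    Student(String name, int id) { this.name = name; this.id = id; }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Student)) return false;
        Student s = (Student) other;
        return id == s.id && name.equals(s.name);
    }

    @Override
    public int hashCode() {
        return 31 * name.hashCode() + id;   // content-based hash code
    }
}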

 


Improvements on hash tables

Problems with a simple array as the base structure: fixed capacity, performance that degrades from collisions and clustering as the array fills, and awkward removal under probing.

See text's implementation with basic array from chapter 8: Hmap.java


Improvements

Buckets

Each slot of the table holds a small, fixed number of elements (a bucket) rather than a single element, so several synonyms can share a slot before any overflow handling is needed.


Databases use buckets for fast retrieval by hashing
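
A rough in-memory sketch of the bucket idea (the bucket size and types are assumptions; a database would use disk blocks as the buckets):

static final int SIZE = 127;          // number of buckets
static final int BUCKET_SIZE = 4;     // elements per bucket
static Object[][] table = new Object[SIZE][BUCKET_SIZE];

// Place an element in the first free slot of its bucket (hashValue assumed non-negative).
// Returns false if the bucket is full, so overflow must be handled some other way.
static boolean add(Object element, int hashValue) {
    Object[] bucket = table[hashValue % SIZE];
    for (int i = 0; i < BUCKET_SIZE; i++) {
        if (bucket[i] == null) {
            bucket[i] = element;
            return true;
        }
    }
    return false;   // bucket overflow
}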

Chaining

Each slot of the table holds a reference to a linked list (chain) of all the elements that hash to that slot.


Sample code for a chained hash table

Note an element in this simple hash table can be of any type for which equals() is defined.

Data need not be Comparable.  There is no order.  The code is implementing a bag since it doesn't check for uniqueness.

We could use a LinkedCollection class for the linked list.

private class LLNode<T> {
   // inner class definition for a node, as before
   private T         info;
   private LLNode<T> link;
   private LLNode(T value, LLNode<T> next){
      this.info = value;
      this.link = next;
   }
}

final static int SIZE = 127;
LLNode<T>[] hashTable = new LLNode[SIZE];   // generic array creation gives an unchecked warning
   // the array slots start out null
   // LLNode has 'info' and 'link' instance variables
int numElements = 0;

private int hash(T element){
   // as an example, this hash function folds strings
   String key = element.toString();
   int sum = 0;
   for(int i = 0; i < key.length(); i++){
      sum = sum + key.charAt(i) * (key.length() - i);   // weighted by position
   }
   return sum % SIZE;
}

public void add(T element)
// Adds element to the chain at position hash(element).
{
   int location = hash(element);
   // assume the key is unique
   // insert the new node as the new head of the synonym list -- O(1)
   hashTable[location] = new LLNode<T>(element, hashTable[location]);
   numElements++;
}

public T get(T element)
// Returns an element e from this list such that e.equals(element).
{
   LLNode<T> temp = hashTable[hash(element)];           // get the head of the synonym list
   while(temp != null && !temp.info.equals(element)){   // search down the list
      temp = temp.link;
   }
   if(temp != null){                                    // return the found element, or null
      return temp.info;
   }else{
      return null;
   }
}

 
Objects can easily be deleted in this implementation; a sketch follows.
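
For instance, a minimal remove() sketch for the chained table above (reusing hashTable, hash(), LLNode, and numElements from the sample code; not from the text):

public boolean remove(T element)
// Removes an element e such that e.equals(element), if one is present,
// and returns true if something was removed.
{
   int location = hash(element);
   LLNode<T> prev = null;
   LLNode<T> temp = hashTable[location];
   while(temp != null && !temp.info.equals(element)){   // search the synonym list
      prev = temp;
      temp = temp.link;
   }
   if(temp == null){
      return false;                       // not found
   }
   if(prev == null){
      hashTable[location] = temp.link;    // removing the head of the chain
   }else{
      prev.link = temp.link;              // unlink the node from the chain
   }
   numElements--;
   return true;
}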


Performance of hashing

Factors affecting the performance:

  1. quality of hash function
  2. collision resolution technique
  3. utilization of space (chained hash table excepted)

The load factor, α = (number of elements stored) / (table size), measures how full a table is.

It can also be interpreted as the probability that any one slot is in use.

α = 0.5 means that half of the table is full, or a 50% probability that any one slot is in use

α = 1 means that the table is full

α = 2 means (in the case of a chained hash table) that each slot has an average chain length of 2

Expected number of probes for a successful search of a hash table:

Linear probing:            (1 + 1/(1 - α)) / 2
Random, rehash, quadratic: (-ln(1 - α)) / α
Chaining:                  1 + α/2

Exercise: pick a value of α and evaluate each formula.
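
For instance (an illustrative calculation, not from the text), with α = 0.5:

Linear probing:            (1 + 1/(1 - 0.5)) / 2 = (1 + 2)/2 = 1.5 probes
Random, rehash, quadratic: (-ln(1 - 0.5)) / 0.5 ≈ 0.693 / 0.5 ≈ 1.39 probes
Chaining:                  1 + 0.5/2 = 1.25 probes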



Java API Hashtable class

The library includes several other collection classes, such as HashSet, that provide an ADT whose underlying implementation uses the approaches described in this section.

HashSet class

The Java library includes a HashSet class that uses hash techniques to support storing objects in a table.  It is a basic hash table of unique values (a set). Identify the basic operations available.

The Hashtable class and the HashMap class store their elements as mappings and use buckets.

A map is a pairing of keys and values (objects).

 

Method-by-method comparison of HashMap<K,V> and TreeMap<K,V>:

Constructors: HashMap<K,V>(), HashMap<K,V>(int cap), HashMap<K,V>(int cap, float loadFactor) can accept an initial table size and an optional load factor; the TreeMap<K,V>() constructor can accept a comparator for the keys.

boolean containsKey(K key): HashMap O(1), TreeMap O(log N)
boolean containsValue(V value): HashMap O(N), TreeMap O(N)
V get(K key): HashMap O(1), TreeMap O(log N)
V put(K key, V value): HashMap O(1), TreeMap O(log N); if the key already exists, the value is replaced and the old one is returned, else null is returned
V remove(K key): HashMap O(1), TreeMap O(log N); if the key is found, its value is returned, else null is returned
int size(): HashMap O(1), TreeMap O(1)

There is no firstKey() or lastKey() counterpart in HashMap as there is in TreeMap, but otherwise the methods are the same.
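
A minimal usage sketch of HashMap (the keys and values are made up for illustration):

import java.util.HashMap;
import java.util.Map;

public class HashMapDemo {
    public static void main(String[] args) {
        Map<Integer, String> students = new HashMap<>();

        students.put(1234, "Anne");                     // O(1) expected insert
        students.put(5678, "Joe");
        String old = students.put(1234, "Jack");        // key exists: replaces value, returns "Anne"

        System.out.println(students.get(5678));         // "Joe", O(1) expected lookup
        System.out.println(students.containsKey(9999)); // false
        System.out.println(students.remove(1234));      // "Jack"
        System.out.println(students.size());            // 1
        System.out.println(old);                        // "Anne"
    }
}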


Map constructs in other languages

Some programming languages (e.g., Awk, Haskell, JavaScript, Lisp, MUMPS, Perl, PHP, Python, and Ruby) directly support maps/dictionaries.

Many other languages, including Java, C++, Objective-C, and Smalltalk provide map functionality through their standard code libraries.

Other names for Maps

Dictionary, associative array, hash (the Perl/Ruby term), symbol table.


Extendible Hashing

This technique lets a hash table grow gracefully as more data is added; it is typically used when the buckets are disk blocks (file blocks), as in databases.

Ideas:

Overflow of a bucket causes the bucket's contents to be split into two buckets (the original is kept, plus one new bucket or file block)

The range of the hash function has to be extendible to accommodate additional buckets

Example: family of hash functions based on h:

The hash key, viewed as a bit string, is mapped to a bucket through a directory

Assumes a uniform distribution
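
As a rough sketch of the directory lookup (assuming 4-bit hash values as in the example below; k is the current number of directory bits):

// The top k bits of the hash value select a directory entry,
// which in turn points to a bucket.
static int directoryIndex(int hashValue, int k) {
    return hashValue >>> (4 - k);   // keep the k high-order bits of a 4-bit value
}

For instance, with k = 2, Sue's hash value 1011 gives directoryIndex(0b1011, 2) = 0b10, i.e., directory entry 10.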


Extendible Hashing - Example

Data key hash value
Jack 0 1 0 1
Anne 1 1 1 1
Joe 1 1 0 0
Quincy  0 0 1 0
Sue 1 0 1 1
Bob 0 0 1 1
Ed 1 1 0 1

Inserting Jack, Anne, Joe, Quincy

[k=1]

Directory Index Bucket (limit=2)
0 -> 1 1: 0101 Jack; 0010 Quincy
1 -> 2 2: 1111 Anne; 1100 Joe

Sue (1011) causes directory expansion, bucket addition and rehash of full bucket

[k=2]

Directory Index Bucket (limit=2)
00 -> 1 1: 0101 Jack; 0010 Quincy
01 -> 1 2: 1111 Anne; 1100 Joe
10 -> 3 3: 1011 Sue
11 -> 2    

Bob (0011) causes bucket addition and rehash

Directory Index Bucket (limit=2)
00 -> 4 1: 0101 Jack;
01 -> 1 2: 1111 Anne; 1100 Joe
10 -> 3 3: 1011 Sue
11 -> 2 4: 0010 Quincy; 0011 Bob

Ed (1101) causes directory expansion, bucket addition and rehash (exercise to fill in the table)

[k=3]

Directory Index Bucket (limit=2)
000 -> 4 1: 0101 Jack
  2: 1111 Anne; 1100 Joe
010 -> 1 3: 1011 Sue
  4: 0010 Quincy; 0011 Bob
100 -> 3
110 -> 2

 


Extendible Hashing Deficiencies:

Extra space for directory

Cost of added level of indirection: each lookup must go through the directory before reaching the bucket