Hashing

last updated 4/20/07

Performance Summary of searching algorithms

 

Linear search

  • O(n) 
  • doesn't require a particular arrangement of the data 
  • easy to implement 
  • works for arrays and linked lists

Binary search

  • O(log n) 
  • requires sorted data
  • limited to array
 

Hashing

  • O(1), i.e., no dependence on the size of the data (for some implementations 
  • data in no order, not even chronological 
  • requires an array as a starting point

Unordered storage

  • O(1) for front of linked list or end of array

Ordered Storage 

  • O(n) for array or linked list
  • O(log n) for binary tree

Hash Storage

  • O(1)

 

 

 

 

 


Simple hashing

Ideas and observations

Example 1:

 

 

 


Example 2:

This requires a hash function so that a method call such as

hash ( key ) will return an index that can be used as the array subscript.

hashKey = studentId % 1600; 
// a simple in-line hash "function" public void add (Hashable element)
// Adds element to this list at position element.hash().
{
int location;
location = element.hash();
list[location] = element;
numElements++;
}
public Hashable get(Hashable element)
   // Returns an element e from this list such 
   // that e.equals(element).
{
   int location;
   location = element.hash();
   return (Hashable)list[location];
}


Problems with hashing

For hashing to work well you need

  1. A hash function that statistically distributes the hash indexes uniformly over the range of the array
  2. A method to effectively handle collisions

 

 

 

 


Designing a hash function

  1. Division: divide the key by the size of the array and use the remainder as the index (modulus %)
  2. Digit extraction: isolate digits out of the key (modulus and integer division, e.g. (key%10000)/100 extracts thousands and hundreds digits)
  3. Midsquare: square the key and extract some digits
  4. Folding: add up the digits or ascii/unicode codes of the characters of a string; weights may be added to positions of the characters
  5. Pseudo-random: use the key as a seed to a random number generator (RNG's that produce a number between 0 and 1, you multiply by the hash size to span the array; RNG's that produce an integer, you use "division" to ensure proper range)

It is common to use a combination. Usually the function will finish with division to guarantee that we generate a valid index.

Also, it is recommended to use a representative key set and generate a set of corresponding hash values and analyze its statistical properties for even distribution.

 

 

 


Collision resolution

The simplest method of collision resolution is call Linear probing:

final static int SIZE = 127; //a prime number is good choice
Object list [SIZE];
public static void add (Hashable element)
// Adds element to this list at position element.hash(),
// or the next free array slot.
{
int location;
location = element.hash();
while (list[location] != null)
location = (location + 1) % list.length;
list[location] = element;
numElements++;
}
public static Hashable get(Hashable element)
   // Returns an element e from this list such that e.equals(element).
   {
   int location;
   location = element.hash();
   while (!list[location].equals(element))
      location = (location + 1) % list.length;
   
   return (Hashable)list[location];
   } 

 

Removal issues

You should not remove items from the hash table in the case of linear probing. 
Why?

An approach could be

void remove (element){
     list [ element.hash( ) ] = null;
} 

Collisions, however, complicate the matter. We cannot be sure that our element is in location element.hash(). 

 


Clustering

Clustering or clumping of data in a local area is a phenomenon of hashing

primary clustering results directly from synonyms

secondary clustering occurs from collisions from clusters growing into other locations

To reduce secondary clustering one of the following can be used

  1. Quadratic probing: search H(k)+1, H(k)+4, H(k)+9,...H(k)+ i2, where i is the number of collision resolution iterations executed so far.
  2. Rehashing: create several hash functions H1, H2, H3, and use them in succession. The final one would have to resort to something above. Rehasing resolves a collision by computing a new hash location from a hash function that manipulates the original location rather than the element’s key.
  3. Random probing: search H(k)+r1, H(k)+r2, ... H(k)+ri
    I.e., each r is a call to the RNG (without reseeding)
  4. With hashing by pseudo-random numbers, you can just get the next random number without resetting the seed.  An RNG will generate the same sequence from a given seed.

 

 

 

 


Chained hash tables

 

 

 

 

 

 

 

private class LLNode {
//inner class definition for a node
   private Object value;
   private LLNode   next;
   private LLNode(Object value, LLNode next){
      this.value = value;
      this.next = next;
   }
}
final static int SIZE = 127;
LLNode HashTable[] = new LLNode[SIZE];
  //the array should be initialize to nulls
  //LLNode has 'next' and 'obj' inst vars
private int Hash(){
// as an example, this hash function folds strings
    int sum = 0;
    for(int i=0; i<key.length; i++){
       sum = sum + key.charAt(i)*(key.length-i);//weighted
    }
    return sum % SIZE;
}
public static void add (Hashable element)
// Adds element to this list at position element.hash(),
// or the next free array slot.
{
int location = element.hash();
//assume the key is unique         //insert new node in as the head of the synonym list     HashTable[location] = new LLNode(element,HashTable[location] ; numElements++;
}
public static Hashable get(Hashable element)
   // Returns an element e from this list such that e.equals(element).
{
    LLNode temp = HashTable[element.Hash()];    //get head of LL of synonyms
    while(temp != null && temp.obj.getKey().compareTo(key)!=0){  //search down the LL
        temp = temp.next;
    }
    if(temp != null){   //return NULL or pointer to found node
       return temp.obj;
    }else{
       return null;
    }
}

 
 Objects can be easily deleted in this implementation.

 

 


Performance of hashing

Factors affecting the performance:

  1. quality of hash function
  2. collision resolution technique
  3. utilization of space (chained hash table excepted)

The load factor, , measures how full a table is.

also can represent the probability that an element is used.

= .5 means that 1/2 of the table is full, or 50% probability that any one element is used

= 1 means that the table is full

= 2 means (in the case of a chained hash table) that each element has an average chain length of 2

Expected number of probes for successful search of a hash table

Linear probing  
Random, rehash, quadratic  1/*ln (1/(1-))
Chaining  
=
Linear probing =
Quadratic probinng =
Chaining =

 


Java hashCode( ) method

Every object created is stored for fast retrieval through a hash table.  The Object class defines a public function hashCode that effectively returns the address of the object. Therefore all Java objects have an associated hash code.

You can take advantage of the Java system's built in hashing function by calling the hashCode method for the object and applying % size to this return address for an easy hash.

The Java Object class exports a hashCode method that returns an int hash code. The standard Java hash code for an object is a function of the object’s memory location.

For most applications, hash codes based on memory locations are not usable. Therefore, many of the Java classes that define commonly used objects (such as String and Integer), override the Object class’s hashCode method with one that is based on the contents of the object.

If you plan to use hash tables in your programs, you should do likewise.

 


Java API Hashtable class

The library includes several other collection classes, such as HashSet, that provide an ADT whose underlying implementation uses the approaches described in this section.

HashSet class

The Java Library includes a HashTable class that uses hash techniques to support storing objects in a table.  This is a basic hash table of unique values. Identify the basic operations available.

Hashtable class

It is a map which is a pairing of keys and objects