Merge Sorting

last updated 4/20/08

 


MergeSort

This is another O(n log n) sorting algorithm that uses a common operation of merging two already sorted lists.

The algorithm is easy to specify recursively:

  1. Divide the array into two halves, A and B
  2. Recursively sort A and B
  3. Merge A and B into a full sized list

A quick complexity analysis is as follows

  1. Dividing into halves is O(1)
  2. C(N) = C(A) + C(B) + C(merge)
  3. C(merge) = O(n)

Since A=B= N/2, we have the pattern C(N) = 2*C(N/2) + O(n) which O(n log n)

 

 


Merging

Merging is a common operation that occurs outside the realm of sorting.  Often we have a sorted list and a second sorted list that we need to combine with the original.  Example: master file of employees, file of new hires and we want to merge them. One solution is to append the second list to the first and sort, but that is O(n log n).

Merging is a linear process.

Example:

2 4 6 6 9 10
5 7 12 13 15

General algorithm to merge two arbitrary Object arrays:

public static void merge (int leftFirst, int leftLast, int rightFirst, int rightLast)
{
   // pre: values[leftFirst]..values[leftLast] are sorted
   //      values[rightFirst]..values[rightLast] are sorted
   //Sorts values[leftFirst]..values[rightLast] by merging the two subarrays
   int[] tempArray = new int[SIZE];
   int index = leftFirst;
   int saveFirst = leftFirst;    // to remember where to copy back

   while ( (leftFirst < leftLast) && (rightFirst < rightLast) )
   {
     if (values[leftFirst] < values[rightFirst])
     {
       tempArray[index] = values[leftFirst];
       leftFirst++;
     }
     else
     {
       tempArray[index] = values[rightFirst];
       rightFirst++;
     }
     index++;
   }

   while (leftFirst < leftLast)
   // copy remaining elements from left half
   {
     tempArray[index] = values[leftFirst];
     leftFirst++;
     index++;
   }
       
   while (rightFirst < rightLast)
   // copy remaining elements from right half
   {
     tempArray[index] = values[rightFirst];
     rightFirst++;
     index++;
   }

   for (index = saveFirst; index <= rightLast; index++)
     values[index] = tempArray[index];

}

 


The MergeSort Algorithm

Note this uses twice the space.  Mergesorting is more difficult sorting within the original array.

public static void mergeSort(int first, int last)
//Sorts the values array using the merge sort algorithm
{
   if (first < last)
   {
     int middle = (first + last ) / 2;
     mergeSort( first , middle );
     mergeSort( middle+1 , last);
     merge(first, middle, middle+1, last);
   }
   
}
 


Importance of the Merge Sort

All of the sorting algorithms are "in-memory" sorts.  They assume the data are all in an array (we need the random access capability). 

In a database setting, the amount of data to be sorted is too large to effectively store in memory at once.  The data resides in files in a sequential format.  We think of files being organized into records and fields of the records.  Records represent a related collection of data--similar to the idea of an object.  An employee record (object), for example.

Files can be viewed as an array and you can access records randomly, i.e., get the 25th record without accessing the preceding 24.

PROBLEM with random file access is TIME to access a record.  In memory access ~10nsec,  file record access ~10msec (1,000,000 factor difference).

Blocked records and buffering

The most efficient method to process the records of a file is to read sequentially.  Records of a file are "blocked". When you read the first record of a file you get a chunck of the file called a block which corresponds to a multiple of disk sectors (512, 1024, 2048, 4096, 8192 bytes, etc.).  That block is loaded into memory in one disk access.  The next few records are already in memory (nsec access). The number of records in a block is called the "blocking factor".  When the block's records are exhausted, then the next block is read in.  Double buffering techniques can overlap--start the second block input while the first is being processed.


Two-phase Merge Sort

Assume

Typical algorithm has two phases:

  1. Partial sort phase: sort M pages at a time; create F/M sorted runs on mass store, cost = 2F
  2. Merge Phase: merge all runs into a single run using M-1 buffers for input and 1 output buffer

Merge step: divide runs into groups of size M-1 and merge each group into a run; cost = 2F

Repeat the merge step until one run remains (each step reduces number of runs by a factor of M-1 )

Using merging a file can be sorted in O(n log n) time, minimizing the number of disk accesses.  You will have F blocks to read and write and there will be O(log n) passes. 

Example analysis:

 

 


Treesorting

Worst case has quadratic performance (degrades to the insertion sort implemented in a linked list).


Heap Sort

Use a heap array structure (priority queue) to sort.  Also O(n log n).