I found this article and thought it would be interesting to read.
The authors are Ying Shi and Eushiuan Tran, 18-742 Advanced Computer Architecture, Carnegie Mellon University, Pittsburgh, Pennsylvania.
Abstract
Sorting is used in so many important applications that there has been an abundance of performance analysis. However, most previous research is based on an algorithm's theoretical complexity on a non-cached architecture. Since most computers today contain caches, it is important to analyze sorting algorithms in terms of their cache performance. Current research concentrates on improving algorithms to decrease the cache miss rate, but has not specifically compared sorting algorithms according to cache performance, and researchers do not show how the cache affects the performance of each sorting algorithm. In this paper, we show that the cache does affect the performance of sorting algorithms in comparison with non-cached architectures. We ran experiments on execution timing, instruction count, and cache simulation for three sorting algorithms: quicksort, heapsort, and mergesort. These experiments show that overall performance is affected by the cache, which causes a deviation between the experimental timing curves and the theoretical complexity curves. However, the effect of the cache is not significant enough to change the performance ranking derived from a non-cached architecture. Quicksort was determined to be a good sorting algorithm in terms of both average theoretical complexity and cache performance.
1.0 Introduction
Sorting is a fundamental task that is performed by most computers. It is used frequently in a large variety of important applications: all spreadsheet programs contain some sorting code, and database applications used by schools, banks, and other institutions all contain sorting code. Because of the importance of sorting in these applications, dozens of sorting algorithms of varying complexity have been developed over the decades. Slow sorting methods such as bubble sort, insertion sort, and selection sort have a theoretical complexity of O(N²). Even though these algorithms are very slow for sorting large arrays, they are simple, so they are not useless. If an application only needs to sort small arrays, it is satisfactory to use one of the simple, slow sorting algorithms rather than a faster but more complicated one. For such applications, the increase in coding time and the greater probability of coding mistakes with the faster algorithm are not worth the speedup in execution time. Of course, if an application needs a faster sorting algorithm, there are many available, including quicksort, mergesort, and heapsort. These algorithms have a theoretical complexity of O(N log N); they are much faster than the O(N²) algorithms and can sort large arrays in a reasonable amount of time. The cost of these fast sorting methods is that the algorithms are more complex and harder to code correctly, but the result is an efficient sorting method capable of sorting very large arrays.
In addition to varying complexity, sorting algorithms also fall into two basic categories: comparison based and non-comparison based. A comparison based algorithm orders the array by comparing the value of one element against the values of other elements. Algorithms such as quicksort, mergesort, heapsort, bubble sort, and insertion sort are comparison based. Alternatively, a non-comparison based algorithm sorts an array without pairwise comparisons between elements. Radix sort is a non-comparison based algorithm that treats the sorting elements as numbers represented in a base-M number system and then works on the individual digits, one digit position at a time; a sketch is given below.
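To make the non-comparison based approach concrete, the following is a minimal sketch of a least-significant-digit radix sort on unsigned 32-bit keys, using base M = 256 so that each pass works on one byte. It is written in C purely as an illustration of the technique; it is not code from the paper, and the function name is our own.

#include <stdlib.h>
#include <string.h>

/* LSD radix sort: one counting pass per byte, from least to most significant digit. */
void radix_sort(unsigned int *a, size_t n)
{
    unsigned int *buf = malloc(n * sizeof *a);
    if (buf == NULL)
        return;

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};

        /* Count how many keys have each digit value in this position. */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;

        /* Turn the counts into starting offsets for each digit value. */
        for (int d = 0; d < 256; d++)
            count[d + 1] += count[d];

        /* Stable scatter into the buffer, then copy back for the next pass. */
        for (size_t i = 0; i < n; i++)
            buf[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, buf, n * sizeof *a);
    }
    free(buf);
}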
Because sorting is such a common task, used by many applications and performed by most computers, mathematicians and computer scientists have been researching and analyzing the performance of sorting algorithms for years. However, the performance analysis has mostly been based on the theory behind the algorithms. For example, Donald Knuth's The Art of Computer Programming, Volume III - Sorting and Searching is considered the bible of sorting algorithms because of its detailed analysis of many sorting algorithms. However, even though Knuth gives a complete analysis of the different algorithms, it is all based on a non-cached computer architecture: the analysis rests entirely on the theoretical complexity of each algorithm. As the cached computer architecture has become common today, it is necessary to analyze how a cached memory affects the performance of these sorting algorithms. This is not to say that the theoretical analysis is useless; it remains the fundamental analysis needed to study any kind of algorithm. Even though there is an abundance of previous research on the performance of sorting algorithms, most of it does not analyze how the sorting algorithms exploit the cache. As virtually all of today's computers contain a cached memory architecture, this is an area that is definitely lacking in research. In addition, because memory access times are improving more slowly than processor cycle times, the cache performance of an algorithm will have an increasingly large impact on its overall performance.
One of the few papers on the cache performance of sorting algorithms is The Influence of Caches on the Performance of Sorting by Anthony LaMarca and Richard E. Ladner. In that paper, LaMarca and Ladner investigate the performance of several sorting algorithms both experimentally and theoretically. They looked at four sorting algorithms (heapsort, mergesort, quicksort, and radix sort) and applied memory optimizations to improve their cache performance. Most of the paper focuses on how the sorting algorithms can be improved to lower the cache miss rate and on a theoretical analysis for predicting the miss rate of each algorithm. They do not compare the different sorting algorithms with respect to cache performance, nor do they compare the experimental execution times with the times deduced from the theoretical complexity. Currently, research is being done at the University of Washington, Department of Computer Science and Engineering, by Richard Ladner on the design and analysis of algorithms for cache performance. His project investigates cache-conscious algorithm design and analyzes algorithms, including sorting algorithms, in terms of their cache performance.
From the current research being done, we suspect that the cache does affect the performance of these sorting algorithms. In this paper, we will show that this is true, and we will also show how significantly the cache affects each algorithm's performance. In section 2, we give a short description of the three sorting algorithms that we analyze: quicksort, heapsort, and mergesort. We chose these three algorithms because they are all fast sorting methods with a theoretical complexity of O(N log N). Next, we explain our experimental methodology. Section 4 presents the detailed results and analysis from our experiments. Finally, section 5 contains the summary and conclusions, including future work in this area.
2.0 Description of Sorting Algorithms
2.1 Quicksort
Quicksort is a widely used sorting algorithm invented by C. A. R. Hoare in 1960. Its popularity has lasted since its invention because it is easy to implement, is a "general purpose" sort, and consumes fewer resources than many other sorting methods in a variety of situations. Quicksort is an in-place algorithm (using only a small auxiliary stack) requiring on average N log N operations to sort N items, and it has an extremely short inner loop. An analysis of why its average complexity is O(N log N) can be found here: http://wannabe.guru.org/alg/. However, a drawback of the algorithm is that it takes N² operations in the worst case. Quicksort is also known as a partition-exchange sort because that term captures the basic idea of the method. One of the elements is selected as the partition element. The remaining items are compared to it and a series of exchanges is performed. When the series of exchanges is done, the original sequence has been partitioned into three subsequences:
1. all items less than the partition element
2. the partition element in its final place
3. all items greater than the partition element
At this stage, subsequence 2 is a single element in its final position, and quicksort is applied recursively to subsequences 1 and 3. The whole sequence is sorted when the recursion terminates.
The partitioning process is carried out by selecting an element as the partition element. A pointer is then scanned from the beginning of the array until an element greater than the partition element is found, and another pointer is scanned from the end of the array until an element less than the partition element is found. These two elements are exchanged because they are clearly out of place. This process continues until the pointers cross, which is also the correct place to insert the partition element. A sketch of this partitioning scheme is given below.
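The following is a minimal C sketch of the partition-exchange scheme just described, using the last element of the subarray as the partition element (a common textbook choice; the paper does not say which element the authors selected). It is an illustration of the technique, not the authors' code.

/* Partition-exchange (quick)sort of a[lo..hi], pivoting on a[hi]. */
static void quicksort(int a[], int lo, int hi)
{
    if (lo >= hi)
        return;

    int v = a[hi];              /* partition element */
    int i = lo - 1, j = hi;

    for (;;) {
        while (a[++i] < v)      /* scan from the left for an element >= v */
            ;
        while (a[--j] > v)      /* scan from the right for an element <= v */
            if (j == lo)
                break;
        if (i >= j)             /* stop when the pointers cross */
            break;
        int t = a[i]; a[i] = a[j]; a[j] = t;    /* exchange the out-of-place pair */
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;      /* put the partition element in its final place */

    quicksort(a, lo, i - 1);    /* subsequence of smaller items */
    quicksort(a, i + 1, hi);    /* subsequence of larger items  */
}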
2.2 Heapsort
Heapsort is an elegant and efficient sorting method based on the operation of heaps. A heap is a "complete binary tree". It is constructed by placing a node called the root and then, proceeding down the tree level by level from left to right, attaching two nodes under each node until all of the nodes have been placed. The two nodes below a node are called its children, and the node above a node is called its parent. The item in each node should be larger than the items in its children, so the root contains the largest item to be sorted. Heapsort is a true "in place" sort because it uses no extra memory, and it is guaranteed to sort N elements in N log N steps no matter what the input is. However, since its inner loop is about twice as long as that of quicksort, it is about twice as slow as quicksort on average. The basic idea is to create a heap containing the items to be sorted and then to remove them in order: an element is pulled off the top of the heap, and the largest remaining element is moved to the top. This continues until all of the elements have been moved to the top and removed, yielding a sorted array. A sketch is given below.
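The following is a minimal C sketch of heapsort as described above: the array is first arranged into a max-heap (every parent larger than its children), and then the largest item is repeatedly swapped from the top of the heap to the end of the array and the heap is repaired. It is an illustration only, not the authors' code.

/* Restore the heap property for the subtree rooted at `start`, within a[0..end]. */
static void sift_down(int a[], int start, int end)
{
    int root = start;
    while (2 * root + 1 <= end) {
        int child = 2 * root + 1;
        if (child + 1 <= end && a[child] < a[child + 1])
            child++;                          /* pick the larger of the two children */
        if (a[root] >= a[child])
            return;                           /* parent already larger: heap property holds */
        int t = a[root]; a[root] = a[child]; a[child] = t;
        root = child;
    }
}

void heapsort_ints(int a[], int n)
{
    /* Build the heap so that each node is larger than its children. */
    for (int start = n / 2 - 1; start >= 0; start--)
        sift_down(a, start, n - 1);

    /* Repeatedly remove the largest item from the top of the heap. */
    for (int end = n - 1; end > 0; end--) {
        int t = a[0]; a[0] = a[end]; a[end] = t;
        sift_down(a, 0, end - 1);
    }
}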
2.3 Mergesort
To sort an array using the mergesort algorithm, you first divide the array to be sorted in half, sort the two halves recursively, and then merge the two halves together. It is essentially the complement of the quicksort algorithm, which sorts by the process of selection. Selection and merging are complementary operations because selection splits an array into two separate arrays, while merging joins two separate arrays into one array. Quicksort basically entails a selection procedure followed by two recursive calls; complementarily, mergesort entails two recursive calls followed by a merging procedure. Like heapsort, mergesort will sort an array of N elements in time proportional to N log N, even in the worst case. The main disadvantage of mergesort is that extra space proportional to N is required during the sorting process. A sketch is given below.
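The following is a minimal C sketch of top-down mergesort matching the description above: divide, sort the two halves recursively, then merge. The auxiliary array tmp is the extra space proportional to N that the text mentions. It is an illustration only, not the authors' code.

#include <stdlib.h>

/* Merge the sorted halves a[lo..mid] and a[mid+1..hi] using tmp as scratch space. */
static void merge(int a[], int tmp[], int lo, int mid, int hi)
{
    int i = lo, j = mid + 1, k;
    for (k = lo; k <= hi; k++) {
        if (j > hi || (i <= mid && a[i] <= a[j]))
            tmp[k] = a[i++];      /* take from the left half  */
        else
            tmp[k] = a[j++];      /* take from the right half */
    }
    for (k = lo; k <= hi; k++)
        a[k] = tmp[k];
}

static void msort(int a[], int tmp[], int lo, int hi)
{
    if (lo >= hi)
        return;
    int mid = lo + (hi - lo) / 2;
    msort(a, tmp, lo, mid);         /* sort the left half   */
    msort(a, tmp, mid + 1, hi);     /* sort the right half  */
    merge(a, tmp, lo, mid, hi);     /* merge the two halves */
}

void mergesort_ints(int a[], int n)
{
    int *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return;
    msort(a, tmp, 0, n - 1);
    free(tmp);
}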
3.0 Experimental Methodology
3.1 Apparatus
All of the experimental timings were done on a Digital AlphaStation 500/400 (a 400 MHz machine). The AlphaStation has a split level-one cache of 8K each for instructions and data, a unified level-two cache of 96K, and a unified level-three cache of 2M. We examined three sorting algorithms in our experiment: quicksort, heapsort, and mergesort. We used Atom to obtain instruction counts and Dinero for cache simulations. Atom is a generalized instrumentation tool that runs on Digital's Alpha workstations, and Dinero is a widely used cache simulator.
3.2 Experiment
3.2.1 Timing
As noted in the introduction, we suspected that the cache affects the performance of these sorting algorithms. To verify this in our experiment, and to investigate how the cache affects each algorithm's performance, we started by obtaining timing results for varying array sizes. The elements in the array were randomly generated, but the same set of randomly generated elements was sorted by each sorting algorithm. Since these sorting algorithms are all fast, the raw timings were not accurate enough for the smaller array sizes: the resulting graphs had small jumps at small array sizes, meaning that timer quantization was yielding inaccurate answers. Therefore, to achieve more accuracy in the timing results, we averaged the timings over more than one run. We then plotted the execution time per key sorted (i.e., overall execution time divided by the sorted array size), which generated a more accurate curve for the smaller array sizes. Next, in order to explain the ranking of the programs' performance and their mismatch with the execution timings calculated from the theoretical complexity, we ran further tests exploring the performance of the sorting algorithms.
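The following C sketch illustrates the kind of timing harness described above: the same randomly generated input is copied and sorted on each run, the measurement is averaged over several runs, and the result is reported as execution time per key sorted (total time divided by the array size). The use of clock() and the function names are our own assumptions; the paper does not describe the authors' actual harness.

#include <stdlib.h>
#include <time.h>

/* Average a sort's execution time over `runs` runs and return seconds per key sorted.
 * `sort` is a pointer to one of the sorting routines under test. */
double time_per_key(void (*sort)(int *, int), const int *data, int n, int runs)
{
    int *work = malloc(n * sizeof *work);
    double total = 0.0;

    if (work == NULL)
        return 0.0;
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < n; i++)
            work[i] = data[i];               /* same input for every run and every sort */
        clock_t start = clock();
        sort(work, n);
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    free(work);
    return total / runs / n;                 /* execution time per key sorted */
}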
3.2.2 Instruction Count
To determine the instruction count for varying array sizes, we modified the Atom instrumentation files to keep track of the number of instructions needed to execute the sort subroutine of each sorting program. The Atom instrumentation files can be found by looking here.
3.2.3 Dinero Cache Simulation
Finally, we also ran Dinero cache simulations to obtain the data miss rates for the three sorting programs.