Hardware performance counters are available on most modern microprocessors. They are implemented as a small set of registers that count events related to the processor's operation. The Perfctr toolkit is one of the most popular toolkits for monitoring these events on x86 processors. In this paper, we use it to measure the impact of L1 data cache misses on the overall performance of six integer sorting algorithms. Most of these are either recently introduced cache-conscious algorithms, algorithms known to behave well in previous simulations, or algorithms that have not been explored at all. We demonstrate through experiments on an Athlon processor that, in practical cases, the fastest sorting algorithm is the one that strikes a good balance between L1 data cache misses and retired instructions. It is not the implementation that gives the smallest number of misses and the smallest number of instructions. The fastest algorithm in practice is a new flavour of merge-sort that we have developed, which outperforms its rival.

Keywords: hardware performance counters, cache-conscious and cache-oblivious algorithms, in-core sorting algorithms, two-level memory hierarchy, parallelism at the chip level.