| anthony's homepage | - | lamarca.org |
My cache performance research focuses on the design and analysis of algorithms in the presence of caching. Cache miss penalties have been steadily increasing and memory overhead is now a dominant factor in the execution time of many programs. This trend leads to two important problems. First, many fundamental algorithms were developed without considering caching. The result is that "optimal" algorithms often perform poorly due to cache behavior. Second, despite the high cost of cache misses, very few designers understand the caching mechanism and the effects their programming decisions have on memory system performance. Bad cache performance is often described as "cache effects" and is dismissed as randomness in the execution rather than a factor than can be understood and controlled.
My research addresses these two problems. In my papers, I revisit the design decision of classic algorithms and explore the opportunities offered by a cache-conscious design. I show that a balanced approach that considers memory behavior as well as instruction counts yields better performance with less effort. Another goal of my research is to make analyzing and understanding cache performance easier. I have explored the deficiencies of existing program analysis tools and have developed a framework that facilitates cache-conscious design.
To demonstrate the benefits of cache-conscious design, I tuned the memory performance of classic searching and sorting algorithms. Consider for example, the implicit heap. Traditional heaps are implicit representations of binary trees. If more that two heap elements fit in a cache block, the spatial locality of heaps can be greatly improved by increasing the fanout of the tree. This simple change combined with a small adjustment in data layout reduced cache misses in my experiments by more than 60%. This reduction in cache misses resulted in an overall speedup of up to 100%. My study of searching and sorting algorithms validates an approach that reduces cache misses as well as instruction counts.
To facilitate the analysis of cache performance, I have developed a model which predicts the cache performance of an algorithm without implementations or address traces. Collective analysis is a framework for analyzing algorithm behavior in direct-mapped caches. To validate my model, I applied collective analysis to heaps and compared the predictions to trace-driven cache simulation results. The comparison showed that collective analysis accurately predicts the heap's cache miss rate.
I have also sought to understand how the increase in cache miss penalties has changed the relative performance of familiar algorithms. I revisited a study of priority queues originally performed by Jones in 1986. The relative performance result I obtain on today's machine differ greatly from the machines of only ten years ago. In Jones's study, heaps performed poorly and pointer-based self-balancing structures such as splay trees performed best overall. In contrast, my experiments showed that pointer-based queues always incurred higher cache miss rates than implicit queues. These increases in miss rates far outweighed the instruction count advantages that the pointers introduce. In my experiments, the simple implicit heap outperformed splay trees by as much as a factor of four. These results strongly suggest that overall performance cannot be understood without an understanding of memory system behavior.
My work to date has had two thrusts: the design of fundamental algorithms with good memory system performance, and the development of tools that aid programmers in writing efficient code. I expect to continue working in both of these areas.
My algorithm work has three goals. First, it demonstrates the importance of optimizing cache performance. Second, it generates optimizations which can be generalized and formed into a set of techniques for improving cache performance. Lastly, a useful byproduct of this work is a group of algorithms with excellent overall performance.
My work on programming tools seeks to provide high-level information regarding a program's memory system behavior. Analytical models are difficult to apply and trace-driven simulations provide very little intuition. I am interested in developing tools that would provide useful high-level information regarding a program's memory system behavior. Tools such as Memspy and Cprof are a good first step, but they are still fairly primitive. They would not, for instance, provide any insight on the cache inefficiencies I found in heaps. I hope to develop tools that are easy to use and can suggest ways to improve a program's memory system behavior.