1 Introduction
Performance profiling plays a crucial role in software development, allowing programmers to assess the efficiency of an application and discover possible performance bottlenecks. Traditional profilers associate performance metrics with nodes or paths of the control flow or call graph by collecting runtime information on specific workloads [2], [3], [4], [5]. These approaches provide valuable information for studying the dynamic behavior of a program and guiding optimizations toward the portions of the code that consume the most resources on the considered inputs. However, they may fail to characterize how the performance of a program scales as a function of the input size, which is crucial for the efficiency and reliability of software. Seemingly benign fragments of code may be fast on some testing workloads, passing unnoticed in traditional profilers, only to become major performance bottlenecks when deployed on larger inputs.

As an anecdotal example, we report a story [6] related to the COSMOS circuit simulator, originally developed by Randal E. Bryant and his colleagues at CMU [7]. When the project was adopted by a major semiconductor manufacturer, it underwent a major performance tuning phase, including a modification to a function in charge of mapping signal names to electrical nodes, which appeared to be especially time-consuming: by hashing on bounded-length name prefixes rather than on entire names, the simulator became faster on all benchmarks. However, when circuits later grew larger and adopted hierarchical naming schemes, many signal names ended up sharing long common prefixes, and thus hashed to the same buckets. As a result, the simulator startup time became intolerable, taking hours for what should have required a few minutes. Identifying the problem, which had been introduced several years earlier in an effort to optimize the program, required several days of analysis. There are many other examples of large software projects where problems of this sort occurred [8].
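To make the pitfall concrete, the following sketch (a hypothetical reconstruction; the actual COSMOS code is not reported in [6], [7]) contrasts hashing an entire signal name against hashing only a bounded-length prefix. With hierarchical names such as top.cpu.alu.bit0, top.cpu.alu.bit1, ..., every name shares the same bounded prefix and lands in a single bucket, so hash-table lookups degrade from expected constant time to linear scans, while a profiler run only on small, flat-named benchmarks would never flag the function.

#include <stdio.h>

#define NBUCKETS   1024
#define PREFIX_LEN 8    /* hypothetical bound on the hashed prefix */

/* Hash the entire signal name: hierarchical names spread across buckets. */
static unsigned hash_full(const char *name) {
    unsigned h = 5381;
    for (; *name; name++)
        h = h * 33 + (unsigned char)*name;
    return h % NBUCKETS;
}

/* Hash only the first PREFIX_LEN characters: names sharing a long common
 * prefix (e.g. "top.cpu.alu.bit0", "top.cpu.alu.bit1", ...) all map to the
 * same bucket, so lookups degrade to linear scans of one chain. */
static unsigned hash_prefix(const char *name) {
    unsigned h = 5381;
    for (int i = 0; i < PREFIX_LEN && name[i]; i++)
        h = h * 33 + (unsigned char)name[i];
    return h % NBUCKETS;
}

int main(void) {
    char name[64];
    for (int i = 0; i < 4; i++) {
        snprintf(name, sizeof name, "top.cpu.alu.bit%d", i);
        printf("%-20s full=%4u prefix=%4u\n",
               name, hash_full(name), hash_prefix(name));
    }
    return 0;   /* the prefix hash prints the same bucket for every name */
}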