When Parallel Speedups Hit the Memory Wall

After Amdahl's trailblazing work, many other authors proposed analytical speedup models, but none has considered the limiting effect of the memory wall. These models exploited aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but they assume data-access delays to be constant. Nevertheless, such delays can vary, for example, according to the number of cores used and the ratio between processor and memory frequencies. Given the large number of possible configurations of operating frequency and number of cores that current architectures can offer, speedup models that describe such variations among these configurations are quite desirable for off-line or on-line scheduling decisions. This work proposes a new parallel speedup model that accounts for variations in the average data-access delay to describe the limiting effect of the memory wall on parallel speedups in homogeneous shared-memory architectures. Analytical results indicate that the proposed model can capture the desired behavior, and experimental results on real hardware validate it. Additionally, we show that by accounting for parameters that reflect the intrinsic characteristics of the applications, such as the degree of parallelism and susceptibility to the memory wall, our proposal has significant advantages over machine-learning-based modeling. Moreover, our experiments show that conventional machine-learning modeling, besides being a black box, needs about one order of magnitude more measurements to reach the same level of accuracy achieved by the proposed model.


Introduction
Amdahl's Law [Amd67] has driven the chase for single-processor performance improvements for decades, but the end of frequency-upscaling and the stagnation of instruction level parallelism altogether led to the dawn of a new computational era: the multi-core and many-core era.
In this new era, parallel computing has become the conventional approach to achieve ever-increasing computational performance. Although parallelism is not new in computational systems, its real potential has been obscured for many decades by two main factors: Amdahl's skepticism about the ability of parallel systems to scale performance, and the exponential speed growth of single-processor systems. It is now a consensus that Amdahl had a limited view of parallelism, and thus numerous works have emerged that express and exploit the advantages that parallel computing can offer [Gus88,SN93,Shi96,HM08,SC10]. Continuing to broaden and explore different views on parallelism remains of vital importance in maximizing the potential that parallel computing can offer.
This paper widens the views on parallelism by exploring the effects of the number of cores and their operating frequency on the data-access delay for parallel applications that make extensive use of the main memory. Memory-bound programs are hard to model because their behavior is volatile across runs with different inputs and system configurations due to the variability of how such applications exploit the memory hierarchy. We dedicate the following paragraphs to describe the existing views on parallelism, which we argue do not consider these aspects.
Amdahl showed that even a tiny non-parallelized fraction of an application's code could compromise the ability of multiple processors to scale the application's performance [Amd67]. Long after Amdahl's work on the inability of using multiple processors to scale performance, Gustafson's "fixed-time speedup" approach to parallelism showed that larger programs can benefit from more processors [Gus88]. Amdahl's "fixed-size speedup" had a limited view on the potential of parallelism. Gustafson's scaling model, known as Gustafson's Law, opened the path to the multi-core and many-core era. In [Shi96], the author unifies Amdahl's and Gustafson's works and concludes that using the execution times instead of the serial and parallel fractions of the code could have avoided decades of unconstructive criticism against the advantages of using parallel processing. Sun and Ni [SN93] coined another prevalent model shortly after Gustafson's seminal work. The authors present a memory-bounded speedup model, known as Sun and Ni's Law. Their modeling demonstrates that the memory size is a limiting factor for parallel scalability.
More recently, other models extend these analyses to multi-core architectures, showing that they scale better for asymmetric and dynamic multi-core chips [HM08]. In [SC10], the authors summarize the contributions of three main speedup models (fixed-size, fixed-time, and memory-bounded speedup models) to the multi-core era, presenting a very optimistic view. However, their view assumes that the data-access delay is fixed and independent of the number of cores and problem sizes. This assumption is often unrealistic because of the memory wall [WM95], caused by the increasing data-access delay as the number of cores increases. In the following, we discuss three of the significant factors that can affect the data-access delay of an application: the application's problem/input size, the number of cores utilized, and the ratio of the processor's and memory's frequencies.
While the scaling of the problem size may affect the data-access delay, whether this effect is negative or positive for performance depends on the application's nature and on how the application is utilizing the targeted architecture. In general, increasing the input size can trigger a higher activity in the memory hierarchy, causing more cache misses, which subsequently generates more main memory accesses per cycle. Often, cache-blocking techniques can be applied to avoid or reduce this effect. The modeling presented in this paper does not consider variations in the problem/input size.
Increasing the number of cores can have an even more significant effect on the data-access delay depending on the architecture's characteristics. For instance, even with the problem size kept constant, using more processing cores can cause an increasing data-access delay because the rate of access requests per cycle can increase due to more cores making simultaneous requests to the same memory. When the demand for accesses reaches the memory's nominal rate of attended requests per cycle, the average data-access delay starts to increase, stagnating the performance scaling in the number of cores, even for codes that are entirely parallel or that have a tiny serial fraction. Hence, for these cases, increasing the number of cores can indeed increase the data-access delay, which has an adverse effect on speedup in a form that resembles an increase in the serial fraction of the application. On the other hand, in the case of private caches, increasing the number of cores can lead to more available caches, and thus to fewer memory accesses, which, up to a degree, will have a positive effect on the data-access delay and will possibly allow further performance gains through parallelization.
A third factor to consider is the ratio of the processor's and memory's frequencies. If the processor is running significantly faster than the memory, the data-access delay relative to the processor speed may also increase. Considering all these factors and their interactions is crucial both for developing parallel programs that do not become bounded by the memory and for finding the optimal configuration of the number of cores and the processor's frequency that achieves maximum speedup for an application. Currently, there is no model to capture these effects altogether.
In this paper, we present a new analytical speedup model for multi-core architectures that captures the adverse and the favorable effects on performance due to variations in the data-access delay caused by increasing the number of cores (see Section 2).
We initially investigate the potential abilities of our model to capture the above effects analytically (Section 3). The analytical results indicated that the speedup is dependent on the ratio between the frequencies of the processor and the main memory, both for memory-bound applications and for processor-bound applications that became memory-bounded after an increase in the number of cores. The analysis indicated that the larger this ratio, the higher its limiting effect can be on the speedup and that this limitation grows with the degree of parallelism of the code.
The proposed model was then fitted to actual hardware measurements to validate our analytical findings (Section 4). Furthermore, we demonstrate that our approach has higher accuracy and lower variance than Amdahl's model (Section 4.2). Comparisons to other analytical speedup models would not be relevant here, since the other existing models differ from Amdahl's model by aspects that were kept constant in our experiments, such as the problem size and architectural features like memory hierarchy and the amount of memory available.
These behavioral aspects are orthogonal to the memory wall aspect and complement our work. We compare our model proposition to a non-linear machine learning regression approach (Section 4.3), which is arguably more flexible than any analytical model. In this comparison, the proposed model is demonstrated to exhibit a higher accuracy while using fewer hardware measurements.
Finally, based on the presented modeling and experimental results, we discuss the implications that the contributions of this paper can have for application-specific multi-core design and for more energy-efficient parallel software.
The paper is organized as follows. In Section 2 we present our modeling for speedup as a function of the ratio between processor and memory frequencies. In Section 3 we analyze the model behavior. In Section 4, we detail the methodology used to validate the proposed models and provide results of experiments in real hardware. In Section 5 we put our contributions in perspective with the existing literature and, finally, in Section 6, we draw conclusions and suggest future work.

Variable-delay speedup model
In this section, we devise a new parallel speedup model that accounts for the effect of the variation in the number of cores on the data-access delay. Furthermore, the model allows us to describe the effect that variations of the ratio between processor and memory frequencies have on the speedup.
Let us first restate the equation for the speedup of an application running in parallel with p cores as follows:

$$S = \frac{T_s}{T_p} \qquad (1)$$

where $T_s$ is the sequential time, measured when running the application on a single core, and $T_p$ is the time for running the same application in parallel with p cores. We now make some assumptions that are necessary to devise the proposed model. The model validation presented in Section 4 later shows that they hold satisfactorily.

Assumption 1: the computations of a given application can be divided into two types of instructions: memory instructions and processor instructions. The former are the loads and stores that generate accesses to the main memory; the latter are the instructions that are carried out without data transfer and the loads and stores that are captured by the cache hierarchy. The total number of instructions is then given by

$$W = C + M \qquad (2)$$

where C is the number of processor instructions, and M is the number of memory instructions.
Assumption 2: the memory system is a centralized entity and serves all the processing cores uniformly, which resembles most current multi-core architectures.

Assumption 3: for a specific processor frequency, the execution time of processor instructions can be approximated by an average value $t_c$, which is inversely proportional to the processor operating frequency.

Assumption 4: for a specific processor frequency and memory frequency, the time necessary to execute a memory instruction, as defined in Assumption 1, can be approximated by an average value $t_m$.
Then, the sequential execution time for the computation of all W instructions can be given by

$$T_s = t_c C + t_m M \qquad (3)$$

Accordingly, the formulation of an equation for the parallel execution time for the computation of the same W instructions depends on how these instructions are distributed and carried out by multiple processing elements. We use a simplistic model first coined by Amdahl in [Amd67] to model parallel software. The computation is modeled by a parallel fraction f, representing the instructions that have no dependencies among them and that could be executed in parallel with no performance penalty, and its complement (1 − f), which corresponds to the serial fraction, that is, the fraction of code that cannot be parallelized. The parallel execution time for p processing cores would then be given by

$$T_p = (1 - f)\,(t_c C + t_m M) + \frac{f\,(t_c C + t_m M)}{p} \qquad (4)$$

Amdahl's model arises from combining (1) and (4), such that

$$S = \frac{1}{(1 - f) + \frac{f}{p}} \qquad (5)$$

However, with Assumption 2, we must consider that the memory system can only attend requests at a given maximum rate. Therefore, the term that is divided by p in (4) cannot decrease indefinitely. In fact, the execution time of the whole parallel computation cannot be accelerated beyond $t_m M$ by increasing p, which leads us to equation (6) for the parallel execution time of the W instructions with p processing cores.
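The relation between the sequential time, Amdahl's parallel time, and a memory-limited parallel time can be illustrated with a minimal Python sketch. The first two functions follow the definitions of the sequential and Amdahl parallel times given in the text. The third function, `floored_parallel_time`, is only one possible reading of the sentence stating that the term divided by p cannot drop below $t_m M$; it is an assumption made for illustration and is not claimed to be the paper's equation (6). The workload numbers are hypothetical.

```python
def sequential_time(C, M, t_c, t_m):
    """Sequential time: every processor and memory instruction runs on one core."""
    return t_c * C + t_m * M


def amdahl_parallel_time(C, M, t_c, t_m, f, p):
    """Amdahl-style parallel time: serial part unchanged, parallel part split over p cores."""
    T_s = sequential_time(C, M, t_c, t_m)
    return (1.0 - f) * T_s + f * T_s / p


def floored_parallel_time(C, M, t_c, t_m, f, p):
    """ASSUMPTION (illustrative only): the term divided by p is floored at t_m * M."""
    T_s = sequential_time(C, M, t_c, t_m)
    return (1.0 - f) * T_s + max(f * T_s / p, t_m * M)


def speedup(T_s, T_p):
    """Speedup as the ratio of sequential to parallel time."""
    return T_s / T_p


# Hypothetical workload: 2% memory instructions, each 5x slower than a processor instruction.
C, M, t_c, t_m, f = 980.0, 20.0, 1.0, 5.0, 0.99
T_s = sequential_time(C, M, t_c, t_m)
for p in (2, 8, 32, 128):
    s_amdahl = speedup(T_s, amdahl_parallel_time(C, M, t_c, t_m, f, p))
    s_floor = speedup(T_s, floored_parallel_time(C, M, t_c, t_m, f, p))
    print(f"p={p:4d}  Amdahl={s_amdahl:7.2f}  floored={s_floor:6.2f}")
```

With these numbers the two curves coincide for small p, and the floored variant stops improving once the parallel term reaches $t_m M$, which is the saturation behavior the text attributes to the memory wall.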
Next, we devise a model that accounts for the variation in the number of memory accesses, dependent on the number of cores used, and the variation in the average duration of a memory instruction, dependent on the processor and memory frequencies ratio.
By combining (1), (3) and (6), we derive the first form of our speedup model, (7). Dividing its numerator and denominator by $t_c$, we can rewrite (7) in terms of the ratio between the time to complete a memory instruction and the time to complete a processor instruction, which gives (8), where $\rho = t_m / t_c$ denotes the ratio between $t_m$ and $t_c$.
The average duration of a memory instruction should depend on the processor instruction execution time and on the memory frequency, according to Assumption 4. We model this dependence in (9), where k is an application model parameter that models how the computation of memory instructions is affected by the frequency of the main memory. The effect of k is stronger for memory-bound applications and weaker for those that are CPU-bound. So, considering (9) and Assumption 3, the ratio ρ can be expressed as in (10), as a function of $\phi = F_{CPU} / F_{Mem}$, the ratio between processor and memory frequencies, with $F_{CPU}$ and $F_{Mem}$ denoting the processor and memory frequencies, respectively. Finally, to remove the absolute values of M and C from (8), we can rewrite it as (12), in terms of the fraction of memory instructions over the total number of instructions, $\mu = M / W$. Consequently, $1 - \mu = C / W$ is the fraction of processor instructions over the total number of instructions involved in the computation. The ratio µ, however, is not fixed due to Assumption 1. When we vary the number of cores, the value of µ may also change due to the addition of more private caches, as discussed in Section 1. To account for variations in the number of memory instructions caused by variations in the number of cores, we rewrite (12) to express the final form of our proposed variable-delay speedup model, (15).
In (15), $\mu_p$ is the fraction of memory instructions observed when using p cores, defined in (16), with $m_1$ and $m_2$ denoting application model parameters and $\mu_1$ representing the serial case of $\mu_p$, with p = 1. The minimum function min(·, 1) in (16) limits the upper value of $\mu_p$ to 1, which represents an application that is 100% dependent on memory instructions. The term $m_1$ accounts for the portion of accesses that are not affected by changes in the number of cores. The term $m_2$ accounts for the portion of accesses that varies with changes in the number of cores; for example, it would vary µ due to the addition of more private caches. With more caches, the main memory receives fewer accesses, and µ should decrease.
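The qualitative behavior described for the fraction of memory instructions (a constant share, a share that shrinks as cores and their private caches are added, and a cap at 1) can be sketched as follows. The functional form `m1 + m2 / p` is an assumption chosen to match that description; it is not claimed to be the paper's exact definition, and the parameter values are hypothetical.

```python
def mu_p(p, m1, m2):
    """ASSUMED form: constant share m1 plus a share m2 that shrinks as 1/p, capped at 1."""
    return min(m1 + m2 / p, 1.0)


# Hypothetical parameters: 5% of instructions always reach main memory,
# and a further 40% do so in the serial case (p = 1).
for p in (1, 2, 4, 8, 16):
    print(p, round(mu_p(p, m1=0.05, m2=0.4), 4))
```

Under this assumed form, the serial case is simply `mu_p(1, m1, m2)`, and larger values of `m2` make the memory share fall faster as cores are added.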

Model Analysis
In this section, we perform two parametric analyses with the model proposed in (15) to investigate the model's behavior. What we intend is to present the model's ability to capture the performance-limiting behavior caused by a change in the data-access delay. Then, in Section 4, this ability is validated by fitting the model in (15) to hardware measurements. Firstly, we investigate the dependency between the number of cores and the data-access delay which causes the memory performance to decrease with an increase in the number of active cores. Secondly, we investigate the performance predictions for variations on the ratio between processor frequency and memory frequency.
Because exhaustive analyses with seven parameters (f, k, $m_1$, $m_2$, f, φ, and p) would be impractical, we propose a set of parameter-value combinations whose variations can better expose the behavior expected to be modeled.

Number of cores versus data-access delay
We analyzed the behavior of the proposed speedup model for systems with 2, 4, 8, 16, 32 and 64 processing cores. We assumed a parallel fraction f = 0.99, representing a highly parallel code, and a processor and memory frequencies ratio φ = 3.0, which would denote, e.g., the memory operating at 1.0 GHz and the processor at 3.0 GHz. Fig. 1 presents the speedup plots of these configurations for different values of k, $m_1$, and $m_2$. As Fig. 1 shows, the model indicates that the ratio ρ, affected by k, has a significant effect on the speedups. The higher the k, the higher the limiting effect on speedups as the number of cores increases, which resembles the effect of a reduction of the parallel fraction of the code. So, the k parameter controls the memory access behavior of applications that depend on the variations of CPU and memory frequency ratio. For lower values of k and $m_2$, the speedups saturate faster with the increase in the number of cores, indicating that the application transitions from a processor-bound mode to a memory-bound one. Fig. 1 also indicates the positive effects on the speedups caused by varying the number of cores with private caches. For larger values of $m_2$, which drives the number of memory instructions down with the use of more cores, the speedups are considerably larger. Higher values of $m_2$ allow the transition to a memory-bound mode to happen at a larger number of cores with higher speedups, whereas lower values force this to happen at smaller numbers of cores with lower speedups.
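As a reference point for reading these curves, the values below are what Amdahl's model alone gives for f = 0.99 at the analyzed core counts. They are computed here from Amdahl's formula and are not read from Fig. 1.

```latex
\[
S_{\mathrm{Amdahl}}(p) = \frac{1}{(1 - f) + f/p}, \qquad f = 0.99
\]
\[
\begin{array}{c|cccccc}
p & 2 & 4 & 8 & 16 & 32 & 64 \\ \hline
S_{\mathrm{Amdahl}}(p) & 1.98 & 3.88 & 7.48 & 13.91 & 24.43 & 39.26
\end{array}
\]
```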

Frequency ratio versus data-access delay
The analytical results of the previous subsection indicate that memory-bounded applications lose the apparent advantages of using more cores to achieve more considerable speedups at some point. The capacity of the memory to hold down the average data-access delay limits the speedup. Nonetheless, the effects of varying the ratio between the processor and memory frequencies remain to be analyzed.
With the following analysis, we intend to show that, according to the proposed model, a memory-bounded application can become processor bounded with a suitable adjustment of the ratio φ in order to make the processor work more symbiotically with the memory and, thus, could avoid processor idling, increase efficiency and decrease energy consumption.
We analyzed the behavior of our speedup model for computational tasks with parallel fractions f = 0.99 running with 32 processing cores. Processor and memory frequency ratios varied according to φ = {1.0, 1.5, 2.0, 2.5, 3.0}, for which the plots are depicted in Fig. 2. Note, in Fig. 2, that larger speedups can be achieved by reducing the ratio φ in almost all analyzed configurations. This shows that the decay in memory performance could be avoided by a suitable reduction of the processor's operating frequency.

Model validation
In this section, we present the results of several modeling experiments in order to validate the proposed model with real applications running on multi-core processors.

Experimental Setup
We have measured the execution times for a set of applications varying the number of cores and their operating frequency in order to calculate their speedups for each frequency value. For our validation, applications from the PARSEC [Bie11] and SPLASH-2 [WOT+95] parallel benchmark suites have been used. They comprise a large and diverse set of applications, covering several different application domains, such as computational finance, computer vision, real-time animation and media processing. In total there were 25 programs, 11 from the PARSEC suite and another 14 from the SPLASH-2 suite.
The measured execution times were used to fit the proposed model and Amdahl's model for each application. All model variables were fitted using the Particle Swarm Optimization (PSO) [KE95] global optimization method to minimize the Mean Squared Error (MSE) between the measured application speedups and their models. The PSO algorithm used was the version with the coefficient of constriction [CK02].
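The fitting step can be illustrated with a minimal, standard-library-only sketch of a constriction-coefficient PSO that fits the single parameter f of Amdahl's model by minimizing the MSE. This is not the authors' library: the function names, the choice c1 = c2 = 2.05 (a common setting for the constriction variant), the small swarm size, and the synthetic noise-free "measurements" are all assumptions made for illustration.

```python
import random


def amdahl(f, p):
    """Amdahl's speedup for parallel fraction f on p cores."""
    return 1.0 / ((1.0 - f) + f / p)


def mse(f, samples):
    """Mean squared error between modeled and measured speedups."""
    return sum((amdahl(f, p) - s) ** 2 for p, s in samples) / len(samples)


def pso_constriction(cost, lo, hi, n_particles=20, n_iter=100, seed=0):
    """1-D PSO with a constriction coefficient (assumed c1 = c2 = 2.05)."""
    rng = random.Random(seed)
    c1 = c2 = 2.05
    phi = c1 + c2
    chi = 2.0 / abs(2.0 - phi - (phi * phi - 4.0 * phi) ** 0.5)  # about 0.7298
    x = [rng.uniform(lo, hi) for _ in range(n_particles)]
    v = [0.0] * n_particles
    best_x = x[:]                          # personal best positions
    best_c = [cost(xi) for xi in x]        # personal best costs
    g = min(range(n_particles), key=best_c.__getitem__)
    g_x, g_c = best_x[g], best_c[g]        # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            v[i] = chi * (v[i]
                          + c1 * rng.random() * (best_x[i] - x[i])
                          + c2 * rng.random() * (g_x - x[i]))
            x[i] = min(max(x[i] + v[i], lo), hi)   # keep the particle inside the bounds
            c = cost(x[i])
            if c < best_c[i]:
                best_x[i], best_c[i] = x[i], c
                if c < g_c:
                    g_x, g_c = x[i], c
    return g_x, g_c


# Synthetic speedups generated from Amdahl's model with f = 0.9 (hypothetical data).
samples = [(p, amdahl(0.9, p)) for p in (1, 2, 4, 8, 16, 24)]
f_hat, err = pso_constriction(lambda f: mse(f, samples), 0.0, 1.0)
print(f"fitted f = {f_hat:.4f}, MSE = {err:.2e}")
```

Fitting a multi-parameter model would follow the same pattern with vector-valued positions and per-parameter bounds.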
To vary the ratio between processor frequency and memory frequency, we changed the processor's frequency while the operating frequency of the memory system was kept at a fixed value.
The measurements were taken on a dual-socket shared memory platform with 2× Intel(R) Xeon(R) CPU E5-2680 v3, 12 cores at 2.50 GHz with hardware multi-threading disabled, and 30 MB shared L3 cache. The L1 and L2 private caches have 64 KB and 256 KB, respectively. The operating processor core frequencies ranged from 1.2 GHz to 2.5 GHz, in steps of 100 MHz. The number of cores ranged from 1 to 24, in unit steps, except for some applications that have the number of cores limited to a power of two.
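The configuration space described above can be enumerated with a few lines of Python. The variable names are illustrative; frequencies are kept in MHz as integers to avoid floating-point drift when stepping.

```python
# Processor frequencies from 1.2 GHz to 2.5 GHz in 100 MHz steps.
frequencies_mhz = list(range(1200, 2501, 100))

# Core counts from 1 to 24, and the subset available to applications
# that only run with a power-of-two number of cores.
core_counts = list(range(1, 25))
pow2_core_counts = [p for p in core_counts if p & (p - 1) == 0]

# One measurement configuration per (frequency, core count) pair.
configs = [(f, p) for f in frequencies_mhz for p in core_counts]
print(len(frequencies_mhz), len(core_counts), len(configs))  # 14 24 336
print(pow2_core_counts)  # [1, 2, 4, 8, 16]
```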
A Python 3 library was developed to implement the PSO algorithm and the utility methods to fit the models, to store the collected data, and to plot the graphs of the experiments performed in this paper. The library's repository also contains text files with information on measurements, execution metadata, the model parameters and the respective modeling errors for all experiments.
In Section 4.2, we assessed Amdahl's and the proposed model's accuracy by fitting them to each application using all measurements available to compute the MSE values.
In Section 4.3, we investigate how the accuracy of these models and the accuracy of an unstructured machine learning model vary according to the amount of information used to construct them.

Model accuracy
The accuracy of Amdahl's model and of the proposed model is summarized in Table 1 for all applications in terms of MSE. The table also shows the number of measurement points available for each application. Each measurement point represents a configuration of frequency and number of cores and corresponds to the median of 10 runs of an application. The MSE columns in Table 1 show that the results of the proposed model are considerably better than those of Amdahl's model, with the proposed model always scoring better or the same. The application with the most similar MSE values is "splash2x-lu-cb", for which the proposed model was only 0.5% more accurate than Amdahl's model. At the other extreme, "splash2x-water-spatial" showed the largest difference, with an MSE 90.24% better for the proposed model. On average, the proposed model was 42.40% more accurate than Amdahl's model considering all modeled applications.
To better present the ability of the proposed model to describe the speedup features of parallel applications correctly, we have selected a few applications for a more detailed analysis. For example, PARSEC Dedup, a workload that uses "deduplication" to compress a data stream [BKSL08], presents small differences in the MSE values of the two models. This application is hard to model because of abrupt speedup variation due to workload imbalance among threads [SR16]. Nevertheless, the proposed model improves on Amdahl's accuracy and accomplishes its task of modeling access-delay limitations by tilting speedups down for larger numbers of cores and larger φ ratios, as shown in Fig. 3b. The model manages to capture the angle of the speedups along the frequency axis, which represents the φ ratio. The proposed model also presents a better fit for a smaller number of cores, with a steeper slope enabled by the variable number of memory instructions in (16), which allows the modeling of the effect of overcoming cache size limitations.

For the PARSEC x264 application, an H.264/AVC video encoder, the proposed model reduces the MSE by one order of magnitude. Fig. 4b shows how the proposed model surface is very close to the scatter plot of the measurements. It captures the super-linear speedup that occurs with this application because the $m_2$ term in (16) allows the fraction of memory instructions $\mu_p$ to decay as the number of cores increases.

Fig. 5 presents the models for the SPLASH-2 Radiosity application, which computes the equilibrium distribution of light in a scene [WOT+95]. One of the computational characteristics of this algorithm is a large number of memory instructions and, therefore, it is an appropriate case study to demonstrate the proposed model's ability to capture the memory-wall effect on speedups. As in the previous applications, the proposed model presents a much better fit than Amdahl's model. Fig. 5b shows how the proposed model captures the speedup's slope, which increases as processor frequency decreases. The model also captures the abrupt saturation that occurs when speedups hit the memory wall.

For the SPLASH-2 Water Spatial application, which computes the forces that occur over time on a system of water molecules, Amdahl's model failed to capture the super-linear speedup behavior, yielding the worst MSE among all applications, as Fig. 6 illustrates. The proposed model presents a better fit, although it underestimates speedups at lower frequencies. Nevertheless, its accuracy is more than 90% better.

Accuracy versus the number of measurements
The results of the previous section were obtained using all available measurements for all configurations of processor frequency and the number of cores. In most cases, each application was executed on 336 different configurations (14 different frequencies and 24 different numbers of cores). For practical scenarios, using as few measurements as possible is desirable to reduce the modeling overhead in terms of the use of computational resources and energy consumption.
In this section we study how the use of fewer sampling points affects model accuracy. With that we intend to support two claims: the proposed model can achieve reasonable accuracy even for a small number of measurements; and the number of measurements required for reasonable accuracy is much smaller than that required for unstructured models, such as those based on machine learning.
To support the former claim, we observed the accuracy of the models when fitted using different numbers of measurements, starting from only 4 measurements and then doubling this number several times until reaching the closest power of two below the total number of available measurements for each application. To support the latter claim, we used a machine learning technique called Support Vector Machine Regression (SVR) [SS04] to model the applications using the same inputs used to fit the analytical models. Full details of the experiments can be found in the open-source repository mentioned earlier. In the following, we describe the methodology used to evaluate accuracy and variance for the three models under analysis: Amdahl's model (5), fitted with PSO; the proposed variable-delay model as given in (15); and the SVR model.

For each number of samples, all measurement data were divided into a training or fitting set and a test set. The test set was always the remaining set of samples after removing the samples used to train or fit the models. The training or fitting for a given number of samples was repeated 100 times, each time using a different set of random samples. All reported Mean Squared Errors (MSEs) are the average of the MSE values of all 100 repetitions calculated using only the corresponding test sets. Fig. 7 illustrates the procedure used to compute the median of the MSE values for each set of 100 repetitions. The PSO method used 200 particles limited to 100 iterations to fit the analytical models. The minimum and maximum limits of the model parameters were set to 0.0 and 1.0 for f, $m_1$ and $m_2$, and to 0.0 and 10.0 for k. For the SVR model we used the implementation of the Scikit-learn Python module [PVG+11a]. The hyper-parameters of the Radial Basis Function (RBF) kernel [PVG+11b] used in the SVR were tuned using a 3-fold cross-validation with a grid search that was repeated for each new set of random measurements. The search ranges for the error penalty parameter C and the kernel coefficient γ were C = {100, 1000} and γ = {$10^{-5}$, $10^{-4}$, $10^{-3}$, $10^{-2}$, $10^{-1}$, 1.0}.

Fig. 9 summarizes all MSE results for each application using different numbers of measurements. The horizontal axis is in logarithmic scale and holds the number of sample measurements used to fit or to train the models: 4, 8, 16, 32, 64, 128, and 256 samples. Some applications restrict the number of cores that can be used and thus have fewer data points in the plots. For example, PARSEC Fluidanimate is limited to run only with numbers of cores that are a power of two. The last data point in the plot is always the power of two immediately below the total number of measurements available for each application.

Table 2 shows the time spent to model the speedups of each application using the proposed and the SVR models. The values reported for the proposed model refer to the number of points at which the accuracy of the proposed model surpasses the accuracy of Amdahl's model. For example, for the Canneal application, the proposed model shows better results when the training set size was at least 16 points. On the other hand, the values reported for the SVR model refer to the number of points at which the SVR model achieves higher accuracy than the proposed model. In this case, for Canneal, SVR performs better only once 256 points are used for training. The table shows that the difference in time and, proportionally, in energy consumption between both models can often be around one order of magnitude. On average, considering all applications, the SVR needed 293.27% more time to obtain better results than the proposed model.
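The sub-sampling procedure described above (fit on a random subset, test on the remaining points, repeat, then aggregate) can be sketched with the standard library alone. Several pieces are assumptions made for illustration: the data are synthetic, only Amdahl's model is fitted, the fit uses a brute-force grid over f as a stand-in for PSO, and the repetitions are aggregated by their median.

```python
import random
import statistics


def amdahl(f, p):
    """Amdahl's speedup for parallel fraction f on p cores."""
    return 1.0 / ((1.0 - f) + f / p)


def mse(f, points):
    """Mean squared error of Amdahl's model on (cores, speedup) pairs."""
    return sum((amdahl(f, p) - s) ** 2 for p, s in points) / len(points)


def fit_f(points, steps=200):
    """Stand-in for PSO: brute-force search of f on a uniform grid over [0, 1]."""
    return min((i / steps for i in range(steps + 1)), key=lambda f: mse(f, points))


def sample_sizes(total, start=4):
    """4, 8, 16, ... up to the largest power of two strictly below `total`."""
    sizes, n = [], start
    while n < total:
        sizes.append(n)
        n *= 2
    return sizes


def median_test_mse(data, n_fit, repetitions=100, seed=0):
    """Fit on n_fit random points, score on the rest, and take the median over repetitions."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repetitions):
        shuffled = rng.sample(data, len(data))       # shuffled copy of the data
        fit_set, test_set = shuffled[:n_fit], shuffled[n_fit:]
        errors.append(mse(fit_f(fit_set), test_set))
    return statistics.median(errors)


# Synthetic measurements: Amdahl with f = 0.95 plus small Gaussian noise, for 1..24 cores.
noise = random.Random(1)
data = [(p, amdahl(0.95, p) + noise.gauss(0.0, 0.05)) for p in range(1, 25)]
for n in sample_sizes(len(data)):
    print(n, median_test_mse(data, n))
```

For an application with 336 measurements, `sample_sizes(336)` yields 4, 8, 16, 32, 64, 128 and 256, matching the sample counts on the horizontal axis described above.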
The main behavior observed in Fig. 8 and Fig. 9 is that the analytical models obtain better results as they use more measurements for modeling until they reach a plateau. Another important observation is that all analytical models have better accuracy for smaller training sizes than the SVR model. Although the SVR model is generally more accurate for sets of measurements with more than 128 samples, the proposed model was overall better for the smaller sets, except for sizes 4 and 8, for which Amdahl's model scored best in many cases.
The overall mean of the median MSE and standard deviation values of the three models across all applications according to the size of the sample set used in the modeling is depicted in Fig. 10 and Fig. 11.
In contrast to the machine learning model, the architecturally-inspired models require only a few executions of the application to provide reasonably good predictions of their speedups in configurations that were not previously assessed. This demonstrates an important advantage of these models, which allows an estimation of application performance for unseen configurations of a given architecture with reduced overheads of time and energy. On the other hand, if more sampling points are available, SVR provides better accuracy at the cost of a higher overhead.

Related Work
Inspired by earlier analytical models, such as [Amd67,Gus88,SN93], many more recent models attempt to better capture the behavior of application and architecture features in order to describe parallel speedups more precisely. None of them, however, considers the effect of the memory wall on the modeled speedups, and no hardware or simulation validation was presented to confirm their results.

Table 2. Time spent to collect application measurements at the specific number of points for each of the proposed and SVR models.
Other analytical models for multi-core architectures consider the variations in parallel speedups caused by variations in the problem or input size, including the modeling of the parallel overhead [OFS+18] or not [NSS15]. The parallel overhead was also modeled together with the parallel speedup for distributed parallelism in [HH17]. Similar to our work, these studies also validated the models using execution time measurements, but no feature was associated with the effect of the memory wall.
The work of Liu and Sun [LS17] combines the limitations related to the finite size of the memory [SN93] with memory access concurrency [SW14] to provide a speedup model that can be used for multi-core design space exploration. Although this model contains elements that relate to our data-access delay speedup model, the authors focus on chip design and perhaps, for this reason, do not explore the effects of frequency variations on speedups.
Therefore, to the best of our knowledge, this work is the first to explore this effect. For this reason, the only model mentioned in this section that we used for comparison was Amdahl's original model, as many of the other works did. Moreover, since those models differ from Amdahl's in aspects that were kept fixed in our experiments, such as the problem size and architectural features like the memory hierarchy and the amount of memory available, other comparisons would not be relevant to this study.

Conclusions
We have presented a new modeling approach for estimating speedups of parallel applications that are subject to the limitations of the memory wall. The proposed modeling considers variations in the data-access delay of the main memory when the number of cores increases and when the processor's or the memory's operating frequency changes, thereby capturing the effect of changing the ratio between the processor's and the memory's frequencies. To the best of our knowledge, this behavior was not described by previous analytical speedup models.
Several hardware experiments presented in this paper validate the ability of the proposed models to describe the memory wall behavior in many different applications.
Our analysis shows that reducing processor frequency reduces the adverse effect of the memory wall on parallel speedups, suggesting that there could be an optimal processor frequency for each number of cores used to run a given application. Therefore, we argue that this work is not a pessimistic view of multicore scalability. Instead, it shows that the race toward single-core performance under the influence of Amdahl's Law has perhaps obfuscated a more efficient way to match processor and memory frequencies for parallel applications. That is undoubtedly true if the focus is energy efficiency, as such models could be applied, for example, to devise better Dynamic Voltage and Frequency Scaling (DVFS) schemes for the Internet of Things [GXdSE17], data centers [PPZ+16], and high-performance computing [SFG+18].
Ideally, these new DVFS schemes may also consider the number of cores used by the application, such as in [DSDMD18, LCB16]. To be practical for this purpose, the speedup models need to be able to predict performance at non-visited configurations with the smallest possible number of measurements. In this sense, we showed that the proposed model can reach, with about a dozen measurements, a level of accuracy that Support Vector Regression can only reach with hundreds of measurements. On average, our modeling presented higher accuracy than Amdahl's model when using more than 8 random measurements, and higher accuracy than Support Vector Regression when using 128 random measurements or fewer. The standard deviation of our modeling was better than that of Amdahl's model for all numbers of measurements, and better than that of Support Vector Regression for 64 random measurements or fewer.
In contrast with ML speedup models, the proposed model holds an inherent mapping of the application features, such as the rate of memory versus processor instructions and the values of the parallel and serial fractions of the code, which is often relevant to software and hardware development. In turn, machine learning schemes, such as Support Vector Regression, work as black boxes in which the relations between model parameters and application behavior are hard to infer. Additionally, evaluating analytical models is faster, which makes them suitable for use in on-line performance and/or energy optimization schemes.
Despite the many existing models for parallel speedups, the practical use of these models requires both better generalization and a lower fitting overhead. In this work, we have made contributions to both aspects, but there is still room for further improvement. For example, to make the model more general, the modeling of problem size could be included. To reduce fitting overhead, devising a heuristic to choose the initial measurements might work better than random sampling, as observed in [Sen16]. For on-line fitting, increasing the complexity of the models as the number of measurements increases might also reduce fitting overhead. Extending this approach to speedup models for heterogeneous systems [BSVXdS15] is also promising, as the use of these systems has grown substantially in recent years.