SPARC: Statistical Performance Analysis With Relevance Conclusions

The performance of one computer relative to another is traditionally characterized through benchmarking, a practice occasionally deficient in statistical rigor. The performance is often trivialized through simplified measures, such as measures of central tendency, but doing so risks losing perspective on the variability and non-determinism of modern computer systems. Authentic performance evaluations are derived from statistical methods that accurately interpret and assess data. Methods that currently exist within performance comparison frameworks are limited in efficacy: statistical inference is either overly simplified or avoided altogether. A prevalent criticism from computer performance literature suggests that the results from difference hypothesis testing lack substance. To address this problem, we propose a new framework, SPARC, that pioneers a synthesis of difference and equivalence hypothesis testing to provide relevant conclusions. It is a union of three key components: (i) identifying either superiority or similarity through difference and equivalence hypotheses, (ii) a scalable methodology (based on the number of benchmarks), and (iii) a conditional feedback loop from test outcomes that produces informative conclusions of relevance, equivalence, trivial, or indeterminant. We present an experimental analysis characterizing the performance of a trio of RISC-V open-source processors to evaluate SPARC and its efficacy compared to similar frameworks.


I. INTRODUCTION
Benchmarking is a conventional practice in the computing domain for assessing a computer's performance relative to another. A standard set of representative programs, covering a wide range of functionality, is executed in order to capture performance metrics. But the resulting metrics often lack sufficient statistical rigor for extensive analysis. A geometric mean, arithmetic mean, or performance ratio is reported and accepted at face value without indication of the sample distribution or a confidence level. This practice promotes misleading performance evaluations that permeate the computing industry. Measures of central tendency have appropriate uses, but in some circumstances thorough statistical analysis is needed for meaningful performance evaluation.
The Hierarchical Performance Testing (HPT) framework in [1], VarCatcher framework in [2], and methodology in [3] highlight the complexity of conducting a robust analysis and the lack of statistical rigor surrounding traditional computer performance comparisons. While [1] relies on difference hypothesis testing with non-parametric statistics, [2] and [3] cite the lack of relevant information at the conclusion of hypothesis testing as motivation for their respective custom frameworks. The challenge is developing a methodology that relies on fundamental statistical inference common across fields of research, is simple to customize based on user requirements, and provides results relevant to the performance, rather than a custom software framework. To achieve this, we address the limitations of HPT with respect to hypothesis testing and model a new framework that improves the efficacy of their method.
This paper forces a clear distinction between two ideas that are often pooled incorrectly in hypothesis testing: statistical significance and practical relevance. Significance is the ability of our statistical test to detect an effect size [4] and is correlated with the type of test used. In difference hypothesis testing [5], failing to reject $H_0$ (i.e. lack of significance) does not imply lack of effect. Conversely, a difference hypothesis test that rejects $H_0$ (i.e. detects significance) expresses nothing about the practical relevance of the result. A statistical framework that only uses difference hypothesis testing is limited to identifying changes and does not address conditions of equivalence, or similarity between population samples within a margin.
To illustrate the limitation of difference tests, consider an exploratory data study to evaluate the performance of a security algorithm and its impact on the Rocket RISC-V processor. We instantiated Rocket on a field programmable gate array (FPGA) and collected performance metrics. The experiment was performed with and without the algorithm enabled, 30 times each, and execution times were recorded. For statistical analysis, we used the HPT framework on the results to assess suitability towards the larger RISC-V processor performance evaluation conducted later in the text. We provide a density plot of the two performance score distributions in Fig. 1. Hereinafter, we limit decimal precision to three digits with rounding for display purposes only; actual experiment calculations are conducted without rounding. The difference hypothesis test found a statistically significant difference between the two distributions, as shown in the figure. The median execution time of the program is 263.287 seconds with the algorithm enabled and 263.677 seconds with it disabled, a percentage difference of 0.148% between them. But the practical relevance of a 0.148% difference in our application was minuscule. We would have concluded that the two execution times are approximately equivalent. Conducting only a difference hypothesis test excluded a condition in which both distributions would be considered equivalent.
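The arithmetic behind this example can be sketched in a few lines. The two medians are the values reported above; the choice of the disabled run as the baseline for the percentage and the ±5% margin used for the ratio check are assumptions made only for illustration.

```python
# Median execution times reported in the text (seconds).
median_enabled = 263.287
median_disabled = 263.677

# Assumption: the difference is expressed relative to the disabled (baseline) run.
pct_diff = abs(median_disabled - median_enabled) / median_disabled * 100

# Ratio of the two medians, checked against a hypothetical +/-5% equivalence margin.
ratio = median_enabled / median_disabled
within_margin = 0.95 <= ratio <= 1.05

print(f"percentage difference: {pct_diff:.3f}%")
print(f"median ratio: {ratio:.3f}, within margin: {within_margin}")
```

A statistically significant difference test and a ratio well inside the margin can hold at the same time, which is the situation a difference test alone cannot express.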
This analysis highlights another key limitation of difference tests: the constructed null hypothesis is illogical and a difference is always detected with sufficient samples [6], [7]. In our analysis, the null and alternate hypotheses tested either a 0 difference between the two continuous response variables, or a difference detected, respectively. The test is structured on the assumption that $H_0$ is true; if the probability of the observed test statistic under that assumption is low, then $H_0$ is rejected. But this structured argument for a point or exact null is a fallacy and has been debated for decades [6], [7], [8], [9], [10], [11]. The probability that a continuous random variable assumes any specific value is zero [12]. Likewise, with sufficient population samples the test will always detect a difference [13].
In embedded system performance evaluation, inadvertent data manipulation often occurs either due to rounding, or with an insufficient context of data output. Both can lead to incorrectly assuming that two response variables are equal or that the difference between them is 0 and affect the study. Returning to the example experiment, observe the original density plot with and without the algorithm enabled in Fig. 2. We graph the response variable density, execution time, which defaults to decimal precision of 5 based on the output software code. We overlaid the plots with a modified response variable density, by deliberately rounding data to decimal precision of 1. As shown in the figure, the characterization of our data distribution has altered significantly. Notably, the illustration fails to capture how altered data can proliferate into a difference hypothesis test. The differences to the nth decimal that once characterized our data are filtered out along with any insightful conclusions that could be derived. While it might seem obvious, we highlight the issue after encountering it in computer performance analysis research within the field. The aforementioned criticisms are not limited solely to computer performance evaluations, but to the field of null difference hypothesis testing in general.
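A minimal sketch of this effect, using hypothetical paired execution times (not data from our experiment) recorded at five decimal places and then rounded to one:

```python
# Hypothetical paired execution times (seconds) recorded at five decimal places.
enabled = [10.12345, 10.13721, 10.11802, 10.14456]
disabled = [10.10211, 10.11894, 10.10467, 10.12031]


def paired_differences(a, b, decimals=None):
    """Return a_i - b_i, optionally rounding both samples first."""
    if decimals is not None:
        a = [round(v, decimals) for v in a]
        b = [round(v, decimals) for v in b]
    return [x - y for x, y in zip(a, b)]


full = paired_differences(enabled, disabled)
rounded = paired_differences(enabled, disabled, decimals=1)

print("nonzero differences at full precision:", sum(d != 0 for d in full))
print("nonzero differences after rounding:", sum(d != 0 for d in rounded))
```

In this toy case every paired difference collapses to zero after rounding, so a test on the rounded data has nothing left to rank.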
To address the limitations illustrated above, we propose an improved statistical framework called SPARC. This appears to be the first computer performance analysis approach to combine difference and equivalence hypotheses tests and use the results to form four conclusions [14], [15] relevant to a computer performance evaluation under study. The main contributions of this paper are summarized below.
• Proposed a non-parametric framework, permitting analysis under distribution-free statistical tests, and developed with a straightforward procedure for implementation. Difference tests are conducted with a Wilcoxon Signed-Rank Test for paired computer performance observations to detect statistically significant differences between distributions. Subsequently, equivalence within a median tolerance is assessed for distributions that are statistically significant but practically irrelevant.
• Developed a methodology inspired by HPT [1]. Minimized the false positive error rate using a multiple hypotheses error correction. It provides scalability based on the number of benchmark programs executed without inflating the error rate.
• Implemented a conditional feedback loop in the SPARC framework that enhances analysis by discriminating between overpowered and underpowered performance evaluations.
• Evaluated the new methodology with a performance evaluation consisting of a trio of RISC-V softcore processors instantiated on a field programmable gate array (FPGA).
The remainder of this paper is organized as follows. Section II summarizes related work and introduces the motivating HPT framework. Section III provides key fundamentals of statistical analysis with respect to difference and equivalence hypothesis testing. In-depth procedures are listed for constructing an equivalence margin and conducting analysis with the framework. It also addresses limitations of non-parametric statistics in error correction and sample size estimation. In Section IV, an experiment is conducted between RISC-V softcore processors and the performance comparison is analyzed with SPARC. Section V concludes the text with final remarks and future work.

II. RELATED WORK
This section summarizes the Hierarchical Performance Testing (HPT) framework methodology published in [1], which provides a statistical analysis framework for comparing the performance between two computers. Additionally, we review research that models the distribution of computer performance data through clustering. The following section will then establish our methodology inspired by the HPT framework.

A. BENCHMARKS
There are several benchmark programs specifically developed to evaluate a computer system's performance, often bundled together as a suite of applications. The Standard Performance Evaluation Corporation (SPEC) [16] provides a popular example of a benchmark suite that can be compiled and executed on a variety of computer architectures. To compare two systems, one merely needs to execute a given benchmark on each system, after which the execution times can be compared.
In some cases, benchmark programs may be very specialized in order to test specific functionality of a system under test (SUT); examples include testing floating-point operations or integer multiplication. After the SUT completes a benchmark, performance metrics are reported as time-based or throughput. Often, they are developed as separate software applications rather than originating from a sole benchmark suite such as SPEC. Over time, users consolidate the applications into a suite that is suitable for their requirements.
An example is the benchmark suite, RV8, compiled for the RISC-V instruction set architecture used later in the text.

B. HPT FRAMEWORK
In [1], the authors developed the non-parametric HPT framework to promote statistically sound computer performance evaluations. The framework is a methodology using difference hypothesis tests to compare benchmark suite results between two computers to determine superiority. The authors reveal common errors made with respect to parametric and nonparametric statistics while conducting performance evaluations.
Chen et al. illustrate the improper use of parametric statistical tests, such as the t-test, on non-normally distributed computer performance data. If the data collected from a computer benchmark is not properly characterized prior to statistical analysis, it could be incorrectly assumed to be parametric instead of non-parametric. Without appropriate verification tests, an assumption about the underlying distribution of the data may contribute to a faulty analysis and a misleading conclusion of the comparison. They evaluate, using the t-test, a SPEC benchmark suite comparison that displayed a skewed non-normal distribution, which resulted in transforming the data to normality. The t-test concluded that the underperforming computer was superior, demonstrating the deficiency in assuming a distribution.
The Central Limit Theorem (CLT) is often used to characterize distributions as approximately normal given a large sample size [4]. Frequently, a minimum sample size of 30 or more is referenced in statistics to employ the CLT. However, this was disputed for computer performance distributions in [1] with an experiment consisting of 32 000 benchmark performance scores. The analysis reveals that a sample size of approximately 500 observations still deviated from normality, but could be sufficient to utilize the CLT. Executing a number of benchmarks within a suite, 500 times each, appears inefficient based on the inconsistency of the data.
Many sources of variability and non-determinism exist within a computer system and the complex layers of interactions it comprises, as discussed in [1]. Further, published performance evaluations routinely omit confidence intervals (CI), which provide a measure of the randomness of a variable and an accuracy estimate of observed data [12].
Performance evaluations often report a collection of mean completion times or relative speedups and declare one to be superior, with little, if any, documentation of statistical methods used in the comparison. While the mean completion time or speedup serves a purpose as a visual exploration of data, incorporating additional statistics provides insight into the origination, or population, of the sampled data. Such insight is fundamental in determining the accuracy of observations and conclusion. Excluding statistical analysis undermines the original intent behind the performance comparison.
Thus, the authors in [1] developed the non-parametric framework to promote statistically sound computer performance evaluations. The HPT framework is a methodology using hypothesis tests to compare benchmark suite results between two computers to determine superiority. The significance level, α, is chosen prior to conducting the hypothesis tests; standard suggestion is 0.05 for one-tailed or 0.10 for two-tailed hypothesis tests.
In order to analyze the performance between two computers on a suite of benchmarks, a series of steps comprising the HPT framework was outlined in [1]. We provide the following abridged procedure for reference, and build upon it later in the text. Suppose a benchmark suite is used that contains n benchmarks and each benchmark is repeated m times. Matrices $C_A = [a_{i,j}]_{n \times m}$ and $C_B = [b_{i,j}]_{n \times m}$ of performance scores must be constructed for both computers; row i represents the ith benchmark and column j represents the jth repeat of that benchmark [1].
For each benchmark, null ($H_0$) and alternative ($H_1$) hypotheses are tested for significance using a Wilcoxon Rank-Sum Test. If the results show statistical significance, reject $H_0$ that both computers are equivalent; else fail to reject $H_0$. After the Wilcoxon Rank-Sum Test is complete for all n benchmarks, assign to a new column the score representing the difference in medians for each benchmark with a significant result, or assign a 0 for insignificant results.
HPT concludes with a comprehensive hypothesis test consisting of $H_0$ of general equivalent performance or $H_1$ of general superior performance [1]. A Wilcoxon Signed-Rank Test is completed on the difference in median performance scores to either reject $H_0$ or fail to reject $H_0$ at the significance level.
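The per-benchmark scoring step of this abridged procedure can be sketched as follows. The function name `hpt_score_column` is hypothetical, the per-benchmark p-values are assumed to have been computed beforehand by a Wilcoxon Rank-Sum Test, and the sign convention (median of A minus median of B) is an assumption rather than something taken from [1].

```python
from statistics import median


def hpt_score_column(C_A, C_B, p_values, alpha):
    """Build the column later fed to the overall signed-rank test.

    C_A, C_B: lists of n rows, each holding the m repeats of one benchmark.
    p_values: one rank-sum p-value per benchmark (computed elsewhere).
    """
    scores = []
    for a_row, b_row, p in zip(C_A, C_B, p_values):
        if p <= alpha:
            # Significant benchmark: record the difference in medians.
            scores.append(median(a_row) - median(b_row))
        else:
            # Insignificant benchmark: record 0.
            scores.append(0.0)
    return scores


# Hypothetical scores for two benchmarks with three repeats each.
C_A = [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]]
C_B = [[2.0, 3.0, 4.0], [5.0, 5.0, 6.0]]
print(hpt_score_column(C_A, C_B, p_values=[0.01, 0.40], alpha=0.05))
```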

C. CLUSTER METHOD TO DESCRIBE UNDERLYING PERFORMANCE SCORE DISTRIBUTIONS
In [17], the authors established a clustering method to model distributions of computer performance metrics. Observed data from benchmarking was non-parametric and density plots indicated bimodal and multimodal distributions. They surmised that the non-parametric distributions are a Gaussian mixture, a combination of multiple Gaussian distributions of clustered multivariate data. The clustering method determines population estimation parameters that could be used with more powerful parametric statistical tests over non-parametric.

III. PROPOSED FRAMEWORK METHODOLOGY
In this section, we introduce our SPARC method, which incorporates equivalence tests and family-wise error correction associated with multiple hypotheses tests. It reduces Type I errors and supports various conclusions for relevant and practical results of a performance evaluation.

A. ELEMENTS OF RELEVANT STATISTICAL PERFORMANCE EVALUATION
A key element in any statistical experiment, including benchmarking, is designing the experiment such that results provide valid statistical information required for analysis. Design of experiments [18] is a field of study dedicated to this aim. Our methodology focuses primarily on non-parametric statistical tests after benchmarking data has been collected and assumes the experiment uses an appropriate design. But we address three essential elements for consideration prior to conducting any data collection: 1) standardized hypotheses notation; 2) family-wise error correction; and 3) sample size estimation. Error correction and sample size estimation are implicitly linked when considering multiple benchmarks for statistical analysis; the number of samples affects the significance of a statistical analysis and the significance is affected by the overall error rate for the evaluation.

1) STANDARDIZED HYPOTHESES NOTATION
Before we formally present the rationale behind equivalence tests, we provide a standardized hypotheses notation used throughout the rest of this paper. In the introduction, we discussed limitations of an analysis that uses difference hypotheses, which motivated the addition of equivalence tests. First, we introduce the term positivist theory, derived from [19], to describe difference hypothesis tests. That is, the null hypothesis of a difference test $H_0$ is often defined as the lack of an effect or no difference between effects and is tested against an alternative hypothesis $H_1$ of significant effect or difference [5]. Positivist theory simply denotes the $H_0$ and $H_1$ hypotheses of difference tests as $H_0^+$ and $H_1^+$, incorporating the + symbol to reflect testing for a significant effect. Likewise, we introduce the term negativist theory [19] to describe equivalence hypotheses that test for a lack of effect (i.e. equivalence). Negativist theory defines the equivalence hypotheses $H_0$ and $H_1$ as $H_0^-$ and $H_1^-$. We use the positivist and negativist theory notations $H_0^+$, $H_1^+$, $H_0^-$, and $H_1^-$ in this text to differentiate between difference and equivalence hypotheses.

2) MULTIPLE HYPOTHESIS ERROR CORRECTION
In the following, let $X = x_{i,1}, x_{i,2}, \ldots, x_{i,m}$ and $Y = y_{i,1}, y_{i,2}, \ldots, y_{i,m}$ for $i = 1, 2, \ldots, n$ denote independent samples of performance scores from Computer X and Computer Y on the ith benchmark, respectively. Each hypothesis test performed in a multiple evaluation experiment increases the probability of rejecting $H_0$ when $H_0$ is true (Type I error) in at least one test, defined as the Family-Wise Error Rate (FWER) [20]. In other words, in a family of related comparisons the false positive error rate increases [20]. The worst-case FWER for n total benchmarks tested at an $\alpha_n$ is:
$$\mathrm{FWER} = 1 - (1 - \alpha_n)^{\beta}$$
where $\beta$ is the number of benchmark tests plus an additional overall hypothesis of general performance.
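To give a sense of scale, the sketch below evaluates the family-wise error rate for independent tests that are each run at the same uncorrected level, using the standard relation 1 − (1 − α)^k for k tests. The function name and the example of eight benchmarks plus one overall hypothesis at α = 0.05 are illustrative assumptions.

```python
def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** num_tests


# Hypothetical example: eight benchmark tests plus one overall hypothesis,
# each conducted at an uncorrected alpha of 0.05.
fwer = family_wise_error_rate(0.05, 8 + 1)
print(f"FWER for 9 uncorrected tests: {fwer:.3f}")
```

Even a modest suite therefore carries a family-wise error rate far above the nominal 5% unless α is corrected.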
Using an appropriate error correction method, we can control the family-wise error in the performance evaluation while still providing statistically significant results [20]. Each hypothesis test used to analyze a benchmark increases the FWER and requires correction. There are two methods we introduce here, the Bonferroni Correction [21] and Holm-Bonferroni Correction [22]. Each α correction method has its advantages and disadvantages that should be considered depending on a study's requirements. In the RISC-V evaluation later in the text, we use the Bonferroni Correction.
There are two benefits for using the Bonferroni Correction. First, it is a simple correction applied to every test in our study and, second, it allows calculating confidence intervals across benchmark comparisons [22]. It is widely used but has also been criticized as overcorrecting α to reduce Type I errors and subsequently reducing the probability of detecting any significance [20]. The method to calculate an error corrected $\alpha_{New}$ is as follows:
$$\alpha_{New} = \frac{\alpha_{Old}}{n}$$
where n is the total number of benchmarks planned plus the overall hypothesis test and $\alpha_{Old}$ is the overall requested alpha (0.05 for one-tailed, 0.10 for two-tailed tests).
After $\alpha_{New}$ is calculated, the p-value of each benchmark hypothesis test is compared with $\alpha_{New}$ to either reject $H_0$ or fail to reject $H_0$:
$$\text{reject } H_0 \text{ if } p_n \le \alpha_{New}; \text{ otherwise fail to reject } H_0$$
where $p_n$ represents the p-value of the nth benchmark.
An alternative method that does not overcorrect α is the Holm-Bonferroni Correction [22]. The method corrects sequentially, calculating $\alpha_{New}$ for each p-value comparison. While it provides greater statistical power compared to Bonferroni, there is added complexity in determining confidence intervals based on a changing $\alpha_{New}$. We present the procedure as it pertains to our framework as an option if confidence intervals are not required. Let $p_n$ denote the p-value calculated after conducting the Wilcoxon Signed-Rank Test for the nth benchmark. Sort the p-values in ascending order such that $p_1 < p_2 < \cdots < p_n$, with i = 1, 2, . . . , n denoting the rank. Assign $\alpha_{New}$ based on the rank of each test until the first non-significant result is found (failed to reject $H_0$), at which point the correction is complete. Any further benchmark hypothesis tests are non-significant. The equation for this procedure is as follows:
$$\alpha_{New,i} = \frac{\alpha_{Old}}{n - i + 1}$$
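Both corrections can be sketched in a few lines. The function names and the example p-values are hypothetical; the Holm-Bonferroni step uses the standard threshold of the overall α divided by (n − i + 1) for the ith smallest p-value and stops at the first non-significant result.

```python
def bonferroni_reject(p_values, alpha_old):
    """Compare every p-value with the single corrected level alpha_old / n."""
    alpha_new = alpha_old / len(p_values)
    return [p <= alpha_new for p in p_values]


def holm_reject(p_values, alpha_old):
    """Sequentially compare sorted p-values with alpha_old / (n - i + 1)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    reject = [False] * n
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha_old / (n - rank + 1):
            reject[idx] = True
        else:
            break  # every remaining (larger) p-value is non-significant
    return reject


# Hypothetical p-values from four hypothesis tests.
p_values = [0.001, 0.020, 0.040, 0.011]
print("Bonferroni:", bonferroni_reject(p_values, 0.05))
print("Holm:      ", holm_reject(p_values, 0.05))
```

With these example values the single Bonferroni threshold of 0.0125 rejects two of the four nulls, while the sequential thresholds reject all four, which illustrates the difference in power described above.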

3) SAMPLE SIZE
In computer performance evaluations, determining the proper sample size is a fine balance between undersampling and oversampling. The significance (p-value) of each benchmark analysis is correlated with the sample size [23]. If an insufficient number of samples is collected from a benchmark, there is a risk of an underpowered test (i.e. one that does not provide a significant result because the p-value remains above α). If an overabundance of samples is collected, then the risk is an overpowered study that used resources inefficiently.
There are multiple ways to calculate sample sizes for a t-test statistic based on an effect size estimate, such as Cohen's D [24], if the underlying distribution is known or assumed. However, for non-parametric statistic tests we make no assumptions on the underlying distribution. The methods in [25], [26], however, illustrate how an estimated sample size can be determined for the Wilcoxon Signed-Rank Test if assumptions are made on the effect size and an unbiased estimator through a resampling process. We conclude there is merit in applying the techniques to a computer performance evaluation to reduce the number of benchmark repeats or increase power of the tests. At the same time, execution times are often non-deterministic which suggests resampling observations with prior data could affect the outcome or provide inaccurate sample size estimations. While there are no clear methods available for sample size estimation suitable for our framework, two of the resulting outcomes will report if a benchmark test was underpowered or overpowered.

B. EQUIVALENCE TESTING
Instead of testing the significance that performance scores from two computers are different, we introduce an approach called equivalence testing [27], [28]. In difference testing, we attempt to prove the alternative hypothesis $H_1^+$ of a significant statistical difference. If we fail to reject the null $H_0^+$ of no difference, we can only conclude there was a lack of evidence to reject $H_0^+$. We cannot conclude equivalence because it was not tested. By adding equivalence hypotheses tests to the framework, we have additional information to make inferences about a performance evaluation.
Equivalence testing is often found in clinical settings to assess whether the effects of two treatments or medications are within a predefined equivalence margin [27]. The burden of proof for equivalence resides in the alternative hypothesis $H_1^-$. An equivalence margin [−δ, δ] establishes the range in which two variables contained within are considered practically equivalent at δ. In our context, the equivalence margin renders two statistically significant but practically irrelevant performance score distributions as equivalent if the response variable is within the predefined [−δ, δ] interval.
One widely used method for equivalence testing is the Two One-Sided Tests (TOST) procedure in [29]. Choosing an appropriate equivalence margin δ for the TOST is paramount to a performance evaluation [27]; selecting a δ which is too stringent risks excluding practically equivalent performance scores and selecting a δ which is too broad risks false equivalence. The authors of [27] proposed either using past studies or pilot studies to establish a δ, but we consider this unsuitable for computer performance evaluations as preexisting data is either lacking or includes components specific to a system. Instead, we suggest tailoring it to the evaluation depending on the motivation and context of the study. As an alternative, an equivalence δ = 5% of the speedup ratio can be used between two computers on a benchmark, giving an equivalence margin of [0.95, 1.05]. The speedup ratio S is defined as:
$$S = \frac{\text{performance score of Computer X}}{\text{performance score of Computer Y}}$$

C. COMBINING DIFFERENCE AND EQUIVALENCE HYPOTHESES
Combining hypotheses tests for difference and equivalence leads to practical and relevant conclusions not possible individually. Hypothesis testing for difference supports conclusions for statistical significance but lacks conditions for practical irrelevance or equivalence. Conversely, equivalence testing supports conclusions on equivalent distributions but lacks conditions for substantial performance differences that are of interest. Therefore, the prevailing solution is a combination of difference and equivalence testing for practical relevance [14], [15]. The following outlines the procedures in our framework for combining the two types of tests for a relevant performance evaluation. Our method changes the Mann-Whitney (Wilcoxon Rank-Sum) Test in [1] to a Wilcoxon Signed-Rank Test for paired observations. With minor alterations, however, our procedures can still be applied to the Mann-Whitney Test. The RISC-V processors evaluated in the next section necessitated a paired non-parametric test.
Suppose we are evaluating two computers' performance on a benchmark suite consisting of n benchmarks, each repeated m times. Let $(x_i, y_i)$ for $i = 1, 2, \ldots, m$ be the ith pair of the m observations from Computer X and Computer Y on the nth benchmark. Construct a matrix $B_n = [x_{i,1}, y_{i,2}, r_{i,3}]_{m \times 3}$ for each of the n benchmarks. Let $r_i$ denote the pairwise ratio $x_i / y_i$ for $i = 1, 2, \ldots, m$ and $M_{X/Y}$ denote the median pairwise ratio of performance. We use the Wilcoxon Signed-Rank Test under the assumption that the ratios $r_i$ are continuous and symmetric around a common median θ = 1 [1]. Difference hypotheses for the two-tailed Wilcoxon Signed-Rank Test are defined as:
• $H_0^+$: the median performance score ratio $M_{X/Y}$ of Computer X to Computer Y on the nth benchmark is symmetric around θ = 1 (corresponding with no location shift from the benchmarks)
• $H_1^+$: the median performance score ratio $M_{X/Y}$ of Computer X to Computer Y on the nth benchmark is not symmetric around θ = 1
Conduct a Wilcoxon Signed-Rank Test with an α corrected for family-wise error to either reject $H_0^+$ or fail to reject $H_0^+$. For brevity, we omit the procedure as it is readily available online or in statistics textbooks. However, we illustrate the procedure in detail for equivalence within a margin.
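As a small illustration of the data layout described above, the sketch below builds the three-column table of paired scores and ratios for one benchmark and returns the median pairwise ratio. The function name `ratio_table` and the sample values are hypothetical.

```python
from statistics import median


def ratio_table(x, y):
    """Return rows (x_i, y_i, r_i) with r_i = x_i / y_i, plus the median ratio."""
    rows = [(xi, yi, xi / yi) for xi, yi in zip(x, y)]
    return rows, median(r for _, _, r in rows)


# Hypothetical paired performance scores for one benchmark (m = 3 repeats).
rows, median_ratio = ratio_table([2.0, 4.0, 9.0], [1.0, 2.0, 3.0])
print(rows)
print("median pairwise ratio:", median_ratio)
```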
We utilize the non-parametric TOST Wilcoxon Signed-Rank Test for equivalence procedure in [30] with a median ratio [31] δ chosen a priori. Two one-sided tests are conducted to determine if the performance score distributions within the margin [−δ, δ] are equivalent. Since we use a ratio performance, our equivalence margin becomes [1 − δ, 1 + δ]. Both tests must reject the null for equivalence to be established [15]. The upper bound equivalence 1 + δ and lower bound equivalence 1 − δ signed ranks are computed and tested separately.
The upper bound equivalence, 1 + δ, null ($H_{01}^-$) and alternative ($H_{11}^-$) hypotheses are defined as follows:
• $H_{01}^-$: the performance score ratio distribution $x_i / y_i$ on the nth benchmark is greater than or equal to the upper bound equivalence 1 + δ
• $H_{11}^-$: the performance score ratio distribution $x_i / y_i$ on the nth benchmark is less than the upper bound equivalence 1 + δ
The lower bound equivalence, 1 − δ, null ($H_{02}^-$) and alternative ($H_{12}^-$) hypotheses are defined as follows:
• $H_{02}^-$: the performance score ratio $M_{X/Y}$ on the nth benchmark is less than or equal to the lower bound equivalence 1 − δ
• $H_{12}^-$: the performance score ratio $M_{X/Y}$ on the nth benchmark is greater than the lower bound equivalence 1 − δ
Let $f_i = (x_i / y_i) - (1 + \delta)$ for $i = 1, 2, \ldots, m$ denote the pairwise ratio for the ith observation minus the upper bound 1 + δ for Computer X and Computer Y on the nth benchmark. Let $\psi_i$ denote the sign indicator of $f_i$ as:
$$\psi_i = \begin{cases} 1, & f_i > 0 \\ -1, & f_i < 0 \end{cases}$$
Rank the absolute values $|f_1|, \ldots, |f_m|$ in ascending order and let $R_i$ for $i = 1, 2, \ldots, m$ denote the rank of $|f_i|$. The product $R_i \psi_i$ denotes the signed rank of $f_i$. The test statistic, $W^-$, for 1 + δ is the sum of absolute values of negative ranks defined as:
$$W^- = \sum_{\substack{i = 1 \\ \psi_i = -1}}^{m} |R_i \psi_i|$$
where m denotes the number of benchmark observations. Similarly, let $g_i = (x_i / y_i) - (1 - \delta)$ for $i = 1, 2, \ldots, m$ denote the pairwise ratio for the ith observation minus the lower bound 1 − δ for Computer X and Computer Y on the nth benchmark. Let $\psi_i$ denote the sign indicator of $g_i$ as:
$$\psi_i = \begin{cases} 1, & g_i > 0 \\ -1, & g_i < 0 \end{cases}$$
Rank the absolute values $|g_1|, \ldots, |g_m|$ in ascending order and let $R_i$ for $i = 1, 2, \ldots, m$ denote the rank of $|g_i|$. The product $R_i \psi_i$ denotes the signed rank of $g_i$. The test statistic, $W^+$, for 1 − δ is the sum of absolute values of positive ranks defined as:
$$W^+ = \sum_{\substack{i = 1 \\ \psi_i = 1}}^{m} |R_i \psi_i|$$
where m denotes the number of benchmark observations. If (m < 6), determine the exact p-value from Wilcoxon Signed-Rank Test tables for a one-sided test with α separately for both $W^-$ and $W^+$.
If (m ≥ 6), the rank distribution is approximately normal [32]. Therefore, calculate the z-score as follows:
$$z = \frac{W - \frac{m(m+1)}{4}}{\sqrt{\frac{m(m+1)(2m+1)}{24}}}$$
where W is the test statistic $W^-$ or $W^+$.
The outcomes from the non-parametric TOST equivalence test and the difference test are utilized together to determine a conclusion on the performance comparison between Computer X and Computer Y on the nth benchmark. Table I lists the four possible outcomes: relevant difference, equivalence, trivial difference, and indeterminant.
After all n benchmarks have one of the four relevance testing outcomes presented above, an optional Wilcoxon Signed-Rank Test for difference can be conducted depending on the results. In the case that all test outcomes are equivalence, trivial difference, or a mixture of both, the relevance testing is complete. Tests with all indeterminant benchmark results would likely require either additional samples or experimental design changes. For any other test outcome cases, an optional test can still be conducted with procedures detailed further in the text. We provide recommendations for publishing the results following the optional test procedure.
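The equivalence procedure and the combination of outcomes can be sketched as below, under several stated assumptions: the standard normal approximation to the signed-rank statistic (mean m(m+1)/4, variance m(m+1)(2m+1)/24) is used for the p-values, tied absolute values receive average ranks, zero differences are dropped, and the mapping from test results to the four outcomes follows the usual reading of combined difference and equivalence tests (both significant: trivial difference; neither significant: indeterminant). All function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1  # ranks are 1-based
        pos = end + 1
    return ranks


def signed_rank_sum(diffs, sign):
    """Sum the ranks of |diffs| over entries whose sign matches `sign`."""
    diffs = [d for d in diffs if d != 0]  # assumption: zero differences are dropped
    ranks = average_ranks([abs(d) for d in diffs])
    return sum(r for r, d in zip(ranks, diffs) if d * sign > 0), len(diffs)


def upper_tail_p(w, m):
    """One-sided p-value for a large statistic via the normal approximation."""
    mean = m * (m + 1) / 4
    sd = math.sqrt(m * (m + 1) * (2 * m + 1) / 24)
    return 0.5 * math.erfc((w - mean) / sd / math.sqrt(2))


def tost_signed_rank(x, y, delta):
    """Return (p_upper, p_lower) for the bounds 1 + delta and 1 - delta."""
    ratios = [xi / yi for xi, yi in zip(x, y)]
    w_minus, m_upper = signed_rank_sum([r - (1 + delta) for r in ratios], -1)
    w_plus, m_lower = signed_rank_sum([r - (1 - delta) for r in ratios], +1)
    return upper_tail_p(w_minus, m_upper), upper_tail_p(w_plus, m_lower)


def classify(difference_rejected, equivalence_rejected):
    """Combine the difference result with the TOST result (both nulls rejected)."""
    if difference_rejected and equivalence_rejected:
        return "trivial difference"
    if difference_rejected:
        return "relevant difference"
    if equivalence_rejected:
        return "equivalence"
    return "indeterminant"


# Hypothetical scores: Computer X is consistently about 0.05% below Computer Y.
y = [100.05 + 0.1 * k for k in range(10)]
x = [100.00 + 0.1 * k for k in range(10)]
p_upper, p_lower = tost_signed_rank(x, y, delta=0.05)
equivalent = p_upper <= 0.05 and p_lower <= 0.05
print(classify(difference_rejected=True, equivalence_rejected=equivalent))
```

In practice the same corrected α would be used for the difference test and for each of the two one-sided tests; the uncorrected 0.05 above is only for illustration.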
We can employ the optional Wilcoxon Signed-Rank Test for difference to determine an overall general performance comparison on the benchmark suite. Let $R_i = M_{X/Y}$ for $i = 1, 2, \ldots, n$ denote the median ratio on the ith benchmark. For each benchmark not concluded as relevant difference, assign $R_i = 0$, exclude the benchmark from the test, and reduce n, the number of benchmarks in the sample, to n − 1 [12]. The benchmark is excluded because the assumption of continuous variables under the null in a Wilcoxon Signed-Rank Test does not hold. The original family-wise error corrected α calculated prior to the evaluation remains unchanged to account for multiple hypotheses tests. Using the same procedures in the text above, the difference hypotheses for general performance comparison for a one-tailed Wilcoxon Signed-Rank Test are defined as:
• $H_0^+$: the benchmark suite performance score ratios of Computer X to Computer Y are symmetric around θ = 1 (corresponding with no location shift from the benchmark suite)
• $H_1^+$: the benchmark suite performance score ratios of Computer X to Computer Y are symmetric around θ > 1 (or θ < 1)
Our framework provides outcomes that are practical and relevant to the study or performance comparison under consideration. We demonstrate the procedures with an evaluation of three RISC-V processors in the next section. Finally, we suggest writing a conclusion for performance evaluations that includes the number of tests, outcomes (indeterminant, trivial difference, relevant difference, or equivalence), p-values, effect size in terms of location shift, α or confidence level, equivalence margin [−δ, δ], and justification for the equivalence margin.

IV. RISC-V EVALUATION
In this section, we evaluate three RISC-V softcore processors on an FPGA with SPARC and evaluate the analysis in comparison to HPT. The experimental configuration, captured performance metrics, and test assumptions are discussed prior to the evaluation.

A. EXPERIMENT SETUP
Our evaluation consists of three RISC-V softcore processors instantiated on an FPGA and a benchmark suite containing eight benchmarks to validate our methodology. The RISC-V processors Shakti [33], Ariane [34], and Rocket [35] are open-source softcore IP designs implemented in hardware description languages for synthesis on an FPGA. Each processor has its own system-on-a-chip design integrated within the build that includes, but is not limited to, L1 and L2 cache, a DDR3 memory controller, and a universal asynchronous receiver-transmitter. We instantiated them on a Xilinx Virtex-7 XC7VX485T FPGA on a VC707 Evaluation Kit, using the Xilinx Vivado Design Suite 2018.3. Both Ariane and Rocket had VC707 build configurations available, but Shakti required customization to port an existing FPGA build generation to the VC707. Shakti was customized by adding peripherals that are present on Rocket or Ariane but absent from the Shakti build; these additions did not affect the datapath of the processor. We implemented the softcores to operate at a 50 MHz clock rate, and the system configurations are listed in Table II.
We evaluated the processors with the RV8 benchmark suite, which consists of eight common benchmarks compiled for the RISC-V ISA; their descriptions are listed in Table III. For each processor, Vivado synthesizes the system-on-a-chip design and generates an FPGA-specific bitstream, which it loads onto the VC707. We compile the operating system, Linux version 5.3, as the main execution environment for the software and use a script to batch execute each benchmark 30 times, capturing the number of clock cycles needed to complete each run. The performance metric, number of clock cycles, is our response variable. In the experiment, we chose a sample size of 30 for each benchmark to examine the suitability of its distribution for applying the Central Limit Theorem in our analysis.

B. PERFORMANCE METRIC MEASUREMENT
Within each benchmark, we inserted code that captures the clock cycles with inline assembly through a RISC-V specific pseudo-instruction, rdcycle [36]. The code executes at program start and at program completion to calculate the number of elapsed clock cycles, which is then divided by the clock rate to derive the execution time L as follows:
$$L = \frac{\text{clock cycles}}{\text{clock rate}}$$
L is used to calculate the speedup ratio S between pairwise comparisons, defined as:
$$S = \frac{L_X}{L_Y}$$
where $L_X$ and $L_Y$ are the execution times of the two processors being compared. We use the speedup ratio to abstract out units of time and identify performance shifts that occur between processor comparisons.
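The conversion from cycle counts to execution time and speedup ratio can be sketched as follows. This is a minimal illustration: the function names and the sample values are hypothetical, the 50 MHz clock rate is taken from the experiment setup, and the orientation of the ratio (Computer X over Computer Y) is an assumption chosen to match the $M_X / M_Y$ notation used for the median speedup ratio.

```python
CLOCK_HZ = 50_000_000  # 50 MHz softcore clock rate from the experiment setup


def execution_time(start_cycles, end_cycles, clock_hz=CLOCK_HZ):
    """Execution time in seconds from two cycle-counter readings."""
    return (end_cycles - start_cycles) / clock_hz


def speedup_ratios(times_x, times_y):
    """Paired speedup ratios, one per run, for Computer X over Computer Y."""
    if len(times_x) != len(times_y):
        raise ValueError("paired observations must have equal length")
    return [lx / ly for lx, ly in zip(times_x, times_y)]


if __name__ == "__main__":
    # Hypothetical cycle counts for two runs on each processor.
    x = [execution_time(0, 100_000_000), execution_time(0, 150_000_000)]
    y = [execution_time(0, 50_000_000), execution_time(0, 100_000_000)]
    print(speedup_ratios(x, y))
```

Because the ratio is dimensionless, the same code applies whether the inputs are seconds or raw cycle counts, provided both processors run at the same clock rate.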

C. SPARC FRAMEWORK SPECIFIC
To show correct application of our equivalence tests and conclusions, we define a wide [1 − δ, 1 + δ] equivalence margin for analysis. Specifically, we use δ = 0.50 and [0.50, 1.50] as the primary equivalence margin for our softcore processor performance comparison. The bounds are purposefully large to illustrate the effect of equivalence tests and relevance outcomes on an analysis. There are two pairwise comparisons, Rocket to Ariane and Rocket to Shakti; a comparison between Ariane and Shakti was omitted for space considerations. For the Rocket to Ariane comparison, we conduct 8 difference hypothesis tests for location shifts, 8 equivalence hypothesis tests, and 1 test for the overall analysis. The tests are repeated for the Rocket to Shakti comparison, giving a total of 2(8 + 8 + 1) = 34 hypothesis tests across our two primary RISC-V evaluations. We set the overall evaluation error α = 0.05, which without correction translates to FWER ≤ 0.82518 using (1). We use the Bonferroni Correction method (2) to control the FWER while still allowing (1 − α/m) confidence intervals to be calculated. The error-corrected $\alpha_{New}$ = 0.0014706 is compared to each benchmark test p-value $p_i$ to decide whether to reject or fail to reject $H_0$.
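The two figures quoted above can be reproduced with a short calculation. The sketch assumes that (1) is the usual family-wise error bound, FWER ≤ 1 − (1 − α)^m, and that (2) is the Bonferroni correction, α/m; both forms reproduce the reported values, but the function names are illustrative.

```python
def fwer_upper_bound(alpha, m):
    """Upper bound on the family-wise error rate for m uncorrected tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m


if __name__ == "__main__":
    m = 2 * (8 + 8 + 1)  # two comparisons, each with 8 + 8 + 1 tests
    print(m)                                    # 34
    print(round(fwer_upper_bound(0.05, m), 5))  # 0.82518
    print(round(bonferroni_alpha(0.05, m), 7))  # 0.0014706
```

Under the same assumed formula, `fwer_upper_bound(0.05, 9)` evaluates to about 0.3698, which matches the 36.98% figure quoted for HPT in Section IV-E.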

D. RISC-V PERFORMANCE EVALUATIONS WITH SPARC FRAMEWORK
We present results to evaluate the efficacy of using the new methodology for performance comparisons, beginning with Rocket to Ariane. In Fig. 3, we illustrate quantile-quantile plots for each benchmark, with data points as paired-observation ratios against a theoretical normal distribution line. Visually, the plots for AES, Bigint, Norx, and Primes indicate non-normal distributions not suitable for parametric tests. The sharply curved data points around the normal line on AES and Norx are due to heavy tails, and the large gaps in data points on Bigint and Primes suggest bimodal distributions. In Fig. 4, benchmark densities are plotted and affirm multimodal distributions. We conducted Shapiro-Wilk Tests [37] and Kolmogorov-Smirnov Tests [38] for normality to affirm our visual analysis; the results are listed in Table VI and Table VII, respectively. We test each benchmark distribution against the Shapiro-Wilk and Kolmogorov-Smirnov $H_0^+$ that the distribution is normal; rejecting $H_0^+$ signifies that the distribution is not normal. The Shapiro-Wilk Tests found Dhrystone and [...]
Proceeding with the new relevance framework, we conducted difference and equivalence hypothesis tests on the speedup ratio; the results are listed in Table IV. The median speedup ratio $M_X / M_Y$ for Rocket ($M_X$) and Ariane ($M_Y$) shows a speedup to a faster time if greater than 1 and a slowdown if less than 1. The ratio equals 1 when the processors are equal in median execution time. Fig. 5 presents a bar graph plotting the median execution times of Rocket and Ariane listed in Table IV for each benchmark. Each benchmark difference test rejected $H_0^+$, the hypothesis that the performance score distributions are symmetric around θ = 1. In other words, there was a distribution location shift of the median speedup ratio. Further, the tests of equivalence at δ = 0.50 rejected $H_0^-$ in all but the AES and Dhrystone benchmarks.
The hypothesis test results, together with the four possible relevance choices from Section III-C, allow us to conclude that there is a relevant difference in median performance of the speedup ratio between the Rocket and Ariane RISC-V processors in 2 benchmarks, and a trivial difference in 6 benchmarks. We recommended earlier that the effect sizes be listed, either in the evaluation conclusion or in a table, as we do in Table IV.
In the performance evaluation of Rocket to Shakti, we present quantile-quantile plots in Fig. 6. The plots show non-normal distributions in all benchmarks except for Bigint and SHA512. In contrast to the quantile-quantile plots in Fig. 3, the distributions in Dhrystone, Norx, Miniz, and Qsort are highly skewed left and include heavy tails. The heavy tail in Dhrystone is caused by an outlier data point at −65 seconds. Similarly, outlier data points in Norx result in a heavy-tailed distribution and signify that parametric tests could be affected if they were used. Further, density plots in Fig. 7 illustrate non-normal benchmark distributions. Results from the Shapiro-Wilk Tests and Kolmogorov-Smirnov Tests for normality listed in Table VI and Table VII confirm that all benchmarks are non-normal except for AES, Bigint, and SHA512. We could perhaps employ a different statistical analysis, examining the outlying data points to determine whether they can be removed and then testing for normality again. This would require altering the α correction again to account for the additional hypothesis tests, and also adding justification for the removal of outlier data points. If the process were successful and produced normal distributions, parametric statistical tests could then be performed. We refrained from employing this technique because of the extensive time and experience required to distinguish between data points that are outliers and data points that indicate a problem with the experimental design.
Instead, SPARC was designed to test population medians without removing outlier data points. We therefore present results from the difference and equivalence hypothesis tests performed on the median speedup ratio of execution times for Rocket ($M_X$) and Shakti ($M_Y$) in Table V. The bar graph in Fig. 8 plots median execution times for Rocket and Shakti within each benchmark comparison. Again, the difference tests rejected $H_0^+$, indicating a distribution location shift of the median speedup ratio. In addition, the tests of equivalence at δ = 0.50 rejected $H_0^-$ in all benchmarks except for AES. Here, we can conclude that there is a relevant difference in the median speedup ratio performance between Rocket and Shakti on 1 out of 8 benchmarks and a trivial difference in the other 7. From the effect sizes in Table V, expressed as the speedup ratio of median performance, we also conclude that Rocket had a relevant difference of faster median speedup ratio over Shakti in only 1 of the 8 benchmarks.
In addition to the benchmark tests, we conducted a final Wilcoxon Signed-Rank Test for an overall relevant difference for Rocket to Ariane and for Rocket to Shakti; the results are listed in Table VIII. The earlier tests found trivial differences between Rocket and Ariane on 6 benchmarks, so we reduced the sample size by 6 because this test is only concerned with relevant differences. In the Rocket to Ariane general performance comparison, we fail to reject $H_0^+$, which indicates that there is not enough evidence to support a conclusion of a relevant difference in performance between Rocket and Ariane. A similar test was conducted for the Rocket to Shakti general performance comparison, with a similar outcome. Of the previous tests, only 1 (AES) resulted in a relevant difference between Rocket and Shakti. We therefore removed the trivial difference tests from consideration, as stated in Section III-C, and reduced the sample size by 7. With a relevant difference between Rocket and Shakti in only 1 of the 8 benchmarks, the test fails to reject $H_0^+$. The results are not unexpected: it is reasonable to assume that two processors with similar levels of performance would require more than 1 benchmark to reach a conclusion. The insight gained from the test of general performance between Rocket and Shakti is that we failed to reject the $H_0^+$ of equal performance, and a follow-on experiment with additional benchmarks would be required for further determination.
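A short calculation helps explain why a sample reduced to one or two benchmarks cannot reject the null. The sketch below assumes an exact one-sided signed-rank test with no ties or zero differences, where every sign pattern is equally likely under the null, so the most extreme outcome (all differences on one side) has probability 1/2^n. The function names are hypothetical, and the paper does not state whether an exact or approximate p-value was used for the reduced samples.

```python
def min_one_sided_p(n):
    """Smallest p-value an exact one-sided signed-rank test can return
    for n untied, non-zero differences (all differences share one sign)."""
    return 0.5 ** n


def min_benchmarks_to_reject(alpha):
    """Smallest n for which the exact one-sided test could fall below alpha."""
    n = 1
    while min_one_sided_p(n) >= alpha:
        n += 1
    return n


if __name__ == "__main__":
    alpha_new = 0.05 / 34  # Bonferroni-corrected level from Section IV-C
    print(min_one_sided_p(2))  # 0.25 (two remaining benchmarks)
    print(min_one_sided_p(1))  # 0.5  (one remaining benchmark)
    print(min_benchmarks_to_reject(alpha_new))
```

Under these assumptions, neither reduced sample can reach the corrected significance level, which is consistent with the suggestion that a follow-on experiment would need additional benchmarks.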

E. FRAMEWORK IN COMPARISON TO HPT
To compare the efficacy of SPARC with the HPT [1] framework, we consider some differences with respect to the benchmark statistical tests. As noted in Section IV-C, the observations are pairwise between processors and are more appropriate for the Wilcoxon Signed-Rank Test used in SPARC. In HPT, the Wilcoxon Rank-Sum Test is usable on pairwise comparisons, but some information common to both populations is lost. Tests suited to a difference of paired observations likely remove variability shared between the two observations, in contrast to the Wilcoxon Rank-Sum Test, which compares two independent observations. The test statistic is another key difference between SPARC and HPT. In SPARC, we specifically identify the speedup ratio between processors as the test statistic, whereas HPT designates an unspecified performance score. Again, the key disparity derives from using the Wilcoxon Signed-Rank Test or the Wilcoxon Rank-Sum Test and from how each framework classifies response variables as paired or unpaired. We perform the HPT tests according to the procedures in [1], but use a paired Wilcoxon Rank-Sum Test with the speedup ratio in order to provide a standardized comparison between frameworks.
In Section III-A, family-wise error was discussed, in addition to possible risks to a study if α is not corrected. Determining whether the data tested is within a family, and therefore affected by FWER, can be subjective. To illustrate the probability of making a Type I error, the FWER for SPARC and HPT is shown in Fig. 9. Although our evaluation used only 8 benchmarks, the FWER with HPT is 36.98% without an α correction. The HPT framework does not discuss multiple hypothesis testing or methods to correct α. We consider the omission accidental, and we purposefully discussed FWER in the SPARC procedures to remove ambiguity.

TABLE IX. HPT Framework Results for Wilcoxon Rank-Sum Tests in Both Comparisons
For HPT, two-tailed Wilcoxon Rank-Sum Tests were performed for each benchmark to determine whether Rocket and Ariane differ in median speedup ratio performance; the results are listed in Table IX. The median speedup ratios are unchanged from Table V, so we list only the test results. For each test, $H_0^+$ was rejected at α = 0.10, indicating a difference in benchmark speedup ratio performance between the two processors. Similarly, we performed the same tests for Rocket to Shakti, with results listed in Table IX. For each benchmark, $H_0^+$ was rejected, indicating a difference in speedup ratio performance between Rocket and Shakti.
Finally, a two-tailed Wilcoxon Signed-Rank Test is performed as HPT's general performance comparison across all benchmarks. The test was conducted twice, on Rocket to Ariane and Rocket to Shakti, with results listed in Table X. On both general [...]
In comparison to SPARC, the individual benchmark results from HPT illustrate the difference between each framework's concluding information. Specifically, in HPT the $H_0^+$ of every benchmark was rejected, compared to the 6 trivial difference results for Rocket to Ariane and 7 trivial difference results for Rocket to Shakti in SPARC. As indicated by the follow-on equivalence tests, the difference in performance was within the [0.50, 1.50] margin, and each such benchmark was subsequently removed from the general performance comparison. The SPARC framework provides a method to establish an equivalence margin that the study defines as similarly performing systems, whereas HPT only detects a difference. SPARC concluded that there was only a trivial difference in performance in the majority of the benchmarks for both comparisons, which resulted in a lack of evidence to support a difference in performance. Further, a difference will always be detected in HPT for evaluations similar to the example discussed in the introduction and illustrated in Fig. 3. The SPARC framework excels in conditions of similar performance or equivalence, and we are able to use the additional insights from SPARC to influence follow-on experimental design.

V. CONCLUSION
In this paper, the statistical framework SPARC is proposed for scalable and distribution-free performance evaluation of computers. SPARC identifies superiority or equivalence with hypothesis tests for each benchmark that conditionally result in one of four relevance conclusions. Through the application of an error correction method in SPARC, error inflation is reduced in multiple-benchmark scenarios. Our performance comparison of three RISC-V softcore processors on an FPGA indicated the efficacy of SPARC in relation to a similar framework. The additional insight from relevance conclusions enhances the study results and refines the discussion for further experimentation if required.