Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales

Recently, it was shown that most popular IR measures are not interval-scaled, implying that decades of experimental IR research used potentially improper methods, which may have produced questionable results. However, it was unclear if and to what extent these findings apply to actual evaluations, and this opened a debate in the community, with researchers taking opposing positions on whether this should be considered an issue (or not) and to what extent. In this paper, we first give an introduction to the representational measurement theory, explaining why certain operations and significance tests are permissible only with scales of a certain level. For that, we introduce the notion of meaningfulness, specifying the conditions under which the truth (or falsity) of a statement is invariant under permissible transformations of a scale. Furthermore, we show how the recall base and the length of the run may make comparison and aggregation across topics problematic. Then we propose a straightforward and powerful approach for turning an evaluation measure into an interval scale, and describe an experimental evaluation of the differences between using the original measures and the interval-scaled ones. For all the measures considered, namely Precision, Recall, Average Precision, (Normalized) Discounted Cumulative Gain, Rank-Biased Precision, and Reciprocal Rank, we observe substantial effects, both on the order of average values and on the outcome of significance tests. For the latter, previously significant differences turn out to be insignificant, while insignificant ones become significant. The effect varies remarkably between the tests considered but overall, on average, we observed a 25% change in the decision about which systems are significantly different and which are not.


Introduction
By virtue or by necessity, Information Retrieval (IR) has always been deeply rooted in experimentation, and evaluation has been a formidable driver of innovation and advancement in the field, as witnessed by the success of the major evaluation initiatives, namely the Text REtrieval Conference (TREC) 1 in the United States [46], the Conference and Labs of the Evaluation Forum (CLEF) 2 in Europe [36], the NII Testbeds and Community for Information access Research (NTCIR) 3 in Japan and Asia [76], and the Forum for Information Retrieval Evaluation (FIRE) 4 in India, not only from the scientific and technological point of view but also from that of economic impact [71].
Central to experimentation and evaluation is how to measure the performance of an IR system, and there is a rich body of IR literature discussing existing evaluation measures or introducing new ones, as well as proposing frameworks to model them [20,66]. The major goal is to quantify users' experience of retrieval quality for certain types of search behavior, such as users stopping at the first relevant document, or after the first ten results. Most of the measures proposed are based on plausible arguments and are often accompanied by experimental studies, also investigating how close they are to end-user experience and satisfaction [51,106,107]. However, little attention has been given to a proper theoretical basis for the evaluation measures, leading to possibly flawed measures and affecting the validity of the scientific results based on them, especially their internal validity, i.e. "the ability to draw conclusions about causal relationships from the results of a study" [25, p. 157]. A few years ago, Robertson [70] raised the question of which scales are used by IR evaluation measures, since they determine which operations make sense on the values of a measure, as originally proposed by Stevens [85]. Scales have increasing properties: a nominal scale allows for the determination of equality and for the computation of the mode; an ordinal scale allows also for the determination of greater or less and for the computation of medians and percentiles; an interval scale allows also for the determination of equality of intervals or differences and for the computation of mean, standard deviation, and rank-order correlation; finally, a ratio scale allows also for the determination of equality of ratios and for the computation of the coefficient of variation. Recently, Ferrante et al. [32,33] have theoretically shown that some of the best known and most used IR measures, like Average Precision (AP) or Discounted Cumulative Gain (DCG), are not interval scales.
As a consequence, we should neither compute means, standard deviations, and confidence intervals, nor perform significance tests that require an interval scale. Over the decades there has been much debate about Stevens's prescriptions [56,95,45,62], and this debate has also spread to the IR field, with Fuhr [40] suggesting strict adherence to Stevens's prescriptions and Sakai [75] arguing for a more lenient approach.
Our vision is that it is now time for the IR field to accurately investigate and understand the scale properties of its evaluation measures and their implications for the validity of our experimental findings. As a matter of fact, we are not aware of any experimental IR paper that regarded evaluation measures as ordinal scales, thus refraining from computing (and comparing) means; also, most papers using evaluation measures apply parametric tests, which should be used only from interval scales onwards. This means that improper methods have potentially been applied. Independently of one's stance in the above long-standing debate, the key question about IR experimental findings is: are we on the safe side or are we at risk? Are we in a situation like using a rubber band to measure and compare lengths? Are we facing a state of affairs where decades of IR research may have produced questionable results?
We do not have the answer to these questions but our intention with this paper is to lay the foundations and set all the pieces needed to have the means and instruments to answer these questions and to let the IR community discuss these issues on a common ground in order to reach shared conclusions.
Therefore, the contributions of the paper are as follows:
1. an introduction to the representational measurement theory [53,59,87], clearly explaining why (or why not) certain operations and significance tests should be permissible on a given scale and presenting the different stances in this long-standing debate;
2. an introduction to the notion of meaningfulness [29,67,69], i.e. the conditions under which the truth (or falsity) of a statement is invariant under permissible transformations of a scale. To the best of our knowledge, this concept has never been investigated or applied in IR, but it is fundamental to the validity of the inferences we draw;
3. a discussion and demonstration of further measurement issues, specific to IR and beyond the debate on permissible operations. In particular, we show how the recall base and the length of the run may make averaging across topics (or other forms of aggregate statistics) problematic, at best;
4. a proposal of a straightforward and powerful approach for turning an evaluation measure into an interval scale, by transforming its values into their rank positions. In this way, we provide a means for improving the meaningfulness and validity of our inferences, while still preserving the different user models embedded in the various evaluation measures;
5. an experimental evaluation of the differences between using the original measures and the interval-scaled ones, relying on several TREC collections. For all the measures considered, namely Precision, Recall, AP, DCG, Normalized Discounted Cumulative Gain (nDCG), Rank-Biased Precision (RBP), and Reciprocal Rank (RR), we observe substantial effects, both on the order of average values and on the outcome of significance tests. For the latter, previously significant differences turn out to be insignificant, while insignificant ones become significant.
The effect varies remarkably between the tests considered but overall, on average, we observed a 25% change in decisions about what is significant and what is not.
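The rank-transformation idea mentioned above can be sketched in a few lines. This is only an illustration, not the paper's actual implementation: the function name and scores are made up, and the sketch ranks the values actually observed, whereas the full approach (detailed later) works on the values a measure can assume.

```python
# Minimal sketch of turning raw measure values into rank positions, so
# that consecutive values become equally spaced (interval-like).
# Illustrative only: ties share the same rank, 1 = smallest value.
def to_interval_scale(values):
    """Map each value to the rank of its value among the sorted
    distinct values."""
    rank_of = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [rank_of[v] for v in values]

ap_scores = [0.10, 0.25, 0.25, 0.40, 0.90]
print(to_interval_scale(ap_scores))  # [1, 2, 2, 3, 4]
```

Note how the large gap between 0.40 and 0.90 collapses to a single rank step: only the order of the original values survives, which is precisely what makes the transformed values safe to average.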
The paper is organized as follows: Section 2 provides an overview of the representational theory of measurement, of the different types of scale, and of the notion of meaningfulness. Section 3 discusses in depth measurement and meaningfulness issues specific to IR. Section 4 briefly summarizes related work. Section 5 explains our methodology for transforming evaluation measures into interval scales. Section 6 introduces the experimental setup, while Section 7 discusses the results of the experiments. Finally, Section 8 draws conclusions and outlines future work.

Overview
The representational theory of measurement [53,87,59] is one of the most developed approaches to measurement, suitable for many areas of science ranging from physics and engineering to psychology. The basic idea is that real world objects have attributes which constitute their relevant features and induce a set of relationships among them; the set of objects E together with the relationships R^E_1, R^E_2, . . . among them comprises the so-called Empirical Relational System (ERS) E = ⟨E, R^E_1, R^E_2, . . .⟩. Then, we look for a mapping between the real world objects E and numbers N in such a way that the relationships R^E_1, R^E_2, . . . among the objects match the relationships R^N_1, R^N_2, . . . among numbers; the set of numbers N together with the relationships R^N_1, R^N_2, . . . constitutes the so-called Numerical Relational System (NRS) N = ⟨N, R^N_1, R^N_2, . . .⟩. More precisely, the representational theory of measurement seeks a homomorphism φ which maps E onto N in such a way that, for every relation R^E_i and all e_1, e_2, . . . , e_k ∈ E with (e_1, e_2, . . . , e_k) ∈ R^E_i, the numbers n_1 = φ(e_1), n_2 = φ(e_2), . . . , n_k = φ(e_k) ∈ N satisfy (n_1, n_2, . . . , n_k) ∈ R^N_i. The homomorphism φ is called a scale of measurement. Note that, in general, we seek a homomorphism and not an isomorphism because two different real world objects might be mapped into the same number.
The most typical example is length. Suppose the ERS E = ⟨E, ⪰, •⟩ is a set of rods with an order relationship ⪰ among rods and a concatenation operation • among them. If the attribute under examination is the length of a rod, we can map the ERS to the NRS N = ⟨ℝ⁺₀, ≥, +⟩ such that, for all e_1, e_2, e_3 ∈ E, it holds that e_1 ⪰ e_2 ⇔ φ(e_1) ≥ φ(e_2) and e_1 • e_2 ∼ e_3 ⇔ φ(e_1) + φ(e_2) = φ(e_3); that is, if a rod is longer than another, the number assigned to the first is bigger than the number assigned to the second, and the concatenation of two rods corresponds to the sum of the two numbers assigned to them.
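The two homomorphism conditions above can be checked on a toy example. The sketch below represents each rod directly by its length, so it is an illustration of the conditions rather than an empirical measurement; the rods and their lengths are assumptions.

```python
# Toy check of the rod homomorphism: the order relation among rods maps
# to >= among numbers, and concatenation maps to addition.
rods = {"a": 2.0, "b": 3.0, "c": 5.0}  # c is as long as a and b together

def phi(rod):
    """The scale of measurement: assign each rod its length in meters."""
    return rods[rod]

# Order preservation: b is longer than a, and phi reflects it.
assert phi("b") >= phi("a")
# Additivity: laying a and b end to end is equivalent to c.
assert phi("a") + phi("b") == phi("c")
print("order and concatenation are preserved by phi")
```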
The core of the representational theory of measurement is to seek a representation theorem and a uniqueness theorem for the scale of measurement in order to fully define it.
The representation theorem ensures that if the ERS satisfies given properties, it is possible to construct a homomorphism to a certain NRS. In the previous example, the representation theorem defines which properties the order relation ⪰ and the concatenation • have to satisfy in order to construct a real-valued function φ which is order preserving and additive. It is important to underline that the representational theory of measurement seeks "operations" among real world objects, e.g. we can put two rods side by side to order them or we can lay two rods end to end to concatenate them, and if these "operations" satisfy given properties they can be reflected in corresponding operations among numbers, where numbers are just a proxy of what happens among real world objects but are much more convenient to manipulate.
In general, given an ERS and an NRS, it is possible to create more than one homomorphism between them. For example, it is possible to express length using meters or yards, and both of them are legitimate scales for length. The uniqueness theorem is concerned with determining the permissible transformations φ → φ′ such that φ and φ′ are both homomorphisms of the given ERS into the same NRS. In our example, any transformation φ′ = αφ, α > 0, is permissible for length. Therefore, the uniqueness theorem guarantees that the "structure" of a scale of measurement is invariant to those changes in the numerical assignment which preserve the relationships.

Classification of the Scales of Measurement
Stevens [85] introduced a classification of scales based on their permissible transformations, described below.

Nominal scale
It is used when entities of the real world can be placed into different classes or categories on the basis of their attribute under examination. The ERS consists only of different classes without any notion of ordering among them and any distinct numeric representation of the classes is an acceptable measure but there is no notion of magnitude associated with numbers. Therefore, any arithmetic operation on the numeric representation has no meaning.
The class of permissible transformations is the set of all one-to-one mappings, i.e. bijective functions φ′ = f(φ), since they preserve the distinction among classes.
Example 1 (Nominal Scale). Consider a classification of people by their country, e.g. Germany, Greece, Italy, Spain, and so on. We could define the two following measurements:

φ(e) = 4 if Germany; 3 if Greece; 2 if Italy; 1 if Spain; · · ·
φ′(e) = 13 if Germany; −10 if Greece; 23 if Italy; 17 if Spain; · · ·

Both φ and φ′ are valid measures, which can be related by a one-to-one mapping. Note that even if φ looks ordered, there is actually no meaning in the associated magnitudes, and so it should not be confused with an ordinal scale. Moreover, even if it is always possible to operate with numbers, using φ and performing 4 − 3 = 1, which would correspond to Germany − Greece ≟ Spain, has no specific meaning, just as using φ′ and performing 13 − (−10) = 23, which would correspond to Germany − Greece ≟ Italy, in disagreement with the previous case.

Ordinal scale
It can be considered as a nominal scale where, in addition, there is a notion of ordering among the different classes or categories. The ERS consists of classes that are ordered with respect to the attribute under examination and any distinct numeric representation which preserves the ordering is acceptable. Therefore, the magnitude of the numbers is used just to represent the ranking among classes. As a consequence, addition, subtraction or other mathematical operations have no meaning.
The class of permissible transformations is the set of all monotonic increasing functions φ′ = f(φ), since they preserve the ordering.
Example 2 (Ordinal Scale). The European Commission Regulation 607/2009 [27] and the follow-up Regulation 2019/33 [28] set the following increasing scale to classify sparkling wines on the basis of their sugar content:
• pas dosé (brut nature): sugar content is less than 3 grams per litre; let us call this range s_0 = [0, 3];
• extra brut: sugar content is between 0 and 6 grams per litre; let us call this range s_1 = [0, 6];
• brut: sugar content is less than 12 grams per litre; let us call this range s_2 = [0, 12];
• extra dry: sugar content is between 12 and 17 grams per litre; let us call this range s_3 = (12, 17];
• sec (dry): sugar content is between 17 and 32 grams per litre; let us call this range s_4 = (17, 32];
• demi-sec (medium dry): sugar content is between 32 and 50 grams per litre; let us call this range s_5 = (32, 50];
• doux (sweet): sugar content is greater than 50 grams per litre; let us call this range s_6 = (50, 2000], where 2000 grams per litre is roughly the saturation of sugar in water, which is much higher than that of sugar in alcohol.
We can introduce two alternative ordinal representations φ and φ′ of the above wine scale, where φ is given by the maximum of each range while φ′ is given by the monotonic transformation φ′ = φ²:

φ(e) = 3 if pas dosé; 6 if extra brut; 12 if brut; 17 if extra dry; 32 if sec; 50 if demi-sec; 2000 if doux
φ′(e) = 9 if pas dosé; 36 if extra brut; 144 if brut; 289 if extra dry; 1024 if sec; 2500 if demi-sec; 4000000 if doux

As in the case of the previous Example 1, mathematical operations have no specific meaning, even if, especially in the case of φ, we may be tempted to perform operations like brut / extra brut = 12/6 = 2 to express statements like "brut may be twice as sweet as extra brut". However, such a statement cannot be expressed on the φ or φ′ scale; it actually comes from implicitly switching to the mass concentration scale of the solution, which is a ratio scale (see below) where the division operation would make sense. Addition and subtraction also have no meaning, so brut − extra brut = 12 − 6 = 6 is not a way to express statements like "brut may have 6 g/l of sugar more than extra brut", for the same reasons as above. We could perform operations such as sgn(φ(e_1) − φ(e_2)) or sgn(φ′(e_1) − φ′(e_2)), but this would be just a more involved way of expressing the order among categories, which is the only property guaranteed by ordinal scales.

Interval scale
Besides relying on ordered classes, it also captures information about the size of the intervals that separate the classes. The ERS consists of classes that are ordered with respect to the attribute under examination and where the size of the "gap" between two classes is somehow understood; more precisely, fundamental to the definition of an interval scale is that intervals must be equi-spaced. An interval scale preserves order, as an ordinal one does, and differences among classes have meaning, but not their ratios. Therefore, addition and subtraction are acceptable operations but not multiplication and division.
The class of permissible transformations is the set of all affine transformations φ′ = αφ + β, α > 0. Note that while ratios of classes φ(e_1)/φ(e_2) have no meaning on an interval scale, the ratio of differences among classes, i.e. the ratio of intervals, is allowed and invariant, since the α factors cancel and the β offsets subtract out.

Example 3 (Interval Scale). A typical example of an interval scale is temperature, which can be expressed on either the Fahrenheit or the Celsius scale, where the affine transformation F = (9/5)C + 32 allows us to pass from one to the other. When talking about temperature it does not make sense to say that 20 °C is twice as hot as 10 °C, i.e. multiplication and division are not allowed; note also that the division operation is not invariant to the transformation, since 20 °C / 10 °C = 2 but 68 °F / 50 °F = 1.36. However, it does make sense to say that the increase between 10 °C and 20 °C is the same as the increase between 20 °C and 30 °C, i.e. addition and subtraction are allowed; note also that the subtraction operation is invariant to the transformation, since 30 − 20 = 20 − 10 = 10 °C and, correspondingly, 86 − 68 = 68 − 50 = 18 °F. Central to the notion of temperature is the fact that the size of the "gap" has the same meaning all over the scale; indeed, 1 degree represents the same amount of thermal energy all over the scale, i.e. the gaps are equi-spaced.
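The temperature example can be verified numerically. The sketch below applies the Celsius-to-Fahrenheit conversion and shows that plain ratios change under the affine transformation while ratios of differences do not; the helper name is ours.

```python
# Check of interval-scale invariance under the affine map F = 9*C/5 + 32:
# ratios of values are NOT invariant, ratios of differences ARE.
def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return 9 * c / 5 + 32

c1, c2, c3 = 30.0, 20.0, 10.0
# Ratio of values changes with the scale:
print(c2 / c3)                  # 2.0
print(c_to_f(c2) / c_to_f(c3))  # 1.36
# Ratio of differences is preserved (alpha cancels, beta subtracts out):
print((c1 - c2) / (c2 - c3))                                  # 1.0
print((c_to_f(c1) - c_to_f(c2)) / (c_to_f(c2) - c_to_f(c3)))  # 1.0
```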

Ratio scale
It allows us to compute ratios among the different classes. The ERS consists of classes that are ordered, where there is a notion of "gap" between two classes and where the "proportion" between two classes is somehow understood. It preserves order and differences as well as ratios. Therefore, all the arithmetic operations are allowed.
The class of permissible transformations is the set of all similarity transformations φ′ = αφ, α > 0.
Example 4 (Ratio Scale). A typical example of a ratio scale is length, which can be expressed on different scales, e.g. meters or yards, which can all be mapped one into another via a similarity transformation. For example, to pass from kilometers (φ) to miles (φ′), we have the transformation φ′ = 0.62φ. Another example of a ratio scale is the absolute temperature on the Kelvin scale, where there is a zero element which represents the absence of any thermal motion.

Admissible Statistical Operations
Stevens moved a step forward and linked the notion of scale to the admissible statistical operations which can be carried out with that scale:
• Nominal scale: the only allowable operation is counting the number of items in each class, that is, in statistical terms, mode and frequency.
• Ordinal scale: besides the operations already allowed for nominal scales, median, quantiles, and percentiles are appropriate, since there is a notion of ordering.
• Interval scale: besides the operations already allowed for ordinal scales, mean and standard deviation are allowable, since they depend just on sum and subtraction.
• Ratio scale: besides the operations already allowed for interval scales, geometric and harmonic mean, as well as coefficient of variation, are allowable since they depend on multiplication and division.
These prescriptions originated several debates over the decades. Lord [56, p. 751] argued that "since the numbers don't remember where they come from, they always behave the same way, regardless" and so any operation should be allowed even on "football numbers", i.e. a nominal scale; Gaito [42] reinforced this argument by distinguishing between the realm of measurement theory, where Stevens's restrictions should apply, and the realm of statistical theory, where these restrictions should not be applied, since other assumptions, such as normal distribution of the data, are those actually needed. Townsend and Ashby [89] replied by showing cases where performing operations inadmissible for a given scale of measurement may mislead the conclusions drawn by statistical tests. O'Brien [68] discussed the types of errors introduced when using ordinal data to represent an underlying continuous variable, classifying them into pure transformation errors, pure categorization errors, pure grouping errors, and random measurement errors. Velleman and Wilkinson [95] summarized the previous debate and argued that once you are in the numerical realm every operation is admissible among numbers. Recently, Scholten and Borsboom [80] made a case for flaws in Lord's original argument; as a striking consequence, Lord's experiment would not be a counterargument to Stevens's restrictions but would rather comply with them. In a very recent textbook, Sauro and Lewis [78] firmly supported Lord's view, at least in the case of ordinal scales, but with the caveat not to make claims about the outcomes of a statistical test that violate the underlying scale. So, for example, if you are on an ordinal scale and you detected a significant effect using a test which would require a ratio scale, you should not claim that the effect is twice as big as another effect but just that it is significant.

Meaningfulness
The above observation brings the debate back to the core issue of what we should pay attention to. Indeed, both Hand [45] and Michell [62,63] argued that the problem is not what operations you can perform with numbers but what kind of inference you wish to make from those operations and how much such inference has to be indicative of what actually happens among real world objects. Adams et al. [2, pp. 99-100] had already explicitly stated that

Statistical operations on measurements of a given scale are not appropriate or inappropriate per se but only relative to the kinds of statements made about them. The criterion of appropriateness for a statement about a statistical operation is that the statement be empirically meaningful in the sense that its truth or falsity must be invariant under permissible transformations of the underlying scale.

These statements opened the way to the development of a full (formal) theory of meaningfulness [29,67,69], which is a central concept to clearly shape and define the questions discussed above: according to the adopted measurement scales, what processing, manipulation, and analyses can be conducted and what can we tell about the conclusions drawn from such processing?
Note that the statement "A mouse weighs more than an elephant" is meaningful even if it is clearly false; indeed, its truth value, i.e. false, does not change whatever weight scale you use (kilograms, pounds, and so on). Therefore, as anticipated above, meaningfulness is a distinct concept from the truth of a statement, and it is somehow close to the notion of invariance in geometry, since the truth value of a statement stays the same independently of the permissible scales used to express it.

Example 5 (Meaningfulness for a Nominal Scale - Example 1 continued). The statement "Most people come from Spain" is meaningful since, if we compute the mode of the values, it is 1 in the case of φ and 17 in the case of φ′, which both correspond to Spain. On the other hand, the statement "The lowest quartile consists of Spanish people" is not meaningful, since it is true with 1 corresponding to Spain in the case of φ but it is false with 13 corresponding to Germany in the case of φ′. Indeed, the first statement about the mode involves just counting, which is an allowable operation for a nominal scale, while the second statement about the lowest quartile requires a notion of ordering not present in a nominal scale.
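These claims can be checked numerically. The sketch below uses the two country codings of Example 1 (with illustrative values where the original listing is ambiguous) and an assumed toy population; it shows that the mode survives a change of coding while an order-based statistic, here simply the minimum, does not.

```python
# Mode is meaningful on a nominal scale; order-based statistics are not.
from statistics import mode

people = ["Spain", "Spain", "Spain", "Germany", "Greece"]  # toy population
phi  = {"Germany": 4, "Greece": 3, "Italy": 2, "Spain": 1}
phi2 = {"Germany": 13, "Greece": -10, "Italy": 23, "Spain": 17}

inv_phi  = {v: k for k, v in phi.items()}
inv_phi2 = {v: k for k, v in phi2.items()}

# The mode points back to the same country under both codings:
print(inv_phi[mode(phi[p] for p in people)])    # Spain
print(inv_phi2[mode(phi2[p] for p in people)])  # Spain
# The "smallest" value names different countries under the two codings:
print(inv_phi[min(phi[p] for p in people)])     # Spain
print(inv_phi2[min(phi2[p] for p in people)])   # Greece
```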
Example 6 (Meaningfulness for an Ordinal Scale - Example 2 continued). Suppose that we have two wineries W_X and W_Y. The first winery W_X produced five bottles as follows: extra brut, extra brut, brut, extra dry, and sec; the second one W_Y produced five bottles as follows: pas dosé, pas dosé, pas dosé, brut, and demi-sec. Therefore, according to the scale φ, we have φ(W_X) = [6 6 12 17 32] and φ(W_Y) = [3 3 3 12 50], while according to the scale φ′ we have φ′(W_X) = [36 36 144 289 1024] and φ′(W_Y) = [9 9 9 144 2500]. The statement "The median of the first winery is greater than that of the second winery" is meaningful since 12 > 3 according to φ is true, as is 144 > 9 according to φ′; so we could safely say that the first winery produces slightly more brut-like wines than the second one, focusing on a more standard product. On the other hand, the statement "The average of the first winery is greater than that of the second winery" is not meaningful since 14.6 > 14.2 according to φ is true but 305.8 > 534.2 according to φ′ is false, which would lead us to draw basically opposite conclusions depending on the scale we use. Indeed, the first statement about the median involves just the notion of ordering, which is allowable on an ordinal scale, while the second statement about the average requires summing values, which is not an allowable operation.
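Example 6 can be reproduced directly in code: the median comparison survives the monotone transformation φ′ = φ², while the mean comparison flips.

```python
# Reproducing Example 6: medians are invariant under a monotone
# transformation, means are not.
from statistics import mean, median

wx = [6, 6, 12, 17, 32]  # winery W_X under phi
wy = [3, 3, 3, 12, 50]   # winery W_Y under phi

sq = lambda xs: [x ** 2 for x in xs]  # the transformation phi' = phi**2

print(median(wx) > median(wy))          # True  (12 > 3)
print(median(sq(wx)) > median(sq(wy)))  # True  (144 > 9)
print(mean(wx) > mean(wy))              # True  (14.6 > 14.2)
print(mean(sq(wx)) > mean(sq(wy)))      # False (305.8 < 534.2)
```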
Example 7 (Meaningfulness for an Interval Scale - Example 3 continued). The statement "Today the difference in temperature between Rome and Oslo is twice as high as it was one month ago" is meaningful. Indeed, if, on the Celsius scale, the temperature today in Rome is 20 °C and in Oslo is 10 °C, while one month ago it was 12 °C and 7 °C, leading to 20 − 10 = 10, which is twice 12 − 7 = 5, on the Fahrenheit scale we would have 68 − 50 = 18, which is twice 53.6 − 44.6 = 9.
Suppose now that we have recorded two sets of temperatures from Paris and Rome:

Paris: 2, 2, 4, 8, 36 °C (35.6, 35.6, 39.2, 46.4, 96.8 °F)
Rome: 1, 2, 4, 15, 34 °C (33.8, 35.6, 39.2, 59.0, 93.2 °F)

The statement "The median temperature in Paris is the same as in Rome" is meaningful, since 4 = 4 in Celsius degrees and 39.2 = 39.2 in Fahrenheit degrees; this is due to the fact that interval scales are also ordinal and quantiles are an allowable operation on ordinal scales.
The statement "The mean temperature in Paris is less than in Rome" is meaningful as well, since 10.4 < 11.2 in Celsius degrees and 50.72 < 52.16 in Fahrenheit degrees; this is due to the fact that addition and subtraction are allowable operations on an interval scale and, as a consequence, the comparison of means is invariant to affine transformations. Indeed, let X = {x_1, x_2, . . . , x_n} and Y = {y_1, y_2, . . . , y_n} be two sets of values on an interval scale; it holds that

(1/n) Σ_{i=1}^{n} (α x_i + β) = α x̄ + β > α ȳ + β = (1/n) Σ_{i=1}^{n} (α y_i + β) ⇔ x̄ > ȳ

since α > 0. Therefore, the statement "The mean of X is greater than the mean of Y" is always meaningful. Finally, the statement "The geometric mean of temperature in Paris is greater than in Rome" is not meaningful, since 5.40 > 5.27 in Celsius degrees and 46.74 < 48.17 in Fahrenheit degrees; this is due to the fact that the geometric mean involves the multiplication and division of values, which are not permitted operations on an interval scale.
Also note that we may be tempted to compare the results of the arithmetic mean with those of the geometric mean to gain "more insights". For example, we might observe that the arithmetic mean in Paris is less than in Rome, 10.4 < 11.2 in Celsius degrees, but the opposite is true when we consider the geometric mean, 5.40 > 5.27 in Celsius degrees. We might thus point out that this is due to the fact that the first (and lowest) value, 2 in Paris, is double the 1 in Rome and that the geometric mean rewards gains at the lowest values; on the other hand, the arithmetic mean rewards gains at the highest values, and thus the 8 in Paris is (almost) half of the 15 in Rome and contributes less. However, while the explanation of why the geometric mean may differ from the arithmetic one is surely credible, the issue here is that the geometric mean cannot be relied upon, nor can conclusions drawn from it, since it is based on operations not allowed on an interval scale; indeed, if we consider exactly the same temperatures just on the Fahrenheit scale, we reach opposite conclusions.
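The Paris/Rome comparison can be checked numerically. The sketch below uses temperature readings that reproduce the statistics quoted in the example (means 10.4 and 11.2 °C, geometric means 5.40 and 5.27 °C, 46.74 and 48.17 °F); under the Celsius-to-Fahrenheit map, the arithmetic-mean comparison is preserved while the geometric-mean comparison flips.

```python
# Arithmetic-mean comparisons are invariant under affine transformations;
# geometric-mean comparisons are not.
from statistics import mean, geometric_mean

paris = [2, 2, 4, 8, 36]   # degrees Celsius
rome  = [1, 2, 4, 15, 34]  # degrees Celsius
f = lambda xs: [9 * x / 5 + 32 for x in xs]  # Celsius -> Fahrenheit

print(mean(paris) < mean(rome))        # True in Celsius ...
print(mean(f(paris)) < mean(f(rome)))  # ... and True in Fahrenheit
print(geometric_mean(paris) > geometric_mean(rome))        # True in Celsius ...
print(geometric_mean(f(paris)) > geometric_mean(f(rome)))  # ... but False in Fahrenheit
```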
Example 8 (Meaningfulness for a Ratio Scale - Example 4 continued). If the air distance between Rome and Padua is (about) 400 kilometers and the air distance between Rome and Oslo is (about) 2,000 kilometers, the statement "Rome and Oslo are five times as distant as Rome and Padua" is meaningful, even when expressed in miles, since 1,242.74 ≈ 5 · 248.54.
On the Kelvin scale for temperature, it does make sense to say that a thing is twice as hot as another thing if, for example, the first one is 273 K (almost 0 °C, 32 °F) and the second one is 546 K (almost 273 °C, 523.4 °F); note, however, that this statement does not hold if we consider Celsius and Fahrenheit degrees, since the ratio 273/0 is undefined in Celsius while 523.4/32 ≈ 16.36 in Fahrenheit (and neither of them is exactly twice). Finally, let us show that the statement "The geometric mean of X is greater than the geometric mean of Y" is always meaningful. Indeed, let X = {x_1, x_2, . . . , x_n} and Y = {y_1, y_2, . . . , y_n} be two sets of values on a ratio scale; it holds that

(Π_{i=1}^{n} α x_i)^{1/n} = α (Π_{i=1}^{n} x_i)^{1/n} > α (Π_{i=1}^{n} y_i)^{1/n} = (Π_{i=1}^{n} α y_i)^{1/n} ⇔ (Π_{i=1}^{n} x_i)^{1/n} > (Π_{i=1}^{n} y_i)^{1/n}

since α > 0.
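The invariance of geometric-mean comparisons under similarity transformations can be checked numerically; the distances below are illustrative.

```python
# On a ratio scale, a similarity transformation phi' = alpha*phi factors
# out of the geometric mean, so geometric-mean comparisons are invariant.
from statistics import geometric_mean

km_x = [400.0, 2000.0, 800.0]  # illustrative distances in kilometres
km_y = [500.0, 1200.0, 900.0]
to_miles = lambda xs: [0.62 * x for x in xs]

print(geometric_mean(km_x) > geometric_mean(km_y))                      # in kilometres
print(geometric_mean(to_miles(km_x)) > geometric_mean(to_miles(km_y)))  # same in miles
# alpha factors out of the geometric mean (up to floating point):
print(abs(geometric_mean(to_miles(km_x)) - 0.62 * geometric_mean(km_x)) < 1e-6)  # True
```

Contrast this with the temperature example above, where the affine offset β does not factor out and the geometric-mean comparison can flip.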

Statistical Significance Testing
Siegel [83] and Senders [82] have discussed the implications of Stevens's classification and permissible operations for statistical inference and for parametric and nonparametric statistical significance tests. We consider the following tests:
• Sign Test [44] is a nonparametric test which looks at the signs of the differences between two paired samples x_i and y_i; the null hypothesis is that the median of the differences is zero.
The sign test requires samples to be on an ordinal scale, since it needs to determine the sign of their difference or, equivalently, which one is greater. Note that the sign test discards tied samples, i.e. those with x_i = y_i.
• Wilcoxon Rank Sum Test (or Mann-Whitney U Test) [104,44] is a nonparametric test which looks at the ranks of two independent samples x_i and y_j; the null hypothesis is that the two samples have the same median.
The Wilcoxon rank sum test requires samples to be on an ordinal scale, since it needs to order them for determining their rank.
• Wilcoxon Signed Rank Test [104,44] is a nonparametric test which looks at the signs and ranks of the differences between two paired samples x_i and y_i; the null hypothesis is that the median of the differences is zero.
The Wilcoxon signed rank test requires samples to be on an interval scale, since it ranks the differences, for which intervals must be equi-spaced. Note that the Wilcoxon signed rank test discards tied samples, i.e. those with x_i = y_i.
• Student's t Test [86] is a parametric test for the null hypothesis that two paired samples x_i and y_i come from normal distributions with the same mean and unknown variance.
The Student's t test requires samples to be on an interval scale, since it needs to compute means and variances.
• ANalysis Of VAriance (ANOVA) [37,55] is a parametric test for the null hypothesis that q samples come from normal distributions with the same mean and unknown, common variance.
ANOVA requires samples to be on an interval scale, since it needs to compute means and variances.
• Kruskal-Wallis Test [54,44] is a nonparametric version of the one-way ANOVA for the null hypothesis that q samples come from distributions with the same median. It is based on the ranks of the different samples and it can be considered as an extension of the Wilcoxon rank sum test to the comparison of multiple systems at the same time.
The Kruskal-Wallis test requires samples to be on an ordinal scale, since it needs to order them for determining their rank.
• Friedman Test [38,39,44] is a nonparametric version of the two-way ANOVA for the null hypothesis that the effects of the q samples are the same. It is based on the ranks of the different samples.
The Friedman test requires samples to be on an ordinal scale, since it needs to order them for determining their rank.
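All of these tests are available in standard statistical packages. The following sketch, assuming SciPy and invented per-topic scores, shows how each test listed above might be invoked; the sign test is obtained via a binomial test on the signs of the non-zero paired differences:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0.55, 0.10, 50)  # hypothetical per-topic scores of system A
y = rng.normal(0.50, 0.10, 50)  # hypothetical per-topic scores of system B

# Sign test: binomial test on the signs of the non-zero paired differences
d = x - y
pos, neg = int(np.sum(d > 0)), int(np.sum(d < 0))
p_sign = stats.binomtest(pos, pos + neg, 0.5).pvalue

p_ranksum = stats.mannwhitneyu(x, y).pvalue    # Wilcoxon rank sum / Mann-Whitney U
p_signedrank = stats.wilcoxon(x, y).pvalue     # Wilcoxon signed rank (paired)
p_ttest = stats.ttest_rel(x, y).pvalue         # paired Student's t
p_anova = stats.f_oneway(x, y).pvalue          # one-way ANOVA
p_kruskal = stats.kruskal(x, y).pvalue         # Kruskal-Wallis

# Friedman needs at least three paired samples; a third system is simulated
z = rng.normal(0.52, 0.10, 50)
p_friedman = stats.friedmanchisquare(x, y, z).pvalue
```

Whether each of these calls is *permissible*, given the scale level of the underlying evaluation measure, is exactly the question discussed in this section.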
As in the case of Stevens' permissible operations, defining which statistical significance tests should be permitted on the basis of the scale properties of the investigated variables raised a lot of discussion and controversy. Anderson [10], along the line of reasoning of Lord, argued that statistical significance tests should be used regardless of scale limitations. Gardner [43] summarizes much of the discussion up to that point, leaning towards not worrying too much about scale assumptions, and suggests that, if and when lack of compliance with measurement scale requirements biases the outcomes of significance tests, transformations can be applied to turn ordinal scales into more interval-like ones, such as averaging the ranks of each score, as proposed by Gaito [41], or using a more complex set of rules, as developed by Abelson and Tukey [1]. Ware and Benson [102] replied to Gardner's positions by further revising the pro and con arguments, concluding that parametric significance tests should be used only when dealing with interval and ratio scales, while nonparametric significance tests should be adopted in the case of ordinal scales. Townsend and Ashby [89] further investigated the issue, highlighting some serious pitfalls one may run into when ignoring the scale assumptions.
We can summarise the discussion with the conclusions of Marcus-Roberts and Roberts [61, p. 391]: "The appropriateness of a statistical test of a hypothesis is just a matter of whether the population and sampling procedure satisfy the appropriate statistical model, and is not influenced by the properties of the measurement scale used. However, if we want to draw conclusions about a population which say something basic about the population, rather than something which is an accident of the particular scale of measurement used, then we should only test meaningful hypotheses, and meaningfulness is determined by the properties of the measurement scale in connection with the distribution of the population."

Let us start our discussion by considering a non-exhaustive list of core IR areas where scales may matter. The most common and basic operation we perform to understand whether a system A is better than a system B is to average their performance over a set of topics and compare these aggregate scores. According to the discussion so far, this leads to meaningful statements only if IR evaluation measures are, at least, interval scales.
Topic difficulty [19] is another central theme in IR because of its importance for adapting the behaviour of a system to the topic at hand. Voorhees [98,99], in the TREC Robust tracks, explored how to evaluate and compare systems designed to deal with difficult topics and proposed to use the geometric mean, instead of the arithmetic one, for Average Precision (AP) [15]. However, the use of a geometric mean further raises the requirements for the evaluation measures, even calling for a ratio scale.
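The divergence between the arithmetic and the geometric mean is easy to reproduce; in the following sketch the two per-topic AP values of each system are invented purely for illustration:

```python
from statistics import geometric_mean

ap_A = [0.05, 0.60]  # hypothetical AP of system A on a hard and an easy topic
ap_B = [0.20, 0.30]  # hypothetical AP of system B: more balanced behaviour

map_A, map_B = sum(ap_A) / 2, sum(ap_B) / 2  # 0.325 vs 0.250 -> A wins by MAP
gmap_A = geometric_mean(ap_A)                # ~0.173
gmap_B = geometric_mean(ap_B)                # ~0.245 -> B wins by GMAP
```

The geometric mean rewards consistency on hard topics, so the two aggregates can reverse the conclusion about which system is better, which is exactly the MAP/GMAP tension discussed below.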
Statistical significance testing has a long history of adoption and investigation in IR, from the early uses of the t-test reported by Salton and Lesk [77], to the discussion on the compliance with the distribution assumptions of significance tests by van Rijsbergen [93], to advocating for a more wide-spread adoption of different types of significance tests by Hull [49], Savoy [79], Carterette [21], Sakai [72], to surveys on the current state of adoption of significance tests by Sakai [74]. Again, drawing meaningful inferences depends on the appropriate use of parametric or nonparametric tests in accordance with the scale properties of the adopted IR evaluation measures.
Several authors have proposed the use of score transformation and standardisation techniques, such as the z-score by Webber et al. [103] and other types of linear (and non-linear) transformations by Sakai [73], Urbano et al. [91], in order to compare performance across collections and to reduce the impact of a few topics skewing the performance distribution. However, in order to ensure meaningful conclusions from these transformations, at least an interval scale would be required.
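The standardisation idea can be sketched as a per-topic z-score (this captures only the spirit of the approach of Webber et al., not its exact formulation): each topic's scores are re-centred by the topic mean and scaled by the topic standard deviation.

```python
import statistics

def z_standardize(topic_scores):
    """Standardise the scores of all systems on one topic (z-score)."""
    mu = statistics.mean(topic_scores)
    sigma = statistics.stdev(topic_scores)
    return [(s - mu) / sigma for s in topic_scores]

# Scores of three systems on one topic: after standardisation the topic
# contributes mean 0 and standard deviation 1, limiting its ability to
# skew the across-topic aggregate.
z = z_standardize([0.2, 0.4, 0.6])  # -> [-1.0, 0.0, 1.0]
```

Note that applying such an affine transformation, and then averaging the transformed scores, implicitly treats the original scores as interval-scaled, which is precisely the assumption under scrutiny here.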
Despite so many aspects of IR evaluation which can be affected by the scale properties of evaluation measures and despite the deep scrutiny that the above techniques have received over the years, there has been much less attention to the implications of the scale assumptions on them.
Robertson [70] was the first to discuss the admissibility of the use of the geometric mean from Stevens's perspective in the context of the TREC Robust track. In particular, Robertson focused on the fact that Mean Average Precision (MAP) and Geometric Mean Average Precision (GMAP) may lead to different conclusions - e.g. blind feedback is beneficial according to MAP but detrimental according to GMAP - and on which of them may hold more (intrinsic) validity. In this respect, Robertson [70, p. 80] observed that "If the interval assumption is not valid for the original measure nor for any specific transformation of it, then any monotonic transformation of the measure is just as good a measure as the untransformed version. If we believe that the interval assumption is good for the original measure, that would give the arithmetic mean some validity over and above the means of transformed versions. If, however, we believe that the interval assumption might be good for one of the transformed versions, we should perhaps favour the transformed version over the original. But if there is no particular reason to believe the interval assumption for any version, then all versions are equally valid. If they differ, it is because they measure different things."
Since both AP and the log-transformation of AP (implied by the geometric mean) are not interval scales, Robertson concluded that no preference could be granted to MAP or GMAP in terms of (intrinsic) validity of their findings. In this way Robertson takes a neutral stance with respect to the debate on whether certain operations should be permitted or not on the basis of the scale properties.
Note that Robertson somehow implicitly indicates transformations as a possible means to turn a not-interval scale into an interval one, as also supported by Gaito [41], Abelson and Tukey [1].
As a final remark, even if Robertson did not mention it explicitly, his reasoning seems to be loosely related to the concept of meaningfulness when he says [p. 80]: "Good robustness would be indicated if the conclusions looked the same whatever transformation we used; if we found it easy to find transformations which would substantially change the conclusions, then we might infer that our conclusions are sensitive to the interval assumption, and that the different transformations measure different things in ways that may be important to us", still keeping a neutral stance about what should or should not be done.
Fuhr [40] took a firm position and argued that Mean Reciprocal Rank (MRR) [84] should not be computed because: 1. in general, RR is just an ordinal scale and, according to Stevens, means cannot be computed for an ordinal scale; 2. in particular, RR has some counter-intuitive behaviour. On the other hand, Sakai [75] has recently disagreed with Fuhr: 1. in general, on the claim that means should not be computed for an ordinal scale, using arguments similar to those discussed in Section 2.3; 2. in particular, on the use of RR, which Sakai finds quite useful from a practical point of view.
Whatever stance you wish to take about whether (or not) operations should be constrained by scale properties, from the discussion so far it clearly emerges that IR needs further and systematic investigation of the implications and impact of derogating from compliance with scale properties. Moreover, most of the above discussion is just about averaging values and does not tackle the implications for statistical significance testing. Finally, and more importantly, we completely lack a thorough discussion on, and any adoption of, the notion of meaningfulness in IR, and this is quite striking for a discipline so strongly rooted in experimentation and so much based on inference.

A Formal Theory of Scale Properties for IR Evaluation Measures
Ferrante et al. [31,32,33] leveraged the representational theory of measurement for developing a formal theory of IR evaluation measures which allows us to determine the scale properties of an evaluation measure. In particular, they defined an ERS for system runs and used two basic operations -swap, i.e. swapping a relevant with a not-relevant document in a ranking, and replacement, i.e. substituting a relevant document with a not-relevant one -to study how runs are ordered. In this way, they demonstrated that there exists a partial order of runs where, when runs are comparable, all the measures agree on the same way of ordering them; however, when runs are not comparable, measures may disagree on how to order them. By using properties of the partial orders and theorems from the representational theory of measurement, they were able to define an interval scale measure φ and to check whether there is any linear transformation between such measure φ and IR evaluation measures, in order to determine if the latter are interval scales too.
In short, Ferrante et al. found that, for a single topic:
• set-based evaluation measures:
- binary relevance: Precision, Recall, and the F-measure are interval scales;
- multi-graded relevance: Generalized Precision (gP) and Generalized Recall (gR) are interval scales only if the relevance degrees are on a ratio scale;
• rank-based evaluation measures:
- binary relevance: Rank-Biased Precision (RBP) [65] is an interval scale only for p = 1/2; Average Precision (AP) is not an interval scale;
- multi-graded relevance: Graded Rank-Biased Precision (gRBP) is an interval scale only for p = G/(G + 1), where G is the normalized smallest gap between the gain of two consecutive relevance degrees, and the relevance degrees themselves are on a ratio scale; Discounted Cumulative Gain (DCG) [50] and Expected Reciprocal Rank (ERR) [23] are not interval scales.
Ferrante et al. [33] also studied what they called the induced total order, i.e. pretending that runs in the ERS are ordered by the actual values of a measure. Also in this case, which is the most "favourable" one for IR evaluation measures, they fail to be interval scales. Figure 1 shows the Hasse diagram [26] which represents the partial order among all the runs of length N = 4. In the figure, vertices are runs while edges represent the direct predecessor relation, that is, if r ≺ s, i.e. r and s are comparable, then r is below s in the diagram. Note that if r and s lie on the same horizontal level of the diagram, then they are incomparable; furthermore, elements on different levels may be incomparable as well. In the example, (1, 1, 0, 1) ≺ (1, 1, 1, 0), (1, 1, 0, 0) ≺ (1, 1, 1, 0), and (1, 0, 1, 1) ≺ (1, 1, 1, 0) are all comparable; therefore, all IR measures agree on these runs and order them in the same way. On the other hand, (1, 1, 0, 0) and (1, 0, 1, 1) are not comparable, as well as (1, 1, 0, 0) and (0, 1, 1, 1), and IR measures may disagree on how to order them; as a consequence, measures will order these runs differently, producing different Rankings of Systems (RoS).
The difference in the RoS produced by evaluation measures is what is studied when performing a correlation analysis among measures, e.g. by using Kendall's τ [52]; practical wisdom says that measures should be neither too highly correlated - otherwise it practically makes no difference using one or the other - nor too weakly correlated - otherwise it may be an indicator of some "pathological" behaviour of a measure. Indeed, each evaluation measure embodies a different user model [20], i.e. a different way in which the user interacts with the ranked result list and derives gain from the retrieved documents, and the differences between the RoS produced by different evaluation measures, and as a consequence their Kendall's τ, may be considered the tangible manifestation of such different user models. Note that the work by Ferrante et al. provides a formal explanation of what originates differences in Kendall's τ: on all the runs which are comparable in the Hasse diagram, the Kendall's τ between different measures is 1, since all of them order these runs in the same way; on the runs which are not comparable in the Hasse diagram, the Kendall's τ between different measures may be less than 1, since they may order these runs differently; therefore, these incomparable runs are where user models differentiate themselves and can take a different stance.
However, these differences in the RoS are not what causes IR evaluation measures not to be interval scales; they would just mean that IR evaluation measures are different scales. The real problem with IR evaluation measures is that their scores are not equi-spaced and thus they cannot be interval scales, as explained in Section 2.2. This issue is depicted in Figure 2, which shows how different measures - namely, Precision (and Recall), AP, RR, RBP with p ∈ {0.3, 0.5, 0.8}, and DCG with log base 2 - order and space the runs shown in the Hasse diagram of Figure 1.
We can observe that only Precision (Recall) and RBP with p = 0.5 produce equi-spaced values, while all the other measures violate this assumption, which is required to obtain an interval scale; in other terms, Figure 2 visually represents the issue found by Ferrante et al. [33] even when using the induced total order.

[Figure 2: ordering and spacing of the runs of the Hasse diagram of Figure 1 by different evaluation measures (Precision/Recall, Average Precision, Reciprocal Rank, Rank-Biased Precision with p ∈ {0.3, 0.5, 0.8}, Discounted Cumulated Gain with log base 2). Each blue square corresponds to a score of a given measure; on the right of the square, the run corresponding to that score is reported and, in case of tied runs, i.e. runs for which the measure produces the same score, they are all listed on the right of the square.]

We can also note that all the measures agree only on the common comparable runs - i.e. (0, 0, 0, 0) ≺ (0, 0, 0, 1) ≺ (0, 0, 1, 0) and (1, 1, 0, 1) ≺ (1, 1, 1, 0) ≺ (1, 1, 1, 1) - but, as soon as incomparable runs come into play, they start to disagree on how to order them. Finally, looking at Figure 2 we can notice how IR measures behave differently in violating the equi-spacing assumption. RBP with p ∈ {0.3, 0.8} and DCG follow a somewhat regular pattern, where scores are not equi-spaced but they are in some way evenly clustered and they are symmetric if you fold the figure along its middle horizontal axis; on the other hand, AP and RR follow a much more irregular and not symmetric pattern.
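The equi-spacing of the scores can be checked mechanically over all 16 binary runs of length 4; a small sketch, with the measures defined in their usual textbook formulations:

```python
from itertools import product

def rbp(run, p):
    # Rank-Biased Precision: (1 - p) * sum_i rel_i * p^(i-1)
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(run))

def rr(run):
    # Reciprocal Rank of the first relevant document (0 if none retrieved)
    return next((1 / (i + 1) for i, rel in enumerate(run) if rel), 0.0)

def equi_spaced(scores, tol=1e-9):
    # True if the distinct sorted scores all have (numerically) equal gaps
    v = sorted(set(scores))
    gaps = [b - a for a, b in zip(v, v[1:])]
    return max(gaps) - min(gaps) < tol

runs = list(product([0, 1], repeat=4))   # all 16 binary runs of length 4
prec = [sum(r) / 4 for r in runs]

# Precision and RBP with p = 0.5 yield equi-spaced scores ...
assert equi_spaced(prec)
assert equi_spaced([rbp(r, 0.5) for r in runs])
# ... while RBP with p = 0.8 and RR do not
assert not equi_spaced([rbp(r, 0.8) for r in runs])
assert not equi_spaced([rr(r) for r in runs])
```

For p = 0.5 the distinct RBP scores are exactly k/16 for k = 0, ..., 15, which is why the runs appear evenly spaced in the corresponding panel of Figure 2.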

We can also note how these measures spread values over their range differently. Precision (and Recall) and DCG spread their values over the whole possible range, while this is not always the case with RBP. Indeed, RBP assumes runs of infinite length and normalizes by the 1/(1−p) factor. However, we deal with runs of limited length, for which the 1/(1−p) factor is an overestimation: the larger p and the shorter the run, the bigger the overestimation - this is most clearly visible in the case of RBP with p = 0.8 in Figure 2f. Finally, AP, RBP with p = 0.3, and RR, i.e. the measures farthest from being interval scales, leave large portions of their possible range completely unused. In particular, AP leaves one quarter of its range unused, in the top part roughly corresponding to the first quartile of the possible values; RR leaves one half of its range unused, in the top part roughly corresponding to the first and second quartiles of the possible values; and, finally, RBP with p = 0.3 leaves half of its range empty, in the middle part roughly corresponding to the second and third quartiles of the possible values.
Why does it matter whether the values are equi-spaced and how they are spread over their range? Consider a random variable X that takes values in the set {0, 1, 2, 4, 13}. Even if all these five values are obtained with equal probability, i.e. the random variable is uniform, the mean and the median of the variable differ, the mean being equal to 4 and the median to 2. This shows how the lack of equi-spacing causes some sort of "imbalance" even in the case of a uniform variable, which may be an undesirable situation from the measurement point of view, at least if not explicitly considered and accounted for. Furthermore, when we compute P[X ∈ (x − ε, x + ε)], i.e. the probability that the value of X equals x with an error smaller than ε, this function is not constant over the range but is greater for values around {0, 1, 2} than for those around {4, 13}. As a consequence, the same accuracy in approximating the value of X produces a different precision in the measurement depending on the value x we are considering. Note that in the present toy model this happens for ε ≥ 1, but a suitable modification of the model can produce the same behaviour for any ε > 0 set in advance.
As a further example, let us consider a measure with a limited range of equi-spaced values. If we draw a set of random values taken from this range and consider its arithmetic mean, by the law of large numbers this mean converges to the middle point of the range interval. This property is independent of the distance between subsequent values, i.e. of the unit of measurement chosen. So we can use such a procedure - the convergence of the mean towards the middle of the range - to "calibrate" the measuring instrument, independently of the specific unit of measurement chosen. This is no longer possible if the values are not equi-spaced.
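Both effects can be simulated in a few lines; below, the non-equi-spaced support {0, 1, 2, 4, 13} is the toy variable from above, while {0, 1, 2, 3, 4} plays the role of an equi-spaced measure:

```python
import random
import statistics

skewed = [0, 1, 2, 4, 13]   # non-equi-spaced support
even = [0, 1, 2, 3, 4]      # equi-spaced support

# Uniform draws, yet mean (4) and median (2) of the skewed variable differ
m, md = statistics.mean(skewed), statistics.median(skewed)

# "Calibration" via the law of large numbers: for the equi-spaced support
# the empirical mean converges to the midpoint of the range (2.0); for the
# skewed one it converges to 4, far from the midpoint 6.5 of its range.
random.seed(0)
est_even = statistics.mean(random.choice(even) for _ in range(100_000))
est_skewed = statistics.mean(random.choice(skewed) for _ in range(100_000))
```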
Example 9 (Effect of RR not being equi-spaced). Let us assume that we have two queries and two systems. System A returns the first relevant document at ranks 1 and 4, respectively, while system B finds the relevant answers in both cases at rank 2. Computing the MRR of the two systems, i.e. the average value of RR, we get MRR(A) = (1/1 + 1/4)/2 = 0.625 and MRR(B) = 0.5, telling us that system A is better than B. However, if instead of the reciprocal ranks we regard the ranks themselves, we have equi-spaced values forming an interval scale (actually, even a ratio scale). In our example, system A finds the first relevant item on average at rank 2.5, which is worse than the average rank 2 of system B - so we would get the opposite finding when we use a scale still based on the rank of relevant documents but properly equi-spaced.

Table 1: Example for AP not being equi-spaced

Example 10 (Effect of AP not being equi-spaced). Table 1 shows an example of two system pairs (A, B) and (C, D) and two queries, for which we compute AP values. In the first case, AP says that A performs better than B, while in the second case, C is worse than D. Why is this effect related to AP not being on an interval scale? Because in both examples, the runs retrieved by the two systems for a given topic have the same relevance degrees in the first two positions and just a swap of a relevant with a not-relevant document in the last two positions. So, for the same loss of relevance due to the swap in the last two positions, while keeping the same relevance in the first two positions, AP "reacts" in one case telling us that system A is better than system B and in the other that D is better than C; this is also due to the not equi-spaced values of AP, e.g. the runs ranked 13 and 14 are much closer than the runs ranked 10 (on the left branch of Figure 1) and 9, as shown in Figure 2b.
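Example 9 can be verified in a few lines; the rank lists below are exactly those of the example:

```python
def mrr(first_rel_ranks):
    # Mean Reciprocal Rank over the ranks of the first relevant document
    return sum(1 / r for r in first_rel_ranks) / len(first_rel_ranks)

def mean_rank(first_rel_ranks):
    # Mean of the ranks themselves: an equi-spaced (even ratio) scale
    return sum(first_rel_ranks) / len(first_rel_ranks)

ranks_A, ranks_B = [1, 4], [2, 2]

# MRR prefers system A (0.625 > 0.5) ...
assert mrr(ranks_A) == 0.625 and mrr(ranks_B) == 0.5
# ... while the mean rank prefers system B (rank 2.5 is worse than rank 2.0)
assert mean_rank(ranks_A) == 2.5 and mean_rank(ranks_B) == 2.0
```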
Note that here we are neither questioning the top-heaviness of a measure nor its capability of reflecting user preferences but rather we point out how the lack of equi-spaced values affects the assessment supported by a measure.
The fact that IR evaluation measures, apart from Precision, Recall, and RBP with p = 0.5, are not interval scales leads to the general issues with computing means, statistical tests, and meaningfulness discussed in Sections 2.2 to 7.3 and shown in Examples 3 and 7. In addition, Examples 9 and 10 above show how the lack of equi-spacing may also lead to statements like "system A is better than B" (or vice versa) which are not always intuitive all over the scale.

Averaging across Topics and Correlation Analysis Revisited
The fact that Precision and Recall are interval scales makes addition and subtraction permissible operations and, as a consequence, computing arithmetic means permissible too. Therefore, it is safe to average performance of IR systems across topics when we use Precision and Recall. But is that really true?
As said, Ferrante et al. [33] have found an interval scale φ, called Set-Based Total Order (SBTO), and have shown that both Precision and Recall are an affine transformation of this interval scale and thus also an affine transformation of each other. Ferrante et al. [34] have raised this question: since Precision and Recall are transformations of the same interval scale, they are ordinal scales too and should rank systems in the same way. Therefore, if they produce the same RoS, the Kendall's τ correlation between them should be 1. So why is their Kendall's τ correlation 0.8588 on the TREC 8 Ad-hoc data?
Let us consider how correlation analysis between evaluation measures works. Given two rankings X and Y , their Kendall's τ correlation is given by τ(X, Y) = (P − Q) / √((P + Q + T)(P + Q + U)), where P is the total number of concordant pairs (pairs that are ranked in the same order in both vectors), Q the total number of discordant pairs (pairs that are ranked in opposite order in the two vectors), and T and U are the numbers of ties in the first and in the second ranking, respectively. τ ∈ [−1, 1], where τ = 1 indicates two perfectly concordant rankings, i.e. in the same order, τ = −1 indicates two fully discordant rankings, i.e. in opposite order, and τ = 0 means that 50% of the pairs are concordant and 50% discordant.
The typical way of performing correlation analysis is as follows: let φ1 and φ2 be two evaluation measures; in our case, φ1 is Precision and φ2 is Recall. Let Φ1 and Φ2 be two T × S matrices where each cell contains the performance on topic i of system j according to measures φ1 and φ2, respectively. Therefore, Φ1 and Φ2 represent the performance of S different systems (columns) over T topics (rows). Let Φ̄1 and Φ̄2 be the column-wise averages of the two matrices, i.e. the average performance of each system across the topics. If you sort systems by their scores in Φ̄1 and Φ̄2, you obtain the two RoS corresponding to φ1 and φ2, respectively, and you can compute the Kendall's τ correlation between these two RoS. This is the traditional way of computing the correlation between two evaluation measures and Ferrante et al. call it overall correlation, since it first computes the average performance across the topics and then computes the correlation between evaluation measures. This approach leads to a Kendall's τ correlation of 0.8588 between Precision and Recall.
Ferrante et al. proposed a different way of computing the correlation, called topic-by-topic correlation: for each topic i, they consider the RoS on that topic corresponding to φ1 and the one corresponding to φ2, i.e. they consider the i-th rows of Φ1 and Φ2, respectively; they then compute the Kendall's τ correlation between the two RoS on that topic. Therefore, they end up with a set of T correlation values, one for each topic. Using this way of computing correlation, Ferrante et al. found that the Kendall's τ correlation between Precision and Recall is always 1 for all the topics, which is the result expected for two interval scales that order systems in the same way.
Therefore, if you consider each topic alone, Precision and Recall are just transformations of the same interval scale, as Celsius and Fahrenheit are, and their Kendall's τ correlation is 1. However, if you first average across topics, which should be a permitted operation for interval scales, and then compute the Kendall's τ correlation, it is no longer 1. This was somewhat surprising and unexpected. Indeed, as an example from another domain, if you take a matrix of scores in Celsius degrees and another one with the corresponding Fahrenheit degrees, their Kendall's τ correlation is always 1, whether you compute it row-by-row (i.e. our topic-by-topic correlation) or you first average across rows and then compute it (i.e. our overall correlation).
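The phenomenon can be reproduced on a toy dataset: three systems, two topics with recall bases 1 and 4, and runs of fixed length 5 (all numbers are invented; SciPy's kendalltau is assumed):

```python
from scipy.stats import kendalltau

n = 5                                            # every run retrieves 5 documents
RB = [1, 4]                                      # recall base of topic 1 and topic 2
rel = {"A": [1, 4], "B": [1, 2], "C": [0, 4]}    # relevant retrieved per topic
systems = sorted(rel)

prec = {s: [r / n for r in rel[s]] for s in systems}
recall = {s: [r / RB[t] for t, r in enumerate(rel[s])] for s in systems}

# Topic-by-topic correlation: Precision and Recall order systems identically
topic_taus = []
for t in range(len(RB)):
    tau, _ = kendalltau([prec[s][t] for s in systems],
                        [recall[s][t] for s in systems])
    topic_taus.append(tau)                       # 1.0 on every topic

# Overall correlation: averaging first mixes a different scale per topic
mean_prec = [sum(prec[s]) / len(RB) for s in systems]
mean_recall = [sum(recall[s]) / len(RB) for s in systems]
tau_overall, _ = kendalltau(mean_prec, mean_recall)   # 1/3
```

On every single topic both measures are monotone in the number of relevant retrieved documents, so their per-topic τ is 1; averaging first mixes the per-topic scales and drops the overall τ to 1/3.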
Ferrante et al. [34, p. 305] explained this behaviour as due to the recall base: "Recall heavily depends on the recall base which changes for each topic and it is used to normalize the score for each topic; therefore, in a sense, recall on each topic changes the way it orders systems". We further investigate this issue in Section 3.4 below, where we provide details and demonstrations, but here we summarise the sense of our findings. The difference between overall and topic-by-topic correlation is basically due to the fact that we are using a different interval scale for each topic. These scales are indeed transformations of one another within each topic, and this is why the topic-by-topic correlation is 1; however, since we change scale from one topic to another, when averaging across topics we are mixing different scales, and this is why the overall correlation is different from 1.
Example 11 (Recall corresponds to different scales on different topics). Let us consider Recall and let us assume that we have three queries q1, q2, q3, with one, two and three relevant documents, respectively. Then, the possible values of Recall are as follows: for q1 we have 0 and 1; for q2 we have 0, 1/2 and 1; and for q3 we have 0, 1/3, 2/3 and 1. One might think of embedding all these values into the single equi-spaced scale 0, 1/6, 2/6, 3/6, 4/6, 5/6, 1; however, the values 1/6 and 5/6 are not possible for our three example topics, and impossible values are not considered in the definition of the equidistance property of interval scales. Only if we had a fourth query with six relevant documents would this scale be fine. In most cases, however, no such scale exists, and so the aggregated scale is not an interval one.
The fact that we may be changing scale from topic to topic has very severe consequences. All the debate originated by Stevens's permissible operations and the possibility of averaging only from interval scales onwards has always been based on the obvious assumption that the averaged values are all drawn from the same scale; no one has ever questioned that averaging values coming from different scales is not possible, because this would be like mixing apples with oranges. So, what is the meaningfulness of typical statements like "System A is (on average) better than system B" when we are not only violating the interval scale assumptions but, even more seriously, mixing different scales? What about the meaningfulness of typical statements like "System A is significantly better than system B"? The debate between using parametric or nonparametric tests concerns how much you wish to comply with the interval scale assumptions but, undoubtedly, all the significance tests, when aggregating values, expect them to be drawn from the same scale.
If we wish to make an analogy, it is like the difference between using mass and weight, Precision being similar to mass and Recall to weight. It would be somehow safe to average the masses of bodies coming from different planets, but it would not be safe to average their weights, due to the different gravity on the different planets. The recall base is what changes the gravity from planet/topic to planet/topic in the case of Recall.
However, even Precision is not completely "safe" because, when the length of the run changes, its scale changes as well. As a consequence, we may end up using different scales from one run to another, and this can happen not only across topics, as in the case of Recall, but also within a topic, if two or more runs retrieve a different number of documents for that topic. This issue affects the evaluation of classical Boolean retrieval, where Precision and Recall are computed for the set of documents retrieved for each query and then averaged over all queries; we have to conclude that this procedure is seriously flawed. Luckily, in most of today's evaluations, the length of the run has a much smaller effect because, in typical TREC settings, almost all the runs retrieve 1,000 documents for each topic and just a few of them retrieve fewer documents; this effect also (practically) disappears when you consider Precision at low cut-offs, like P@10, where it is almost guaranteed that all the runs retrieve 10 documents.
Summing up, independently of an evaluation measure being an interval scale or not, the recall base (greatly) and the length of the run (to a lesser extent) cause the scale to change from topic to topic and/or from run to run. This makes averaging across topics, as well as other forms of aggregation used in significance tests, problematic at best. We show how and why this happens in the case of Precision (Section 3.4.1) and Recall (Section 3.4.2), which are the simplest measures one can think of, since each of them is affected by just one of the two factors, the length of the run or the recall base alone. We also consider the more complex case of the F-measure (Section 3.4.3), which changes both the run length and the recall base at the same time. Therefore, we hypothesise that these issues may be even more severe in the case of more complex evaluation measures, like AP and others, which are not even interval scales and mix recall base and run length with rank position and various forms of utility accumulation and stopping behaviour.
Finally, also the way in which we interpret the results of correlation analysis may be impacted. Indeed, we typically attribute differences in correlation values to the different user models embedded by evaluation measures. The rule-of-thumb by Voorhees [96,97] is that an overall correlation above 0.9 means that two evaluation measures are practically equivalent, an overall correlation between 0.8 and 0.9 means that two measures are similar, while dropping below 0.8 indicates that measures are departing more and more. Therefore a correlation of 0.8588 would suggest that Precision and Recall share some commonalities but they differ enough due to their user models, still not being pathologically different. However, we (now) know that they are just the transformation of the same interval scale and that this correlation value is just an artifact of mixing different scales across topics rather than an intrinsic difference in the user models of Precision and Recall.

Why May Scales Change from Topic to Topic or from Run Length to Run Length?
As discussed above, Ferrante et al. [33] have demonstrated that Precision, Recall, and F-measure are interval scales when you fix the length of the run N and the recall base RB, i.e. they are a homomorphism with respect to the same ordering of runs in the ERS. However, if we mix together runs with different bounded lengths and/or different bounded recall bases, Precision, Recall, and F-measure are no longer interval scales, they are no longer affine transformations of each other, and they even order the runs in different ways. Clearly, this is a severe issue when you need to average (or compute any other aggregate) across different topics or across runs with different lengths. Let us consider the universe set S[N, K] which contains all the runs of any possible length n, less than or equal to N, and with respect to all the possible recall bases RB, less than or equal to K. To avoid trivial cases, we always consider N and K greater than or equal to 1. A run in S[N, K] is represented by a triple [r, n, RB], where r indicates the number of relevant documents retrieved by the run, n is the length of the run, and RB is the recall base, i.e. the total number of relevant documents for a topic. Note that, for each run in S[N, K], it holds that n ≤ N and RB ≤ K by construction, but we also have r ≤ (n ∧ RB), where x ∧ y = min{x, y}, i.e. there is an (implicit) dependence on the recall base when it comes to the number of relevant retrieved documents.
We define S n,RB as the set which contains all the runs with the same length n and with respect to the same recall base RB. Therefore, we can express the universe set S[N, K] as the union of such sets, namely S[N, K] = ∪ S n,RB over all n ≤ N and RB ≤ K. Each S n,RB models the typical case of runs all with the same length for a given topic (or for a set of topics which have the same recall base). This is exactly the case for which Ferrante et al. [33] have demonstrated that Precision, Recall and F-Measure are interval scales and an affine transformation of each other. However, this holds for each S n,RB separately, while the issue we discuss in this section is what happens when you mix different S n,RB, i.e. when you go towards S[N, K].

Precision
Precision is equal to the fraction of the retrieved documents that are relevant. Therefore, for a run represented by the triple [r, n, RB], Precision is given by Prec[r, n, RB] = r/n. Let us start from S n,RB: Prec maps this set into the set {0, 1/n, 2/n, . . . , (n ∧ RB)/n} and it has been proven by Ferrante et al. that Prec is an interval scale in this case. However, already in this simpler case, there is an (implicit) dependency on the recall base when it comes to the possible values of Precision. Therefore, even when we consider runs with the same length but for topics with different recall bases, i.e. S n,RB1, S n,RB2, S n,RB3, ..., we are dealing with different scales, all embedded in the single interval scale whose image is {0, 1/n, 2/n, . . . , (n ∧ max{RB i})/n}.
To understand the problems arising when mixing different lengths and recall bases, let us consider the general scenario of Precision defined on S[N, K]. This is the case where we consider the Precision measure defined on the set of the runs of any possible bounded length and recall base, and we find that it is an interval scale only in the almost trivial cases of N ≤ 2. Indeed, Prec maps S[1, K], for any K ≥ 1, into the set {0, 1} and it is an interval scale since these values are equispaced. When N = 2, Prec maps S[2, K], for any K ≥ 1, into the set {0, 1/2, 1}; since the values are equispaced, Prec is still an interval scale. To compare the order induced on these sets by Prec (and the other measures), let us consider in more detail S[2, 2], which contains nine admissible triples [r, n, RB]. Continuing with a similar construction for N = 3, we obtain that Prec assumes the four possible values {0, 1/3, 1/2, 1} when K = 1, and the five possible values {0, 1/3, 1/2, 2/3, 1} when K ≥ 2. Indeed, for runs of length at most 3, these are all the possible values of the fraction r/n for 1 ≤ n ≤ 3 and 0 ≤ r ≤ min{3, K}. Since these values are not equispaced, this is sufficient to conclude that Prec is not an interval scale on S[3, K].
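The non-equispaced image can be checked by brute force. The sketch below (helper names are ours) enumerates the Precision values on S[N, K] with exact fractions:

```python
from fractions import Fraction

def prec_values(N, K):
    """Sorted distinct Precision values r/n over all admissible runs in S[N, K]."""
    return sorted({Fraction(r, n)
                   for n in range(1, N + 1)
                   for RB in range(1, K + 1)
                   for r in range(min(n, RB) + 1)})

def equispaced(values):
    """True if consecutive values are separated by one single common gap."""
    return len({b - a for a, b in zip(values, values[1:])}) == 1

print(prec_values(2, 2), equispaced(prec_values(2, 2)))  # {0, 1/2, 1} -> True
print(prec_values(3, 2), equispaced(prec_values(3, 2)))  # {0, 1/3, 1/2, 2/3, 1} -> False
```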
A similar argument shows that Prec is not an interval scale on S[N, K] for any N ≥ 3.

Recall
The Recall measure depends explicitly on the recall base RB, i.e. the total number of relevant documents available for a given topic: Recall[r, n, RB] = r/RB. Note that, for any admissible run [r, n, RB], i.e. for which r ≤ n ∧ RB, its recall value (implicitly) depends on n, creating a situation that mirrors the one of Precision.
Recall is an interval scale on S n,RB, since it is an affine transformation of Prec, as demonstrated by Ferrante et al. However, due to the (implicit) dependency on n, even when we consider topics with the same recall base but runs with different lengths, i.e. S n1,RB, S n2,RB, S n3,RB, ..., we are dealing with different scales, as discussed below.
When we define Recall on S[N, K], for K > 2, we have that this measure is no longer interval, thanks to an argument similar to that used for Precision. Indeed, the two smallest non-zero values of Recall on S[N, K] are 1/K and 1/(K − 1), obtained for a run with a unique relevant document with respect to a topic with RB = K and RB = K − 1, respectively; since these values are not equispaced, Recall cannot be an interval scale.
Furthermore, it is immediate to see that Recall and Precision induce, for any N ≥ 2 and K ≥ 2, two different orderings on S[N, K], i.e. they become two different scales. Indeed, for any two runs [r1, n1, RB1] and [r2, n2, RB2], we have that Prec[r1, n1, RB1] < Prec[r2, n2, RB2] if and only if r1/n1 < r2/n2, while Recall[r1, n1, RB1] > Recall[r2, n2, RB2] if and only if r1/RB1 > r2/RB2. Both these conditions are satisfied when n2/n1 < r2/r1 < RB2/RB1. For example, if we take r1 = r2, n1 = 2n2 and RB2 = 2RB1, the previous condition is satisfied and the two runs are ordered in different ways by the two measures.
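Picking concrete numbers that satisfy the condition above (r1 = r2 = 1, n1 = 2n2 = 10, RB2 = 2RB1 = 6), we can verify the disagreement in a minimal sketch:

```python
from fractions import Fraction

def prec(r, n, RB):
    return Fraction(r, n)

def recall(r, n, RB):
    return Fraction(r, RB)

run1 = (1, 10, 3)  # [r, n, RB]
run2 = (1, 5, 6)   # n2/n1 = 1/2 < r2/r1 = 1 < RB2/RB1 = 2

assert prec(*run1) < prec(*run2)      # Precision prefers run2 ...
assert recall(*run1) > recall(*run2)  # ... while Recall prefers run1
```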

F1 Measure
The F1 measure is the harmonic mean of Precision and Recall, i.e. F1[r, n, RB] = 2r/(n + RB). When we define F1 on S[N, K], for N ≥ 3 or K ≥ 3, we have that this measure is no longer an interval scale, as can be easily seen since the three smallest values of the image are 0, 2/(N + K) and 2/(N + K − 1), which are not equispaced. At the same time, using an example similar to the one used for Recall, we obtain that the ordering induced on S[N, K] by F1 in these latter cases differs from both the orderings induced by Prec and Recall.
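We can again check this by enumeration; the sketch below (helper names are ours) lists the smallest F1 values on S[3, 3] with exact fractions:

```python
from fractions import Fraction

def f1(r, n, RB):
    """F1 = 2r / (n + RB), the harmonic mean of Precision r/n and Recall r/RB."""
    return Fraction(2 * r, n + RB)

N = K = 3
values = sorted({f1(r, n, RB)
                 for n in range(1, N + 1)
                 for RB in range(1, K + 1)
                 for r in range(min(n, RB) + 1)})
print(values[:3])  # [0, 1/3, 2/5]: gaps of 1/3 and 1/15, hence not equispaced
```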

Summary and Discussion
We have demonstrated that, when we consider runs with a fixed length n and with respect to a fixed recall base RB, i.e. we consider S n,RB and runs of the same length for the same topic (or, more generally, topics with the same recall base), Precision, Recall, and F-measure are interval scales and they are an affine transformation of each other. As a consequence, they order runs in the same way and their Kendall's τ is 1.
However, when we start mixing runs with different lengths and/or with respect to different recall bases, the situation quickly gets more complicated. Only in the trivial (and not very useful in practice) case N = 2 and K = 2, i.e. S[2, 2], where we have runs of length 1 or 2 and topics with 1 or 2 relevant documents, are Precision and Recall still both interval scales, but they cease to be an affine transformation of each other. As a consequence, they order runs in different ways and their Kendall's τ is less than 1. F-measure already ceases to be an interval scale and orders runs in yet another way than Precision and Recall, leading to a Kendall's τ less than 1. For N > 2 and K > 2, all of them (Precision, Recall, and F-measure) cease to be interval scales, departing from the interval assumption more and more, and they order runs in three completely different ways, again leading to a Kendall's τ less than 1. In the special case where we fix the length, Precision is still an interval scale, while if we fix a single recall base, Recall is still an interval scale; but in both cases the other measure is no longer interval and also orders the runs in a different way.
We may be tempted to consider it positive that, sooner or later, Precision, Recall, and F-measure start ordering runs in different ways and that their Kendall's τ is less than 1. Indeed, this is what we expect from evaluation measures: to embed different user models and to reflect different user preferences in ordering runs. This is also one of the main reasons why there is debate and why we might accept to derogate from requiring them to be interval scales: reflecting user preferences could be more important than complying with rigid assumptions.
However, we should carefully consider how this is happening. They initially are the "same" scale (except for an affine transformation), when we use them to measure objects with some shared characteristics, i.e. the same run length n, and with respect to a similar context, i.e. the same recall base RB. However, as soon as we measure objects with more mixed characteristics and contexts, they cease to be the "same" interval scale and only at that point do they begin to order runs differently. This is more or less like saying that kilograms and pounds are the "same" interval scale only when we weigh people with the same height and from the same city but, as soon as we weigh people with different heights and/or coming from flatland or mountains, they become two different scales and possibly cease to be interval scales at all. This would sound odd and quite different from saying that weight and temperature are different (interval) scales because they measure different attributes/properties of an object or, in our terms, because they reflect different user preferences.
Why does this happen? Because run length n and recall base RB change. This is very clear and somehow more extreme in F-measure, where both n and RB explicitly appear in the equation of the measure.
We hypothesize that this could be even more severe and extreme in the case of rank-based measures since they not only combine, implicitly or explicitly, the two factors n and RB, but they also mix them with the rank of a document and various discounting and accumulation mechanisms. Figure 2 gives a taste of this much more complex situation: it shows the simple (and somehow safe) case of S 4,4 and it already emerges how different the behaviours and patterns are in violating or complying with the interval scale assumption.
Why does this matter? As already said, because we need to aggregate scores across topics and runs and to compute significance tests. We do not only have the problem of how much evaluation measures violate the interval scale assumptions, required to compute aggregates, but also the issue of not mixing apples and oranges, i.e. scores from different scales, required to make aggregates sensible. In this respect, run length is a less severe issue which can be easily mitigated in practice, either by forcing a given length or because we are interested in lower cut-offs, e.g. 5, 10, 20, 30. The effect of the recall base can be mitigated by adopting measures that do not explicitly depend on it, even if the implicit dependency due to the capping of the image values will remain.

Related Works
van Rijsbergen [92] was the first to tackle the issue of the foundations of measurement for IR by exploiting the representational theory of measurement. He observed that [92, pp. 365-366] "The problems of measurement in information retrieval differ from those encountered in the physical sciences in one important respect. In the physical sciences there is usually an empirical ordering of the quantities we wish to measure. For example, we can establish empirically by means of a scale in which masses are equal, and which are greater or lesser than others. Such a situation does not hold in information retrieval. In the case of the measurement of effectiveness by precision and recall, there is no absolute sense in which one can say that one particular pair of precision/recall values is better or worse than some other pair, or, for that matter, that they are comparable at all." Later on, van Rijsbergen [94, p. 33] further stressed this issue: "There is no empirical ordering of retrieval effectiveness and therefore any measure of retrieval effectiveness will be by necessity artificial".
van Rijsbergen addressed this issue by exploiting additive conjoint measurement [53,58]. Additive conjoint measurement was a new part of measurement theory developed as a reaction to the views of Campbell [17,18] and the conclusions of the Ferguson Committee of the British Association for the Advancement of Science [30], of which Campbell was an influential member, which considered the additive property, i.e. the concatenation operation mentioned in Section 2.1, as fundamental to science and proper measurement; as a consequence, measurement of psychological attributes, which lack such an additive property, was deemed not possible in a proper scientific way. As explained by Michel [63, p. 67]: "Conjoint measurement involves a situation in which two variables (A and B) are noninteractively [e.g. non additively] related to a third (X). It is not required that any of the variables be already quantified, although it is necessary that the values of X be orderable, and that values of A and B be independently identifiable (at least at a classificatory level). Then, via the order on P, ordinal and additive relations on A, B, and X may be derived." Typical examples from physics are the momentum of an object, which is affected by its mass and velocity, or the density, which is affected by its mass and volume [53].
van Rijsbergen considered retrieval effectiveness as the "orderable X" mentioned above and took precision P and recall R as the two variables A and B. In particular, he demonstrated that on the relational structure (R × P, ⪰) it was possible to define an additive conjoint measurement and to derive actual measures of retrieval effectiveness from it. Note that, in this way, he avoided the need to explicitly define what an ordering by retrieval effectiveness is, and he considered that precision and recall contribute independently to retrieval effectiveness. The problem of how to order runs in the ERS has been addressed some years later by Ferrante et al. [31,32,33]. More subtly, van Rijsbergen treats precision and recall as two attributes which can be jointly exploited to order retrieval effectiveness but each of them is already a measure and quantification of retrieval effectiveness and, thus, this brings some circularity into the reasoning. Finally, van Rijsbergen did not address the problem of what the scale properties of precision and recall (or other evaluation measures) are, which has been later addressed by Ferrante et al.
Bollmann and Cherniavsky [13,14] built on the conjoint measurement work by van Rijsbergen and applied it to further study the conditions under which the MZ-metric [47] is appropriate. In particular, Bollmann and Cherniavsky leveraged what they called transformational viewpoints, i.e. elementary transformations of the runs which closely resemble the idea of swap and replacement used by Ferrante et al. much later on.
Bollmann [12] studied set-based measures by showing that measures complying with a monotonicity and an Archimedean axiom are a linear combination of the number of relevant retrieved documents and the number of not relevant not retrieved documents and how this could be related to collections and sub-collections. He thus addressed a problem somehow different from the one of the present paper, still leveraging the representational theory of measurement.
Busin and Mizzaro [16], Maddalena and Mizzaro [60] and Amigó and Mizzaro [5] proposed a unifying framework for ranking, classification, and clustering measures, which is rooted in the representational theory of measurement as well. They considered scales, but as a way of mapping between relevance judgements (assessor scales) and Retrieval Status Value (RSV) (system scales) and of introducing axioms over them, rather than as a way of studying which scales are actually used by IR evaluation measures and what their impact on actual experiments is.
As already discussed, Ferrante et al. [31] relied on the representational theory of measurement to formally study when evaluation measures are on an ordinal scale while Ferrante et al. [32,33] proposed a more general theory of evaluation measures, proving when they are on an interval scale or not. Finally, Ferrante et al. [34] conducted a preliminary experimental investigation of the effects of IR measures being interval scales or not.
Even if not specifically focused on scales and their relationship to IR evaluation measures, there is a body of research studying which constraints define the core properties of evaluation measures: Amigó et al. [6,7,8,9] and Sebastiani [81] address this issue from a formal and theoretical point of view, applying it to various tasks such as ranking, filtering, diversity and quantification, while Moffat [64] adopts a more numerical approach.
As it emerges from the above literature review, to the best of our knowledge, no one has dealt yet with the problem of considering the meaningfulness of IR experimental results and of transforming IR evaluation measures into interval scales.

Transforming IR measures to interval scales
Let (REL, ⪰) be a totally ordered set of relevance degrees, with a minimum called the non-relevant relevance degree nr = min(REL) and a maximum rr = max(REL); in the case of binary relevance, we set REL = {0, 1} without any loss of generality. Let N be the length of a run, i.e. the number of retrieved documents; we call judged run r̂ ∈ REL^N the vector of relevance degrees associated to each retrieved document, denoting by r̂[j] the j-th element of the vector.
Any IR evaluation measure M naturally defines an order among system runs. Indeed, taken any two runs r̂, ŝ ∈ REL^N, we order them as follows:

r̂ ⪰ ŝ ⟺ M(r̂) ≥ M(ŝ)    (1)

Note that this is a weak total order, since M(r̂) = M(ŝ) does not imply that r̂ = ŝ, and that it is the order called induced total order by Ferrante et al. [33]. It has the following characteristics, as discussed in the previous sections:
• it differs from measure to measure, i.e. each measure may produce a different RoS;
• it typically is not an interval scale, i.e. the produced values are not equi-spaced.
The basic idea of our approach is to keep the weak total order (1) produced by the measure M while making sure that all the possible values are equi-spaced.
The simplest way to achieve this result is to first define the nonlinear transformation ϕ from [0, 1] into N that maps each value m in the image of the measure M into its rank number. Then, we define the ranked version of the measure, i.e. the interval-scaled version of it, as M_R = ϕ(M). Note that this approach is in line with what was suggested by Gaito [41] to transform ordinal scales into interval ones.
Most of the measures are not one-to-one mappings and thus the cardinality of their image is strictly smaller than the cardinality of their domain, i.e. |M (REL N )| < |REL N |. The runs which are assigned the same value by the measure are called ties. As pointed out before, this is the reason why the order induced on REL N by a measure in general is just a weak total order.
The map ϕ is then defined, for any value m in the image M(REL^N), as the rank of m among the sorted distinct values of M(REL^N). M_R = ϕ(M) is an interval scale since the ranks are equi-spaced by construction; moreover, it preserves the RoS of M and thus it constitutes an interval-scaled version of it.
Finally, we have to deal with tied values in the measure. In statistics there are many ways of breaking ties [44] and the most common are: average/mid, min, or max rank. However, since all tied runs share the same measure value, we simply map all of them to the rank of that distinct value, which preserves the weak total order (1). For RBP with p = 1/2 we have that ϕ(m) = 2^N · m, while for RR we have that ϕ(m) = N + 1 − 1/m. However, in general, the function ϕ does not have any analytical expression; it is nonlinear and it varies from measure to measure.
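A minimal sketch of this rank transformation for binary relevance (function names are ours), checking the closed form ϕ(m) = 2^N · m for RBP with p = 1/2:

```python
from fractions import Fraction
from itertools import product

def rbp(run, p=Fraction(1, 2)):
    """Rank-Biased Precision: (1 - p) * sum over positions i of rel_i * p^(i-1)."""
    return (1 - p) * sum(rel * p**i for i, rel in enumerate(run))

def ranked_version(measure, N):
    """Map each achievable value of `measure` on binary runs of length N
    to its rank among the sorted distinct values."""
    values = sorted({measure(run) for run in product((0, 1), repeat=N)})
    return {m: rank for rank, m in enumerate(values)}

N = 4
phi = ranked_version(rbp, N)
assert len(phi) == 2**N                      # RBP with p = 1/2 has no ties
assert all(phi[m] == 2**N * m for m in phi)  # closed form phi(m) = 2^N * m
```

Since ϕ only depends on the sorted distinct values of the image, the same helper works for any measure; for measures with ties (e.g. Precision), all tied runs are mapped to the same rank.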

Runs of different length
When working with runs all with the same length, the proposed approach maps a measure into a proper interval scale, actually the same scale for all the runs, and this allows us to compute aggregates across runs with the same length.
If we work with runs of different lengths, the proposed approach maps each length into a proper interval scale, but the scale differs from length to length. For example, in the case of DCG with log2 there are 24 distinct values for N = 5, 768 for N = 10, 24,576 for N = 15, and so on; each of them corresponds to a ranked measure (interval scale) with a different number of steps. As a consequence, even using our approach, we cannot aggregate across runs with different lengths.
However, as already discussed, this is a problem easily manageable in practice. Indeed, for small run lengths or low cut-offs of typical interest, such as the top 10 documents, it is reasonable to assume that the runs have all the same length, since runs are usually able to retrieve enough documents. In the more general case and for bigger run lengths, if a run does not retrieve enough documents, it could be padded with not relevant documents. Therefore, we can consider our approach as generally applicable with respect to this issue.

Different topics
Let us now assume that we have fixed a run length which is the same for all the runs and which allows us to compute aggregates across runs. What happens if we need to compute aggregates across topics?

Measures not depending on the recall base
In the case of measures not depending on the recall base, since the length of the run is the same for all the runs across all the topics, our approach maps a measure into the same interval scale for all the runs and all the topics. Therefore, we can safely compute aggregates also across topics.

Measures depending on the recall base
In the case of the measures depending on the recall base, as already explained, due to the recall base changing from topic to topic, there does not exist a single (interval) scale which can be used across all the topics. As a consequence, our approach cannot be directly applied. However, we can use it as a surrogate that brings, at least, some more "intervalness" to a measure.
Indeed, on each single topic, our approach maps a measure depending on the recall base into a proper interval scale, whose steps are equi-spaced. When we deal with two (or more) topics, we would need to find an interval scale where it is possible to match the steps from the (two) scales of each topic into some "bigger" set of equi-spaced steps, which can accommodate all of them. However, as shown in Example 11 and in Section 3.4, this common super-set of steps does not exist, except in trivial cases.
Therefore, as an approximation, we can pretend that the scale for each topic is the overall common scale - and, as said above, this is exactly what happens in the case of measures not depending on the recall base - and use it across topics, even if this will actually stretch the steps of each topic. Consider, for example, four runs r1, r2, r3 and r4 retrieving 0, 1, 1, and 2 relevant documents, respectively. If the recall base for the first topic q1 is RB1 = 2, these runs are mapped to the Recall values {0, 1/2, 1/2, 1}; if the recall base for the second topic q2 is RB2 = 3, these runs are mapped to the Recall values {0, 1/3, 1/3, 2/3}. Our transformation approach maps the runs of both topics to {1, 2, 2, 3}, which is a proper interval scale on each topic separately. However, if we look at the two topics together and we use this mapped scale, we are slightly stretching the steps of the original scales. For example, if we compute the difference between r3 and r4, on this mapped scale it is the same, i.e. 1, for both q1 and q2, while on the original scales it is 1/2 for q1 and 1/3 for q2. This means that, when the recall base is involved, our transformation preserves the order on each topic but only approximates a common interval scale across topics.
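The example above corresponds to a dense ranking of the distinct Recall values on each topic; a minimal sketch (helper name is ours):

```python
from fractions import Fraction

def dense_ranks(values):
    """Map each value to its 1-based rank among the sorted distinct values."""
    distinct = sorted(set(values))
    return [distinct.index(v) + 1 for v in values]

relevant_retrieved = [0, 1, 1, 2]  # runs r1..r4

recall_q1 = [Fraction(r, 2) for r in relevant_retrieved]  # RB1 = 2: 0, 1/2, 1/2, 1
recall_q2 = [Fraction(r, 3) for r in relevant_retrieved]  # RB2 = 3: 0, 1/3, 1/3, 2/3

# Both topics are mapped to the same ranks, although their original steps differ.
assert dense_ranks(recall_q1) == dense_ranks(recall_q2) == [1, 2, 2, 3]
```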
Experimental Setup

We considered the following datasets:

• Adhoc track T08 [101]: 528,155 documents of the TIPSTER disks 4-5 corpus minus congressional record; T08 provides 50 topics, each with binary relevance judgments and a pool depth of 100; 129 system runs retrieving 1,000 documents for each topic were submitted to T08.
• Common Core track T26 [3]: 1,855,658 documents of the New York Times corpus; T26 provides 50 topics, each with multi-graded relevance judgments (not relevant, relevant, highly relevant); relevance judgements were done mixing depth-10 pools with multi-armed bandit approaches [57,100]; 75 system runs retrieving 10,000 documents for each topic were submitted to T26.
• Common Core track T27 [4]: 595,037 documents of the Washington Post corpus; 50 topics, each with multi-graded relevance judgments (not relevant, relevant, highly relevant); relevance judgements were done adding stratified sampling [22] and move-to-front [24] approaches to the T26 procedure; 72 system runs retrieving 10,000 documents for each topic were submitted to T27.
In the case of the T26 and T27 tracks, we mapped their multi-graded relevance judgements to binary ones using a lenient strategy, i.e. anything above not relevant is considered relevant 7.
For each track we experimented with the following run lengths N ∈ {5, 10, 20, 30}, i.e. we cut runs at the top-N retrieved documents. In terms of our transformation methodology, this means considering a space of possible runs containing, roughly, {32, 10^3, 10^6, 10^9} runs, respectively 8. We indicate the run length in the identifier of the track, e.g. T08_10 means T08 runs cut down at length 10.
In significance tests, we used a confidence level α = 0.05. To ease the reproducibility of the experiments, all the source code needed to run them is available in the following repository: https://bitbucket.org/frrncl/tois2021-fff-code/src/master/.

Experiments
In Section 7.1 we validate our approach and answer the research question "How far is a measure from being an interval scale?". Then, in the next two sections, we investigate the effects of using (or not) a proper interval scale in IR experiments. In particular, in Section 7.2 we study how this affects the correlation among evaluation measures, i.e. the main tool we use to determine how close two measures are. In Section 7.3 we analyse how this impacts the significance tests, both parametric and non-parametric, i.e. the main tool we use to determine when IR systems are significantly different or not.
The following sections report, separately, the case of measures not depending on the recall base, namely RBP, DCG, RR, and P, and the case of measures depending on the recall base, namely AP, nDCG, and R. Indeed, as previously explained, once a run length is fixed, in the former case it is possible to find an overall interval scale, which is the same across all the topics, and apply our transformation approach in an exact way; in the latter case, an overall interval scale common to all the topics does not exist and our transformation is just the best available surrogate.

Correlation between measures and their ranked version. How far is a measure from being an interval scale?
In this section we study the relationship between each measure and its ranked version, i.e. its mapping towards an interval scale, as explained in Section 5. This analysis allows us: 1) to validate our approach, verifying that it produces the expected results; 2) to understand how much a measure changes when it is transformed, seeking an explanation for this change; 3) to understand what happens when we apply our transformation approach in a "surrogate" way in the case of measures depending on the recall base. We compute both the overall and the topic-by-topic Kendall's τ correlation between each measure M and its ranked version 9 . Ferro [35] has shown that, even if the absolute correlation values are different, removing or not the lower quartile runs produces the same ranking of measures in terms of correlation; similarly, he has shown that both τ and AP correlation τ ap [105] produce the same ranking of measures in terms of correlation. Therefore, we focus only on Kendall's τ without removing lower quartile systems.
As explained in Section 3.3, the topic-by-topic correlation is expected to be always 1.0, since the ranked version of a measure preserves the same order of runs on each topic by construction. As a sanity check, we verified that the topic-by-topic correlation is indeed 1.0 in all the cases and we do not report it in the following tables for space reasons. By contrast, the overall correlation, i.e. the traditional one, can differ from 1.0 for the reasons discussed above: the preliminary average operation would not be allowed in the case of an original measure which is not an interval scale, while it is allowed in the case of the corresponding ranked measure; and different recall bases across topics can lead to different scales which should not be averaged together.
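A minimal sketch of the two correlations (hypothetical scores; a pure-Python Kendall's τ-a keeps the example self-contained):

```python
from itertools import combinations
from statistics import mean

def kendall_tau(x, y):
    """Kendall's tau-a between two paired score lists."""
    pairs = list(combinations(range(len(x)), 2))
    conc = sum((x[i] - x[j]) * (y[i] - y[j]) > 0 for i, j in pairs)
    disc = sum((x[i] - x[j]) * (y[i] - y[j]) < 0 for i, j in pairs)
    return (conc - disc) / len(pairs)

def dense_ranks(col):
    """Replace each value by its rank among the sorted distinct values."""
    distinct = sorted(set(col))
    return [distinct.index(v) for v in col]

# Hypothetical scores of four systems (rows) on two topics (columns) for a measure M.
M = [[0.00, 0.90],   # system A
     [0.01, 0.00],   # system B
     [0.02, 0.50],   # system C
     [0.90, 0.60]]   # system D

topics = list(zip(*M))                        # per-topic score columns
ranked = [dense_ranks(col) for col in topics] # ranked version, per topic
M_R = list(zip(*ranked))                      # back to per-system rows

# Topic-by-topic the order of the systems is preserved by construction ...
assert all(kendall_tau(col, dense_ranks(col)) == 1.0 for col in topics)

# ... but the overall correlation, computed on per-system averages, can drop
# below 1.0 because the original values are not equi-spaced.
print(kendall_tau([mean(r) for r in M], [mean(r) for r in M_R]))  # 5/6 here
```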
In general, we can consider the overall Kendall's τ correlation between a measure and its ranked version as a proxy providing us with an estimation of how far a measure is from being a proper interval scale and, in the case of measures depending on the recall base, also from being safely averaged across topics. Note that this approach is in line with, and extends, what was proposed by Ferrante et al. [34] when they suggested to use the overall Kendall's τ correlation between a measure and the Set-Based Total Order (SBTO) and Rank-Based Total Order (RBTO) interval scales as an estimation of how much a measure is an interval scale. Table 2 summarizes the outcomes of the overall correlation analysis between measures and their ranked version, e.g. we computed the Kendall's τ correlation between DCG and DCG_R, the interval-scaled version of DCG according to the approach we described in Section 5.

8 To give an idea of the computational resources required, runs of length N = 30 mean an occupation of 2^30 * 30 * 8 = 240 GByte of memory, just for holding all the possible runs. A length N = 40 would mean 320 TByte of memory, which is not feasible in practice. The code is implemented in Matlab and thus we considered 8 bytes for representing a digit, since this is the size of a double. Even if we considered a more compact representation, in some other language like C, using just 1 bit per digit, it would have meant 40 TByte of memory for runs of length N = 40.

9 To avoid errors due to floating point computations, we rounded averages to 8 decimal digits.
From a very high-level glance at Table 2, we can see that the Kendall's τ correlation changes due to the transformation, but not excessively, which suggests that we are not running into any pathological situation.

Measures not depending on the recall base
As previously discussed, Precision is already an interval scale - a different scale for each run length but, once a length is fixed, the same scale for all the topics, allowing us to safely average across them. In this case, our transformation is just a mapping between interval scales, as the transformation between Celsius and Fahrenheit is. We can see that the overall Kendall's τ correlation is always 1.0, experimentally confirming the correctness of our transformation approach and that everything is working as expected.
The other case in which we see this happening is RBP p05, which we already know to be an interval scale, but different from the one of Precision.
On the other extreme there is RR, which is the farthest away from being an interval scale among the measures not depending on the recall base. We can observe that the overall Kendall's τ correlation is in the range 0.67 − 0.93 and it is systematically lower than the correlation of all the other measures in this group, namely P, RBP, and DCG. This suggests that transforming RR into an interval scale requires a more marked correction or, in other terms, that it experiences a drop in "intervalness" in the range 7% − 33%. We can also observe another stable pattern in the case of RR: the bigger the length of the run, the lower the correlation between RR and its ranked version. This suggests that RR departs more and more from the interval scale assumption as the run length increases; we will now see why this is happening. Figure 3 plots the values of P, RBP with p = 0.5, and RR for all the possible runs of a given length. On the X axis there are the runs, increasingly ordered by the value of the measure; this is the order of runs considered by the ranked version of the measure, which then just equi-spaces the values and removes ties. The Y axis reports the value of the measure for each run. The labels on the X axis report the fraction of runs up to that point; so, for example, in the case of P and N = 5, we can understand that 20% of the runs assume the value P = 0.2, 30% the value P = 0.4, 30% the value P = 0.6, and 20% the value P = 0.8; one run assumes the value 0.0 and one run the value 1.0.
We can observe that RBP with p = 0.5 produces distinct equi-spaced values for each of the possible runs or, in other terms, it produces equi-spaced clusters of values containing one single value in each cluster. In the case of P, we can see how the clusters are still equi-spaced but they contain tied values, visible as horizontal segments. Finally, in the case of RR, not only are the clusters not equi-spaced - and this breaks the interval scale assumption - but the gaps between them also grow more and more in one region of the range as the run length increases, making RR less and less interval. Moreover, Figure 3 visually shows us why different run lengths correspond to different scales - interval or not, depending on whether clusters are equi-spaced or not. In all the cases, the number of clusters increases as the length of the run increases and this makes the scales different. Note that this behaviour is not like getting a more and more accurate scale, which would be a desirable property; it is rather like saying that the height scale would become denser and denser as you measure taller and taller people. Therefore, as already said, we should avoid mixing values of measures coming from runs with different lengths, e.g. by averaging.
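The non-equispaced clusters of RR are easy to exhibit; the sketch below (assuming binary relevance; helper name is ours) lists the gaps between consecutive RR values for increasing run lengths:

```python
from fractions import Fraction

def rr_values(N):
    """Sorted distinct Reciprocal Rank values for binary runs of length N."""
    return [Fraction(0)] + [Fraction(1, k) for k in range(N, 0, -1)]

for N in (5, 10):
    vals = rr_values(N)
    gaps = [b - a for a, b in zip(vals, vals[1:])]
    print(N, gaps)  # for N = 5: [1/5, 1/20, 1/12, 1/6, 1/2] -- far from equi-spaced
```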
When it comes to RBP, we know from Ferrante et al. [33,32] that: RBP p05 is an interval scale; RBP p03 is an ordinal scale keeping the same ordering as RBP p05 but no longer being an interval scale; and RBP p08 uses a different ordering from RBP p05 and is not an interval scale either. This is also reflected in the overall correlation values. As already noted, and expected, the overall correlation for RBP p05 is always 1.0, while it drops to the range 0.95 − 0.97 for RBP p03. RBP p03 and RBP p05 order runs in the same way, which also means that their ranked version is the same. Therefore, the 3% − 5% difference between RBP p03 and RBP p05 depends only on the lack of equi-spacing of RBP p03 and the problems it causes when averaging. This also means that this drop in "intervalness" of RBP p03 is not the effect of a user model somehow different from the one of RBP p05, possibly resulting in a different order of the systems, which is instead the typical explanation provided in these cases. In the case of RBP p08 we observe a similar behaviour and the correlation drops to the range 0.93 − 0.96, with an "intervalness" loss in the range 4% − 7%.
When it comes to increasing run lengths, we can observe that the correlation values of RBP oscillate a bit; they tend to get more stable as the run length increases and this happens more for RBP p08 than for RBP p03. While this still might be partially due to the measure being more or less of an interval scale depending on the run length, we think that in the case of RBP this is mostly due to another reason. Indeed, as previously discussed, RBP does not use the full range [0, 1], since its maximum value over a run of length N is 1 − p^N; this shortfall impacts more as p increases and the length of the run gets smaller. Therefore, we think that the increase in the range of RBP explains the observed small changes in the correlation values. We can clearly see this behaviour in Figure 4 for RBP p08, whose values fall in the full range [0, 1] only for N = 20 and N = 30, while this effect is mostly negligible for RBP p03 and RBP p05. As a consequence, correlation values tend to get more stable for N = 20 and N = 30 in the case of RBP p08, while they are quite stable for RBP p03 and RBP p05, independently of the run length.
Finally, when it comes to DCG, we can observe from Table 2 that its overall correlation is above 0.9, with an "intervalness" loss in the range 2% − 9%, suggesting it departs only moderately from its ranked version. We can also observe for DCG b10 that the correlation for run lengths N = 5 and N = 10 is always 1.0, which may look surprising; this is actually an artefact of the log base 10, which causes the discount to be applied from the 11th rank onwards. Therefore, for run lengths up to the log base, DCG is basically counting the number of relevant retrieved documents, as is clear from Figure 5, and this produces the same interval scale as P. However, we should be aware of this somewhat unusual behaviour of DCG b10, because it changes from being an interval scale for runs up to length 10 to not being one anymore afterwards.
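This behaviour of DCG b10 is easy to verify directly. The sketch below (our own minimal DCG implementation, assuming the common discount max(1, log_b(rank)) over binary relevance) checks that, for runs no longer than the log base, DCG coincides with the count of relevant retrieved documents, while beyond rank 10 the discount starts to matter.

```python
import math
from itertools import product

def dcg(run, base=10):
    # DCG with discount max(1, log_base(rank)): the discount equals 1 up to
    # rank = base, so it only kicks in from rank base + 1 onwards
    return sum(rel / max(1.0, math.log(rank, base))
               for rank, rel in enumerate(run, start=1))

# For run lengths up to the log base, DCG b10 just counts relevant documents,
# i.e. it lives on the same interval scale as P
runs5 = list(product([0, 1], repeat=5))
print(all(dcg(run) == sum(run) for run in runs5))   # True

# Beyond rank 10 the discount matters and DCG drops below the plain count
run12 = [0] * 10 + [1, 1]
print(dcg(run12) < sum(run12))                      # True
```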
Moreover, as a general trend, DCG tends to be less and less an interval scale as the run length increases. If we look at the possible values of DCG in Figure 5, this may sound surprising, since DCG visually behaves in a very similar way to RBP p08, at least once the run length is big enough to compensate for possible effects of the log base itself. However, while RBP p08 does not have tied values, DCG exhibits an increasing number of tied values, clustered unevenly across the range - this is not visible from the figure, especially for DCG b02, due to the small size of the tied clusters, but we have verified it on the numerical data underlying the plot. Therefore, the increasing number of uneven tied clusters explains, as in the case of RR, why DCG is less and less an interval scale.

Measures depending on the recall base
Let us go back to Table 2 and consider the case of R. We know that Precision and Recall, on each topic separately, are already interval scales and just transformations of the same interval scale. Therefore, when we map them to their ranked version, it is actually the same interval scale for both of them and it is yet another transformation of their common original interval scale. However, while this means an overall correlation of 1.0 in the case of Precision, it drops to the range 0.7 − 0.9 in the case of Recall. This 10% − 30% loss in "intervalness" is entirely due to the effect of the recall base and lets us understand how careful we should be before averaging across topics. On the other hand, nDCG exhibits overall correlation values very close to those of DCG, all above 0.9, with an "intervalness" loss in the range 2% − 10%. We observe another somewhat surprising behaviour of nDCG b10: for runs of length N = 5 the correlation is always 1.0, indicating that it is an interval scale and, most of all, that there is no effect of the recall base. The fact is that on all the tracks under examination, all the topics have at least 5 relevant documents, so the recall base is never below 5; when you trim runs to length N = 5, the DCG b10 of the ideal run, i.e. the factor used to normalize DCG in nDCG, is constantly equal to 5 for all the topics and so, for this reason, there is no recall base effect. On the other hand, there is 1 topic with less than 10 relevant documents on both T08 and T26 and there are 4 such topics on T27. As a consequence, nDCG b10 drops slightly below 1.0 on T08 10 and T26 10 and a bit more on T27 10. This further stresses the need to be careful about, or at least aware of, the fact that DCG/nDCG may change behaviour and nature for document cut-offs below the log base. Moreover, this gives us an idea of how much impact even very small changes in the recall base can have and how careful we should be when aggregating across topics.
Finally, both DCG and nDCG are mapped to the same ranked measure, exactly as P and R are mapped to the same ranked measure. However, the loss of "intervalness" of R is much bigger than the one of nDCG. We hypothesise that this is due to how the recall base is accounted for in the measure: in the case of R it is a straight division by the recall base itself, while in the case of nDCG it is a division by the DCG of the ideal run, which is also one of the possible runs considered in the mapping. The latter is a much smoother normalisation than just dividing by an integer representing the total number of relevant documents. The behaviour of AP, very close to that of R, supports this intuition, since AP also adopts a straight division by the recall base.
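A tiny, deliberately artificial example (hypothetical numbers, not taken from the tracks above) makes the recall base effect concrete: on each topic, P and R induce the same ordering of runs, yet averaging across two topics with very different recall bases can reverse which system looks better.

```python
from fractions import Fraction as F

N = 5
recall_base = {"topicA": 1, "topicB": 10}       # hypothetical recall bases
rel_ret = {                                     # relevant retrieved per (system, topic)
    "sys1": {"topicA": 1, "topicB": 1},
    "sys2": {"topicA": 0, "topicB": 3},
}

def mean_p(sys):
    # mean Precision across topics: rel / N on each topic
    return sum(F(rel, N) for rel in rel_ret[sys].values()) / len(recall_base)

def mean_r(sys):
    # mean Recall across topics: rel / recall base on each topic
    return sum(F(rel, recall_base[t]) for t, rel in rel_ret[sys].items()) / len(recall_base)

# Per topic, P and R order runs identically (both divide rel by a constant),
# but averaging across topics with different recall bases flips the outcome:
print(mean_p("sys1") < mean_p("sys2"))   # True: sys2 wins on mean Precision
print(mean_r("sys1") > mean_r("sys2"))   # True: sys1 wins on mean Recall
```

The flip happens because topicA, with a recall base of 1, dominates the Recall average: each topic effectively contributes on a differently stretched scale.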

Impact of the tie breaking strategy
In this section, we perform a further validation of our mapping approach. As explained in Section 5, evaluation measures often produce tied values and we remove these tied values by assigning them their unique rank position, since this ensures that values are kept equi-spaced. However, as pointed out by Gibbons and Chakraborti [44], there are many other common ways of breaking ties, one of which is the mid-rank strategy, i.e. keeping the average of the ranks of the tied values. Table 3 (Kendall's τ overall correlation analysis between each measure and its respective ranked version, using the mid-rank tie breaking approach) shows what happens to our transformation approach when using the mid-rank tie breaking instead of the unq one used in Table 2. Let us consider Precision, whose overall correlation values drop from 1.0 to the range 0.90 − 0.97. Since Precision is already an interval scale, this drop is entirely due to the fact that the mid-rank tie breaking strategy produces a scale whose values are not equi-spaced anymore and thus a scale which is not interval anymore. Only RBP p05 keeps an overall correlation of 1.0: it is already an interval scale and it does not have any tied value, so it is insensitive to the tie breaking strategy. As a general trend, we can see that the overall correlation values in Table 3 are lower than those in Table 2, due to the loss of "intervalness" caused by the tie breaking strategy. Therefore, we validated that the appropriate way of implementing our transformation approach is by using the unq tie breaking strategy. Moreover, this further stresses how much the lack of equi-spaced values, whatever its cause, impacts our measurement process.
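The difference between the two strategies can be sketched as follows, using Precision over all binary runs of length 3 (unq_ranks and mid_ranks are our own minimal implementations): the unq strategy yields equi-spaced values by construction, while the mid-rank one does not.

```python
from itertools import product
from fractions import Fraction

def precision(run):
    return Fraction(sum(run), len(run))

def unq_ranks(values):
    # 'unq': each distinct value gets its own rank 1..K after collapsing ties;
    # the resulting scale is equi-spaced by construction
    rank_of = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [rank_of[v] for v in values]

def mid_ranks(values):
    # 'mid-rank': each tied group gets the average of the positions it
    # occupies in the sorted list of all values
    ordered = sorted(values)
    first, last = {}, {}
    for pos, v in enumerate(ordered, start=1):
        first.setdefault(v, pos)
        last[v] = pos
    return [(first[v] + last[v]) / 2 for v in values]

def gaps(values):
    distinct = sorted(set(values))
    return {b - a for a, b in zip(distinct, distinct[1:])}

runs = list(product([0, 1], repeat=3))
vals = [precision(run) for run in runs]

print(sorted(set(unq_ranks(vals))), sorted(gaps(unq_ranks(vals))))   # [1, 2, 3, 4] [1]
print(sorted(set(mid_ranks(vals))), sorted(gaps(mid_ranks(vals))))   # [1.0, 3.0, 6.0, 8.0] [2.0, 3.0]
```

With unequal cluster sizes (here: 1, 3, 3, and 1 runs per Precision value), the mid-rank positions 1, 3, 6, 8 are not equi-spaced, which is exactly the loss of "intervalness" observed in Table 3.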

Correlation among measures and among their ranked versions. Unveiling the "true" correlation among evaluation measures
Table 4 summarizes the outcomes of the correlation analysis among measures and among their ranked versions, i.e. on the one side we compute Kendall's τ overall correlation among all pairs of measures, on the other side we compute Kendall's τ overall correlation among the same pairs of ranked measures.
In this way, we can study whether and how the estimated relationship among measures changes when passing to their ranked versions or, in other terms, to what extent not being an interval scale biases our estimations. In particular, the column ∆% reports the percent increase/decrease of the correlation between the ranked measures (labelled RnkMsr) with respect to the correlation between the original measures (labelled Msr), i.e. how much the correlation between two measures is underestimated/overestimated due to the fact that a measure is not an interval scale. Table 4 reports results for the T08 30, T26 30, and T27 30 tracks; results for the other tracks are similar but not shown here for space reasons. We can observe from Table 4, as very coarse and general trends, that correlation is overestimated (negative ∆% column) in the range [−33.30%, −0.16%], i.e. two evaluation measures are less close to each other than what we would be induced to think; conversely, correlation is underestimated (positive ∆% column) in the range [0.4%, 25.82%], i.e. two evaluation measures are closer to each other than what we would be induced to think. This observation opens up a relevant question for IR experiments: are IR measures really that different? Do we need all of them? Are we really scoring runs according to different user viewpoints or are these differences just an artefact of violating the scale assumptions? How much of what is reported in the literature is due just to this scale violation bias?
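To see how such over/underestimation can arise even when per-topic orderings agree, consider the following toy sketch (hypothetical RR scores on two topics, not data from Table 4): averaging the raw RR values versus the interval-scaled ones can reverse the ordering of two systems, and any correlation computed on the resulting Rankings of Systems changes accordingly.

```python
from fractions import Fraction as F

# unique-rank (interval) mapping for RR over runs of length N = 4:
# the attainable values 0 < 1/4 < 1/3 < 1/2 < 1 get ranks 1..5
rr_rank = {F(0): 1, F(1, 4): 2, F(1, 3): 3, F(1, 2): 4, F(1): 5}

rr_scores = {                       # hypothetical per-topic RR scores on 2 topics
    "sys1": [F(1), F(0)],
    "sys2": [F(1, 3), F(1, 2)],
}

def mean(values):
    return sum(values) / len(values)

raw = {s: mean(v) for s, v in rr_scores.items()}
rnk = {s: mean([rr_rank[x] for x in v]) for s, v in rr_scores.items()}

# The two averages disagree on which system is better, so the Ranking of
# Systems - and hence Kendall's tau against any other measure - changes
print(raw["sys1"] > raw["sys2"])    # True: sys1 wins on averaged raw RR
print(rnk["sys1"] < rnk["sys2"])    # True: sys2 wins on interval-scaled RR
```

The reversal comes entirely from the uneven spacing of RR values (the jump from 1/2 to 1 is huge on the raw scale, a single step on the interval one), not from any difference in how each topic orders the runs.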
In the following sections we discuss a few examples from Table 4 of how the correlation may change.

Measures not depending on the recall base
Let us start with the correlation between Precision and RBP with p = 0.5. We already know that they are interval scales and, therefore, their ranked version is just another mapping of their respective interval scales - and this is why in Table 2 their overall correlation is 1.0. We can observe from Table 4 that on T08 30 the correlation between RBP p05 and P is 0.7858 and, as expected, the correlation between their ranked versions is the same, since the interval scale behind the original measures and their ranked versions is the same. The same happens for the other tracks, i.e. T26 30 and T27 30.
Note that "the correlation between Precision and RBP with p = 0.5 on T08 30 is 0.7858" is an example of a meaningful statement in IR, since it is invariant to a permissible transformation of the interval scales of these two measures and it does not change its truth value.
The correlation between RBP with p = 0.3 and RBP with p = 0.5 is 0.9498, while the correlation between their ranked versions is 1.0. As we discussed in Section 7.1, RBP p03 and RBP p05 order runs in the same way and so their correlation should be 1.0. Therefore, this 5.3% underestimation of the similarity between them is just due to RBP p03 not being an interval scale. Note that this case is somehow particularly severe, since it induces us to attribute this 5.3% change to other reasons; typical explanations for such changes found in studies about correlation among evaluation measures are: "the user model behind RBP p03 slightly differs from the one of RBP p05 since it represents a more impatient or less motivated user" or "due to the smaller value of p, RBP p03 is a slightly more top-heavy measure"; unfortunately, none of these explanations would be correct, since this 5.3% change is just due to the fact that the values of RBP p03 are not equi-spaced, while still ordering runs in exactly the same way as RBP p05.
For the sake of completeness, we can observe that 0.9498 is the same correlation value reported in Table 2 between RBP p03 and its ranked version. This is indeed correct since both RBP p03 and RBP p05 are mapped to the same ranked interval scale, which is just another mapping of the interval scale of RBP p05; therefore, the correlation between RBP p03 and its ranked version is the same as the correlation between RBP p03 and RBP p05.
Another interesting case is RR: its correlation with respect to P, RBP, and DCG is way overestimated - in the range 12% − 26% on T08 30, 26% − 33% on T26 30, and 11% − 19% on T27 30 - mistakenly suggesting that this measure is much closer to the others than it actually is. As emerges from the previous discussion, RR is one of the measures which departs most from being an interval scale and which also has the highest number of tied values. Therefore, the computation of averages on RR and on its ranked version leads to markedly different Rankings of Systems (RoS), as clearly shown in Table 2 when comparing RR to its ranked version. As a consequence, the correlation of the ranked version of RR with the other measures changes more than in the other cases.

Measures depending on the recall base
Before proceeding, a word of caution is needed: in the case of measures depending on the recall base, our approach is just a surrogate, which improves the "intervalness" of a measure but stretches the steps of the scale. Therefore, all the increases/decreases in correlations should be taken as tendencies towards overestimation/underestimation rather than exact quantifications of it.
Let us consider Precision and Recall: we know that on each topic they are the same interval scale and this is reflected in Table 4 in the correlation between their ranked versions being 1.0. On the other hand, the correlation between the original measures tends to be underestimated by 26% on T08 30, 12% on T26 30, and 22% on T27 30. Apart from suggesting that these two measures should be considered closer to each other than they usually are, this wide range of underestimation further stresses how much the recall base alone can affect averaging across topics and how careful we should be with such averages - if we do not avoid them altogether.
Consistently with what was discussed in Section 7.1.2, we can observe that the correlation between DCG and nDCG tends to be underestimated by just 1% − 4%, suggesting that they are practically equivalent. Therefore, even if nDCG is usually preferred over DCG because it is bounded and normalised, it could actually be better to use DCG instead, since it avoids issues with the recall base and it can be easily turned into a proper interval scale by using our transformation approach.
Let us now discuss AP with respect to Precision and Recall: the correlation with R is higher than the one with P and this is usually attributed to AP embedding the recall base in the same way as R does. However, when we turn to the ranked measures, we can see how the correlation between AP and R and between AP and P is the same (for all the reasons already explained) and, especially, how it tends to be underestimated in the range 1% − 9%, suggesting that AP is slightly closer to these two measures than usually thought.
Finally, let us consider AP with respect to DCG: their correlation tends to be underestimated in the range 9% − 16% and the correlation between their ranked versions is actually quite high, between 0.94 and 0.97. This suggests that, even if these two measures have quite different formulations and the user model of DCG is considered much more realistic than the somewhat artificial one of AP, when they are turned into their interval scale versions they are much closer than expected, and that part of their difference could have been due just to their lack of "intervalness".

Significance Testing Analysis. What systems are actually different, or not?
In this section, we analyse how the results of statistical significance tests change when using a measure or its ranked version. In other terms, we study how much statistical significance tests are impacted by using or not using a proper interval scale. Indeed, there are significance tests which assume to work with just an ordinal scale and others which assume to work with an interval scale, and they should be somehow affected by using a measure which matches their assumptions or not. By impacted, we mean that we can observe some change in which systems are considered significantly different or not. Moreover, as discussed in the previous sections, the recall base makes working across topics problematic at best, and statistical significance tests typically perform some aggregation across topics. Therefore, they may be further affected by the recall base.
As described in Section 7.3, we consider the following tests: Sign (ordinal scale assumption), Wilcoxon Rank Sum (ordinal scale assumption), Wilcoxon Signed Rank (interval scale assumption), Student's t (interval scale assumption), ANOVA (interval scale assumption), Kruskal-Wallis (ordinal scale assumption), and Friedman (ordinal scale assumption). For the ANOVA case we consider two alternatives:
• One-way ANOVA for the System Effect: y_ij = µ·· + α_j + ε_ij checks for the effect of α_j, j = 1, . . . , q different systems. It can be considered as an extension of the Student's t test to the comparison of multiple systems at the same time and it is the parametric counterpart of the Kruskal-Wallis test.
• Two-way ANOVA for the Topic and System Effects: a more accurate model y_ij = µ·· + τ_i + α_j + ε_ij which accounts also for the effect of τ_i, i = 1, . . . , p topics, thus improving the estimation of the system effect as well. Note that this is the ANOVA model adopted by Tague-Sutcliffe and Blustein [88] and Banks et al. [11] when analysing TREC data. It is the parametric counterpart of the Friedman test.
In the case of the ANOVA, Kruskal-Wallis, and Friedman tests we performed a Tukey Honestly Significant Difference (HSD) adjustment for multiple comparisons [48,90]. Table 5 (measures not depending on the recall base) and Table 6 (measures depending on the recall base) show the results of the analyses in the case of the T08 30, T26 30, and T27 30 tracks. Results for the other tracks are similar but not shown here for space reasons. For each test, the tables report:
• Sig: the total number of significantly different system pairs using the original measure;
• S2NS: the number of pairs changed from significantly to not significantly different when passing from the original measure to its ranked version; within parentheses we report their ratio with respect to Sig;
• NS2S: the number of pairs changed from not significantly to significantly different when passing from the original measure to its ranked version; within parentheses we report their ratio with respect to Sig;
• ∆% = (S2NS + NS2S) / Sig: the ratio of the total number of pairs that changed significance when passing from the original measure to its ranked version.
In Tables 5 and 6 rows corresponding to significance tests based on an ordinal scale assumption are highlighted in grey.
In an ideal situation, an oracle would have told us which pairs of systems are significantly different and which are not, and this would have allowed us to exactly determine which pairs of systems were correctly detected by each measure and test. Unfortunately, this a priori knowledge is not available in practice. On the other hand, we are comparing a measure to its ranked version and we know that changes in the decision about what is significantly different and what is not are a consequence of the steps of the scale being rearranged from not equi-spaced and unevenly distributed across their range to equi-spaced and evenly distributed across their range. Therefore, we can interpret the S2NS count as a tendency to false positives, since it accounts for significantly different systems which are no longer significant when you remove the effect of uneven steps in the scale; in other terms, S2NS can be interpreted as a tendency of the ranked measure (the interval scale) to reduce Type I errors. Symmetrically, we can interpret the NS2S count as a tendency to false negatives, since it accounts for not significantly different systems which become significant when you remove the effect of uneven steps in the scale; in other terms, NS2S can be interpreted as a tendency of the ranked measure (the interval scale) to reduce Type II errors. Note that we are not claiming that an interval scale detects/removes false positives/negatives in any absolute sense; we are rather saying that, starting from whatever unknown level of false positives/negatives, we can interpret the S2NS and NS2S counts as a relative tendency to reduce them.
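Counting these quantities is straightforward; the sketch below (with made-up p-values and a hypothetical helper, just to fix the definitions) computes Sig, S2NS, NS2S, and ∆% from the significance decisions of the original and the ranked measure at α = 0.05.

```python
def significance_changes(p_orig, p_rank, alpha=0.05):
    # p_orig / p_rank: p-values for the same system pairs under the original
    # measure and under its ranked (interval-scaled) version
    sig = sum(p < alpha for p in p_orig)                         # Sig
    s2ns = sum(po < alpha <= pr for po, pr in zip(p_orig, p_rank))
    ns2s = sum(pr < alpha <= po for po, pr in zip(p_orig, p_rank))
    return sig, s2ns, ns2s, (s2ns + ns2s) / sig                  # last one: delta%

# made-up p-values for five system pairs
p_orig = [0.01, 0.03, 0.20, 0.04, 0.60]
p_rank = [0.02, 0.08, 0.01, 0.04, 0.70]
sig, s2ns, ns2s, delta = significance_changes(p_orig, p_rank)
print(sig, s2ns, ns2s)   # 3 1 1, hence delta = 2/3
```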
As a side note, not regarding measures being interval scales or not, in Tables 5 and 6 we can observe that, as expected, parametric significance tests are more powerful than non-parametric ones, since they discover more significantly different pairs. Moreover, we can also observe that the Sign, Wilcoxon Rank Sum, Wilcoxon Signed Rank, and Student's t tests find many more significantly different system pairs than the ANOVA, Kruskal-Wallis, and Friedman tests. This increase is not due to more powerful tests but rather to the increase in Type I errors caused by the lack of adjustment for multiple comparisons in the former tests. This further stresses the need to always adjust for multiple comparisons, as also pointed out by Fuhr [40] and Sakai [75].
We can observe from Tables 5 and 6, as very coarse and general trends, that the farther an evaluation measure is from being an interval scale, the stronger the changes in statistical significance tests based on an interval scale assumption, while those based on an ordinal scale assumption are not affected. On the other hand, the presence of the recall base generally affects both tests based on the ordinal scale assumption and those based on the interval scale assumption, with the latter being more affected.
In particular, we have found that:
• for measures not depending on the recall base and significance tests assuming an interval scale, we have an overall average increase in the S2NS count around 13% and in the NS2S count around 5%. This suggests that the major impact is on reducing Type I errors, while still improving Type II errors and making the tests more powerful;
• for measures depending on the recall base and significance tests assuming an ordinal scale, we have an overall average increase in the S2NS count around 2% and in the NS2S count around 4%. This suggests that there is a small reduction in Type I errors and some improvement in Type II errors, making the tests a bit more powerful;
• for measures depending on the recall base and significance tests assuming an interval scale, we have an overall average increase in the S2NS count around 10% and in the NS2S count around 45%. This suggests that there is a sizeable reduction in Type I errors and a quite substantial improvement in Type II errors, making the tests much more powerful.
In general, these results indicate that adopting a proper interval scale tends to reduce Type I errors and, when the situation gets more complicated because of the effect of the recall base across topics, it also brings substantially more power to the test. Overall, if we consider the grand mean across all the tracks, measures, and significance tests, we observe an overall change ∆% around 25% ± 11% in the decision about what is significantly different and what is not. Even without wishing to interpret it in terms of Type I or Type II errors, this figure lets us understand how big the impact of using an interval scale or not is, as well as the effect of the recall base.
As in the case of the correlation analysis, these observations open some questions about IR experimentation: since violating the scale assumptions has an impact on the number of significant/not-significant detected pairs and on Type I and Type II errors, when we compare systems and algorithms, how much of the observed differences is just due to the scale violation bias? How many false positives/negatives are we observing? How much have the findings in the literature been affected by these phenomena?

Measures not depending on the recall base
Let us start with Precision and RBP p05 in Table 5. As we already know, both of them are interval scales and, as expected, we do not observe any changes when using them or their ranked versions.
As in the case of the correlation, we can take them as an example of meaningful statements in IR, since a statement like "There are 1,923 significantly different system pairs for Precision according to one-way ANOVA on T08 30" does not change its truth value under a permissible transformation of the scale. As said, RBP p03 orders systems in the same way as RBP p05 but it is no longer an interval scale. Consistently with this, we can see how the significance tests assuming just an ordinal scale detect the same number of significantly different pairs for both RBP p03 and RBP p05. On the other hand, significance tests assuming an interval scale are affected by this difference between RBP p03 and RBP p05, causing an overall change ∆% in the range 2% − 29%. In particular, we observe a marked increase in the number of significantly different pairs (NS2S up to 28%), i.e. a reduction in the number of false negatives, and a very marginal decrease in the number of significantly different ones (S2NS around 1%), i.e. a reduction in the number of false positives. In the case of RBP p08 we can note a much more marked increase in the number of not significantly different pairs (S2NS up to 17%), i.e. a reduction in the number of false positives; on the other hand, the increase in the number of significantly different pairs (NS2S around 2%), i.e. the reduction in the number of false negatives, is more marginal.
Why do we observe such different behaviour between RBP p03 and RBP p08? If we look at Figure 4, we can see that RBP p03 condenses values at the top and the bottom of the range of possible values, in spans covering a very small range of values but containing the same number of runs. As a consequence, when the ranked version of RBP p03 equi-spaces these values, runs that before were very close, and possibly not significantly different (NS), become more distant in the ranked version, and possibly significantly different (S); this can explain why the NS2S case is more prominent for RBP p03. On the other hand, RBP p08 uses the whole range of possible values, but very few runs, roughly 20% packed at the bottom and at the top, cover almost 50% of the range of values, while the remaining 80% of the runs, in the middle part, cover the other 50% of the range. Therefore, when we pass to the ranked version of the measure, the few runs which were very distant, and possibly significantly different (S), become closer, and possibly not significantly different (NS); vice versa, many runs which were very close, and possibly not significantly different (NS), may become a little more distant, and possibly (but not necessarily) significantly different (S). As a consequence, the effect on S2NS is more prominent than the one on NS2S.
In the case of DCG we observe a behaviour similar to that of RBP p08, with the increase in S2NS, i.e. the reduction in false positives, being even more prominent. If we look at Figure 5, we can see how DCG is sharper than RBP p08 at the top and bottom of the range - less than 10% of the runs account for almost 50% of the range of values - making even fewer runs fall even farther apart.
Finally, RR exhibits both effects: a very remarkable increase in S2NS, i.e. a reduction in false positives, and a sizeable increase in NS2S, i.e. a reduction in false negatives, causing an overall change ∆% up to 72%. If we consider Figure 3, we can see how most of the runs, over 90%, are concentrated in just 4 possible values which are quite distant, possibly making them significantly different (S); when we move to the ranked version, these 4 values become much closer, possibly making the runs not significantly different (NS); this explains the big S2NS effect. Vice versa, few runs, less than 10%, account for just 20% of the range of values in the lower quartile; when we move to the ranked version, these values become more distant, possibly making the runs significantly different (S); since this change affects a smaller number of runs, this explains why NS2S tends to be more moderate with respect to S2NS.

Measures depending on the recall base
As in the case of the correlation among measures, a word of caution is needed: in the case of measures depending on the recall base, our approach is just a surrogate, which improves the "intervalness" of a measure but stretches the steps of the scale. Therefore, all the changes in the significantly different system pairs should be taken as tendencies rather than exact quantifications.
From a glance at Table 6 we can note that, in this case, also the significance tests assuming just an ordinal scale, with the exception of the Sign and Friedman tests, are affected by the transformation to an interval scale, with an overall change ∆% up to 51%. This further confirms that aggregating across topics when the recall base changes can cause variations which go well beyond the loss of "intervalness".
As another general trend, we can see that significance tests based on an interval scale assumption are generally more affected, since they experience both the violation of their assumptions and the effect of the recall base.
If we consider Recall, we can see how the most prominent effect is the underestimation of significant differences, with a very big increase in the number of significantly different pairs (NS2S), i.e. a reduction in the number of false negatives, up to a striking 558% for the one-way ANOVA. Considering that the interval scale behind Recall is the same as the one behind Precision, these figures tell us how big the loss of power for Recall is, mostly due to the impact of aggregating across topics with different recall bases.
In the case of nDCG we can observe a behaviour quite similar to that of DCG, with an overall change ∆% just a bit bigger than the one of DCG. Considering that both DCG and nDCG share the same interval scale, this further suggests using DCG to avoid the additional bias due to the recall base.

Conclusions and Future Work
We have addressed the problem that IR measures typically are not interval scales. This issue has severe consequences: you should neither compute means, variances, and confidence intervals nor perform statistical significance tests which assume an interval scale. We have provided a detailed discussion of the motivations and needs behind interval scales, both in the general field of the representational theory of measurement and in the IR context in particular, presenting viewpoints and opinions both supporting and opposing these two "prohibitions". It is a matter of fact that these two "prohibitions" have been constantly overlooked in the IR community. However, when improper methods are applied, the results should not be considered valid (according to general scientific standards), at least until the impact of these violations has been thoroughly investigated, which had not been the case in IR so far.
The main reason why IR measures are not interval scales is that their values are not equi-spaced. Therefore, we have proposed a straightforward yet powerful way to turn any measure into an interval scale by considering how all the possible runs are ranked by the measure and keeping the unique ranks, i.e. the ranks after removing tied values, as values of the mapped measure. These ranks are equi-spaced by construction and preserve the same order of runs as the original measure. In this way, we obtain an interval scale able to represent the order of runs produced by the original measure.
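The whole transformation can be sketched in a few lines (assuming binary relevance, with RR as the example measure; interval_version is our name for the mapping): enumerate all possible runs of a given length, collect the distinct measure values, and assign each value its unique rank.

```python
from itertools import product
from fractions import Fraction

def interval_version(measure, run_length):
    # enumerate all possible binary runs, collect the distinct values of the
    # measure, and map each value to its unique rank: the resulting values
    # are equi-spaced by construction and preserve the original ordering
    values = {measure(run) for run in product([0, 1], repeat=run_length)}
    return {v: rank for rank, v in enumerate(sorted(values), start=1)}

def reciprocal_rank(run):
    return next((Fraction(1, i + 1) for i, rel in enumerate(run) if rel), Fraction(0))

mapping = interval_version(reciprocal_rank, 5)
print(sorted(mapping.items()))
# the 6 attainable RR values 0 < 1/5 < 1/4 < 1/3 < 1/2 < 1 map to ranks 1..6
```

Evaluating a system then just means looking up the rank of its original score in this mapping; note that, as stressed above, the mapping is tied to one run length and, for recall-base-dependent measures, to one topic.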
We have also shown that the situation in IR is worsened by the fact that mixing runs of different lengths and different recall bases across topics actually means mixing different scales, whether interval or not. Therefore, computing aggregations across runs and topics in such a way can lead to invalid results. While the run-length issue can be mitigated by ensuring that all the runs have the same length, the recall-base issue is more problematic, since a single recall base cannot be forced on all the topics. This discourages the use of measures depending on the recall base.
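As an illustration of the recall-base issue (a toy example with hypothetical numbers, not taken from the paper's experiments), consider two topics with recall bases of 2 and 10: retrieving one additional relevant document moves Recall by 0.5 on the first topic but only by 0.1 on the second, so identical increments on the two topics correspond to different "amounts" of effectiveness, and averaging them mixes different scales.

```python
def recall(rel_retrieved, recall_base):
    """Recall = relevant documents retrieved / total relevant (the recall base)."""
    return rel_retrieved / recall_base

# Topic A: recall base 2; Topic B: recall base 10 (hypothetical values).
step_a = recall(2, 2) - recall(1, 2)    # one extra relevant doc moves Recall by 0.5
step_b = recall(2, 10) - recall(1, 10)  # the same event moves Recall by only 0.1
# The same retrieval event changes Recall by different amounts on different
# topics, so per-topic Recall values live on different scales.
print(step_a, step_b)
```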
Overall, this discussion led us to raise the fundamental point that IR should be more concerned with relying on meaningful statements, i.e. statements whose truth values do not change under legitimate transformations of the underlying scale, since such statements ensure more valid and generalisable inferences.
Relying on several TREC collections, we have conducted a thorough experimental evaluation of several popular state-of-the-art evaluation measures in order to assess the differences between using an evaluation measure and its interval-scaled version.
The correlation analysis has shown that the relationship between evaluation measures and their interval-scaled versions matches the expected theoretical properties and that not using an interval scale somewhat inflates the differences among evaluation measures. Notably, RR represents an exception, since its departure from being an interval scale makes it appear more similar to other measures than it actually is.
Most importantly, the correlation analysis provides us with a rough estimate of how close to an interval scale an evaluation measure is, and it represents the first attempt to quantify how much evaluation measures depart from their assumptions.
The analysis of many different types of statistical significance tests has clearly shown the impact of passing from an evaluation measure to its interval-scaled version. In particular, for measures not depending on the recall base, the transformation provides benefits in terms of a reduced Type I error and some increase in the power of the test, while for measures depending on the recall base, it produces sizeable improvements in terms of Type II error and power of the test, still delivering substantial enhancements in terms of Type I error. Even apart from any interpretation in terms of Type I and Type II errors, we observed an overall mean change of around 25% in the decision about which systems are significantly different and which are not.
Our results on both the correlation analysis and the statistical significance tests open the question of which claims and findings in the IR literature would be impacted by these differences or, in other terms, which statements made in IR so far would actually be meaningful.
The main limitation of the proposed approach is practical: first, one needs to generate all the possible runs of REL N, and then compute the evaluation measures on all of them. For increasing values of N, and even more so in the case of multi-graded relevance, this becomes practically infeasible. Therefore, our future work will concern approximating this generation process in order to make it possible to deal with runs of any length.
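To see why the enumeration becomes infeasible, note that with |REL| relevance degrees there are |REL|^N possible runs of length N. A rough sketch of this growth (illustrative numbers only):

```python
# Number of possible runs of length N with a given number of relevance degrees.
def num_runs(num_rel_degrees, n):
    return num_rel_degrees ** n

# Binary relevance: 2^N runs, already astronomical for realistic run lengths.
for n in (10, 100, 1000):
    print(n, num_runs(2, n))

# With 4 graded relevance levels and N = 100 there are already
# 4^100 (roughly 1.6e60) runs to generate and evaluate.
print(num_runs(4, 100))
```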