Generalised Score Distribution: A Two-Parameter Discrete Distribution Accurately Describing Responses from Quality of Experience Subjective Experiments

—Subjective responses from Multimedia Quality Assessment (MQA) experiments are conventionally analysed with methods not suitable for the data type these responses represent. Furthermore, obtaining subjective responses is resource intensive. A method allowing reuse of existing responses would be thus beneﬁcial. Applying improper data analysis methods leads to difﬁcult to interpret results. This encourages drawing erroneous conclusions. Building upon existing subjective responses is resource friendly and helps develop machine learning (ML) based visual quality predictors. We show that using a discrete model for analysis of responses from MQA subjective experiments is feasible. We indicate that our proposed Generalised Score Distribution (GSD) properly describes response distributions observed in typical MQA experiments. We highlight interpretability of GSD parameters and indicate that the GSD outperforms the approach based on sample empirical distribution when it comes to boot-strapping. We evidence that the GSD outcompetes the state-of-the-art model both in terms of goodness-of-ﬁt and bootstrapping capabilities. To do all of that we analyse more than one million subjective responses from more than 30 subjective experiments. Furthermore, we make the code implementing the GSD model and related analyses available through our GitHub repository: https://github.com/Qub3k/subjective-exp-consistency-check.


I. INTRODUCTION
T HERE are phenomena that require gathering opinions from a panel of people.One significant example here is the notion of Quality of Experience (QoE).Contrary to the Quality of Service (QoS), the QoE also depends on how a user of a system perceives its performance (with the word perceives being the most important here).Although technical factors do influence the QoE, ultimately, it is a subjective opinion of a user that represents the most direct indication of the QoE.(See Sec.2.2.2 of [1] for a formal definition of the QoE.) Multimedia Quality Assessment (MQA) is a sub-field of the QoE related research activities.It focuses on understanding how people perceive quality of multimedia content, effects of its processing and performance of multimedia services.It is a common and recommended [2][3] practice to organise experiments in which a panel of observers provides its opinion on quality of multimedia materials presented.We refer to such experiments as subjective experiments and to opinions provided by the panel of observers as subjective responses.Importantly, we narrow our discussion down to subjective experiments in which participants judge technical reproduction quality of stimuli presented only.In other words, we do not take into account subjective experiments, where observers voice their opinion regarding content of stimuli (e.g., plot of a story or artistic properties of an image).
Lack of access to ground truth information is an inherent feature of subjective experiments.Differently put, we observe subjective responses, but have no way of directly measuring quality of a given stimulus.One solution to this problem is gathering a large number of responses per stimulus.This way, we make sure any summary statistic we use to estimate stimulus quality is well reflecting population level opinion.More and more researchers follow this intuition and switch from small scale controlled experiments to large scale crowdsourcing experiments [4].Unfortunately, switching to crowdsourcing experiments usually corresponds to less precise measurements.On the other hand, organising large scale controlled subjective experiments is money-and time-intensive.For these reasons we want to learn as much as possible from limited information controlled subjective experiments provide.
To fully use information controlled subjective experiments provide we cannot rely on summary statistics only (from which the Mean Opinion Score or MOS is the most popular [2] [3]).Instead, we need to construct models that try to capture the underlying, unobservable structure of subjective responses and how this structure maps to quality.To construct such models we use various assumptions based on domain knowledge and experiences from previous subjective experiments.
There are better and worse models.Likewise, there are tools to assess how well a model performs.We claim that using models reflecting data type subjective responses represent is arXiv:2202.02177v1[cs.MM] 4 Feb 2022 a better approach than assuming that continuous models can be applied to discrete data.For one thing, models reflecting underlying data type generate interpretable results.Increased interpretability makes it easier to understand the result and thus protects against ill posed conclusions.

A. Problem Statement and Contributions
Subjective responses from Multimedia Quality Assessment (MQA) experiments are conventionally analysed with methods not suitable for data type these responses represent.In particular, continuous models are used even though subjective responses are discrete in most cases [5] [6].Furthermore, obtaining subjective responses is money-and time-consuming.A method allowing to reuse and build upon existing subjective responses would be thus beneficial.
Applying improper data analysis methods may lead to results that are difficult to interpret.This may result in erroneous conclusions.Liddell and Kruschke provide a convincing overview of mistakes that arise when analysing data using an improper model [7].One of our goals is to protect researchers analysing responses from MQA experiments against mistakes Liddell and Kruschke mention.
If it comes to building upon existing subjective responses, the approach is especially important if used to generate large samples from small real-life samples.This procedure is also referred to as bootstrapping.Properly applied bootstrapping gives a chance of producing sample sizes sufficient for developing machine learning (ML) visual quality predictors.Naturally, reusing existing subjective responses is also resource friendly.
We show that using a discrete model for data analysis of responses from MQA subjective experiments is feasible.We also present benefits stemming from this approach.Specifically, we indicate that our proposed Generalised Score Distribution (GSD) properly describes response distributions observed in typical MQA experiments.We also highlight interpretability of GSD parameters.This GSD feature makes it possible to easily describe and intuitively understand non-trivial dependencies between various response distributions.At last, we point out that the GSD outperforms the traditional approach based on a sample empirical distribution when it comes to bootstrapping.
Being more suitable for bootstrapping than sample empirical distribution the GSD can generate a large data set of responses by taking advantage of only a small data set of real-life responses.In turn, large data sets generated this way may give a chance to create next generation ML based perceptual visual quality predictors.This is because ML based solutions are data hungry by nature, and typical MQA experiments are capable of gathering only few dozens of responses per stimulus.Moreover, knowing a correct data model (which we show the GSD is for typical MQA experiments) allows proposing a parametric hypothesis testing framework.Using such a framework results in higher power (when compared to conventionally used here non-parametric methods) and thus allows to reduce costs related to organising subjective experiments.This is because more powerful statistical methods allow to detect smaller effect sizes, while keeping the sample size constant.Finally, interpretability of GSD parameters makes it easier to summarise subjective responses and perform nonmisleading intuitive inferences based on this summary.
With this work we put forward the following contributions: • We evidence that analysing subjective responses from MQA experiments with a discrete model (specifically, the GSD) is feasible and brings easy to interpret results.• At last, we demonstrate that the GSD outperforms the state-of-the-art model when it comes to bootstrapping and goodness-of-fit testing.The main idea of the paper is that we want to convince the MQA research community that using the GSD model to analyse subjective responses is better than following current recommendations and practices put forward in the literature.Specifically, we want to show that the GSD outperforms nondisrete models used in the literature and, when it comes to bootstrapping, that the GSD also performs better than the standard approach based on a sample empirical distribution.To make it easier for others to use our work, we invite everyone to visit our GitHub repository (https://github.com/Qub3k/subjective-exp-consistency-check).There, we provide a code allowing to analyse subjective responses with the use of the GSD model and to reproduce a significant part of results presented in this paper.

B. Related Works
There is a trend in the MQA community to favour response distribution analysis over relying on summary statistics (e.g., the MOS) only.An important recent contribution in this topic is the work by Seufert [8].There, he highlights fundamental advantages of considering response distributions over summary statistic-based evaluations.Hoßfeld et al. take this idea further and show how to approximate response distributions given a QoS-to-MOS mapping function [9].With our work we follow the trend of response distribution analysis.At the same time, we indicate that interpretable GSD model parameters can serve as summary statistics well describing underlying response distribution.
Modelling individual responses generation process is another important thread of MQA research focusing on response distribution analysis.The idea was first proposed by Janowski and Pinson and termed subject model 1 [10].Li and Bampis took on the approach and proposed an extended subject model [5].In their formulation of the model they considered subject bias, subject inconsistency and stimulus ambiguity.Reference [6] proposes an updated, simpler version of the same model.Authors of [6] convincingly show that their model addresses shortcomings of subjective data analysis methods put forward in several MQA-related ITU recommendations.Our work extends this arc of research.We model individual 1 The word subject refers to subjective experiment participant.
responses generation process with the Generalised Score Distribution (GSD) model.The model first came to light in [11], where we showed how it could be applied to check subjective responses consistency.
We are not the first ones to notice that subjective responses modelling approach should reflect data it operates on.Specifically, both [12] and [13] propose models taking into account ordinal nature of subjective responses coming from MQA experiments.
Considering the literature review presented, our work is novel in two respects.First, our proposed GSD model is the first two-parameter model to model the complete variance range observable for subjective responses expressed on an ordinal scale.Second, to the best of our knowledge, we are the first ones to show that our subjective responses generation process modelling approach outperforms the standard approach based on empirical distribution when it comes to bootstrapping.

II. METHODOLOGY
In this section we describe the methodology we use to substantiate the claims made in the introduction.Section II-A describes our idea of treating subjective responses from MQA experiments as realisations of a discrete random variable.Section II-B shows how we test the goodness-of-fit of the models we take into account.It also presents how we interpret resulting p-values.Section II-C highlights data sets we use to test the GSD on real data.Finally, Sec.II-D details the procedure we use to test GSD's performance when it comes to subjective responses bootstrapping.

A. Subjective Response as a Random Variable
We propose to think about responses from MQA subjective experiments as realisations of a discrete random variable U .Since we focus on responses expressed on the 5-level Absolute Category Rating (ACR) scale (cf.Sec.6.1 of [14]), U can take values from the {1, 2, 3, 4, 5} set.To make the distribution of U practically useful, we need to parametrise it.Our experiences show that distributions with one parameter do not properly fit real data.Thus, we focus on two-parameter distributions.The following shows a general formulation of such distributions where F () is a cumulative distribution function, λ is a parameter describing central tendency of the distribution and θ expresses distribution spread.Importantly, we assume that F () reflects response distribution of each stimulus in a subjective experiment.Per-stimulus values of λ and θ define the exact shape of F (). Now, there are at least two approaches to proposing the exact formulation of F ().The first one (which is more popular in the MQA literature) is to assume that subjective responses follow a continuous normal distribution.The second one (which we take in this paper) is to assume responses follow a discrete distribution and, more precisely, the Generalised Score Distribution (GSD). 2he approach assuming subjective responses follow continuous normal distribution is best described by introducing an intermediate continuous random variable O ∼ N (µ, σ 2 ), where µ describes the mean and σ 2 the variance of the normal distribution.Since U is discrete and O is continuous, we need to introduce a mapping between the two.In other words, O must be discretized and censored as follows for s = {2, 3, 4} and Such a construct (i.e., a thresholded cumulative normal distribution) is quite popular in latent variable analysis [15].Thus, we follow the appropriate nomenclature and refer to this model as Ordered Probit. 1) Generalised Score Distribution: Our approach to modelling subjective responses does not require any mapping between an intermediate random variable and U .This is because the GSD already is a discrete distribution.Thus, we can directly write U ∼ GSD(ψ, ρ), where ψ expresses so called true quality and ρ expresses responses spread.The true quality parameter ψ can be intuitively understood as a mean response for a given stimulus, if we were to ask for opinion the complete population of observers.Contrary to Ordered Probit's µ, ψ reflects the 5-level ACR scale and is bounded to the [1,5] range. 3The other GSD's parameter, ρ, is a linear function of V (U ) (i.e., variance of U ). Furthermore, ρ is bounded to the [0, 1] interval and expresses what portion of possible variance is present in realisations of U .Please note here that any discrete distribution with a limited domain (e.g., U ∼ F (λ, θ)) has its mean value E(U ) and variance V (U ) bounded (cf.Fig. 2).One more important property of ρ is that it represents responses confidence.In other words, it is inversely proportional to variance observed in responses (the higher the observed variance, the lower the value of ρ).Yet another way to put it is to say that the greater the value of ρ, the closer to ψ observed responses are.Importantly, the GSD is able to model the complete range of possible variances for a given M -point scale (with M ∈ N : M > 1).For more details regarding the GSD we refer the reader to [16].
To make GSD's description more concrete, let us take a look at its internal structure.We start by showing a more detailed form of the U ∼ GSD(ψ, ρ) expression: where expresses uncertainty regarding mean response represented by ψ. ψ is one constant number estimated for a stimulus of interest.Notice that ψ = E(U )., on the other hand, follows a distribution parameterised with a single parameter ρ.What is more, 's distribution satisfies the following two criteria: (i) its mean equals zero and (ii) its variance is a linear function of ρ.In Appendix A (see the supplemental material) we show the exact formulation of 's distribution.Here, we only mention that this distribution is a mixture of the following distributions: binomial, beta-binomial and one-or two-point distribution (whether one-or two-point distribution is used depends on the value of ψ).Importantly, we reparameterise the distributions in the mixture to make them satisfy the two criteria that 's distribution must follow.As a result, the reparameterised distributions in the mixture depend only on a single parameter ρ.Fig. 1 shows various realisations of the GSD for different values of ψ and ρ.Please notice how flexible the GSD is.For example, in Fig. 1c, for the case of ρ = 0.38, the GSD takes the form of a distribution with two modes (one mode at response 1 and another at response 5).Apart from this extreme example, GSD's shape follows common sense intuition regarding response distributions observed in typical MQA subjective experiments.

B. G-test and P-P Plot
In order to validate if a distribution (or a model) fits specific data we have to perform a two-step procedure.The first step is to estimate distribution parameters for a sample of interest.The second step is to test a null hypothesis stating that the sample truly comes from the assumed distribution (GSD or Ordered Probit in our case), given the parameters estimated in the first step.We use a standard likelihood ratio approach to test the goodness-of-fit (GoF) of the two models (the GSD and Ordered Probit).More precisely, we use the so called G-test of GoF (cf.Sec.14.3.4 of [17]).Because sample sizes we consider are predominantly small, we do not use the asymptotic distribution for calculating the p-value.On the contrary, we estimate the p-value utilising a bootstrapped version of the G-test (see Appendix B in the supplemental material).(For broader theoretical considerations on the topic please take a look at [18].) Since each MQA subjective experiment we analyse contains multiple stimuli, we need to perform the G-test multiple times (as many as there are stimuli in the experiment).The result of each G-test is a p-value.This means we get a vector of p-values for each experiment we take into account.To be able to efficiently draw conclusions regarding a vector of p-values we use p-value P-P plots (where P-P stands for probabilityprobability) [19].For a detailed discussion regarding p-value P-P plots for the GSD we refer the reader to [11].

C. Data Sets
To test the GSD in practice we make use of more than one million individual subjective responses (exactly 1 183 696).We take into account responses coming from 33 subjective experiments that assess quality (or other traits) of more than nine thousand stimuli (exactly 9 290).Table I presents an overview of data sets we use.Importantly, we classify data sets into three types: (i) typical, (ii) broadly understood and (iii) non-MQA.The types reflect how much a given data set follows best practices and recommendations regarding organising MQA experiments.Typical experiments tightly follow best practices and recommendations.Broadly understood experiments follow these best practices and recommendations generally, but deviate from them in some aspects.Finally, non-MQA experiments are not MQA experiments at all.We include these to check GSD's performance on data outside of GSD's intended application scope.Please note that one data set may correspond to multiple experiments (cf. the "No. of Exp." row of Table I).For example, the MM2 data set consists of 10 separate experiments.Thus, although we use data from 11 data sets, they amount to 33 experiments.
We do not provide detailed descriptions of the data sets here.Instead, in Table I we link to references describing each data set.The only exception to this rule is the NFLX data set.Since its description has not yet been published, we describe the data set briefly.
Experiments included in the NFLX data set investigated influence of per-scene quality changes on opinion of human observers.200 observers assessed quality of 320 stimuli. 4Ten seconds long video clips (without audio) were used as stimuli.The clips had a resolution of 1920x1080 pixels.Quality degradations were applied solely through video compression.However, since per-scene compression was used, quality switches occurred during playback as well.Contents spanned a wide range of categories and were taken from Netflix's catalogue.This made the experiments more ecologically valid, but also meant the clips could not be publicly shared.The clips were displayed on either a TV or a tablet (both with the native resolution of 1920x1080 pixels).What is more, some participants were asked to provide their opinion during video playback.They were encouraged to use a software slider displayed at the bottom of the screen.In total, four experiments were performed: (i) with the TV and the software slider, (ii) with the TV and without the slider, (iii) with the tablet and the slider and (iv) with the tablet and without the slider.Participants were recruited through a temporary working agency.Care was taken not to over-represent the 18 to 25 age group.All four experiments took place in a controlled environment and were generally following provisions of Rec.ITU-T P.913 [2].The experiments were performed in accordance with the Absolute Category Rating with Hidden Reference (ACR-HR) method (cf.Sec.7.2.2 of [2]).Thus, participants provided their responses using the 5-level ACR scale.

D. Bootstrapping
To compare GSD's generalisability to that of the empirical distribution (which is typically used for resampling), we introduce the following procedure.We start by generating M C  (e.g., M C = 10 000) bootstrap samples from the empirical probability mass function (EPMF) of the large sample.Importantly, we generate bootstrap samples with significantly fewer observations than those in the large sample (e.g., n = 24 observations in each bootstrap sample for N = 200 observations in the large sample).Now, we fit the GSD to each r-th bootstrap sample.This yields estimates of each response category probability (q r 1 , qr 2 , qr 3 , qr 4 , qr 5 ).We use those estimates to calculate the likelihood function for the large sample L r GSD .We repeat the procedure for each bootstrap sample, but this time using the EPMF of the bootstrap sample to find response category probability estimates (v r 1 , vr 2 , vr 3 , vr 4 , vr 5 ).Having the likelihoods for both the GSD (L r GSD ) and empirical distribution (L r e ) we introduce a statistic W r based on the quotient of the two values.In other words, we introduce a statistic based on the likelihood ratio: W r = ln (L r GSD /L r e ).Value of the quotient signifies which approach better describes the large sample.(Note that there are as many quotients as there are bootstrap samples.)Now, we use the quotients to estimate the probability p GSD that the GSD model-based estimates of response category probabilities in the large sample yield higher likelihood function value (L GSD ) than the likelihood function value we get if we use the EPMF-based estimates (L e ).We also do the same for the empirical distribution and estimate the probability that the EPMF-based estimates yield higher likelihood function value than that yielded by the GSD model-based estimates and denote this probability by p e .We calculate the 95% confidence interval for p GSD − p e = P (W r > 0) − P (W r < 0) and denote its lower (or left) bound as L and upper (or right) bound as R. If L > 0 then the GSD performs better than the empirical distribution.If R < 0 then the empirical distribution performs better.If [L, R] contains zero, there is no significant difference between the GSD and empirical distribution.We provide the precise description of the procedure given above in Appendix C (see the supplemental material).
Since we use the subsample to make inferences about the large sample, there is a risk of overfitting.Put differently, by fitting any model too precisely to the subsample we are at risk of finding model parameter estimates that are suboptimal from the point of view of the large sample.This is because the subsample represents only limited information about the large sample.Intuitively, we should not entirely trust the data we observe in the subsample.To address the issue we apply parameter estimation modification that prevents probability estimators we use from yielding response category probabilities equal to 0 (for any response category).In other words, we expect that, at the population level, there is no response category that would be assigned no observations (even if the estimation result for the subsample says something else).This results in modified estimation procedures for both the GSD and empirical distribution.The detailed estimation correction procedures we use are described in Appendix C-A (see the supplemental material).

III. RESULTS
We present here results reflecting our contributions mentioned in the introduction.Sec.III-A puts forward evidence supporting the claim that the GSD has easy to interpret parameters.Sec.III-B shows that the GSD well describes response distributions from typical MQA experiments.It also indicates that GSD does not perform well for atypical MQA and non-MQA experiments.Sec.III-C reveals that the GSD outperforms empirical distribution when it comes to subjective responses bootstrapping.At last, Sec.III-D evidences that the GSD outperforms the state-of-the-art model both in terms of goodness-of-fit testing and bootstrapping.

A. Interpretable Parameters
Fig. 2 presents how Ordered Probit model parameters map to the E(U ) and V (U ) space.In other words, the figure shows how parameters of the Ordered Probit model we use to describe observed data (cf.Fig. 2a and Fig. 2e) map to summary statistics computed directly on these observed data (Fig. 2b and Fig. 2f).Intuitively, Fig. 2a and Fig. 2e show us how the Ordered Probit model "sees" observed data.Fig. 2b and Fig. 2f show us how observed data actually look like in terms of two basic summary statistics (i.e., mean E(U ) and variance V (U )).Differently put, any point along any line in Fig. 2a or Fig. 2e corresponds to a fixed pair of Ordered Probit parameters.The Ordered Probit model with these parameters is then used to generate discrete responses (being realisations of random variable U ). Summary statistics (i.e., E(U ) and V (U )) computed on these generated responses yield a single point in Fig. 2b or Fig. 2f, respectively.(Note that these generated responses can be thought of as representing individual responses we observe in real subjective experiments.)Importantly, plots in Fig. 2 should be analysed in pairs, rowwise.Stated differently, the leftmost (red) line in Fig. 2a corresponds to the same data series as the leftmost (red) line in Fig. 2b.The same is true for Fig. 2e and Fig. 2f, and so on.When analysing Fig. 2, please also keep in mind that E(O) = µ and V (O) = σ 2 (cf.Sec.II-A for more context).
We want model parameters to accurately reflect phenomena occurring in observed data.For example, we naturally associate the µ parameter with the central tendency of observed data and the σ parameter with their variance.Thus, if we keep µ constant and increase the value of σ, we expect this should correspond to E(U ) staying constant and V (U ) to increase.However, this is not the case.Instead, keeping µ constant and increasing σ corresponds to changes in both E(U ) and V (U ).This can be observed by following same-coloured lines 5 in Fig. 2a and Fig. 2b.Specifically, let us take the leftmost (red) line from Fig. 2a.It corresponds to Ordered Probit's µ fixed at a value slightly larger than zero.Moving vertically upwards along this line, µ stays constant and σ increases.If we were to stop at various points along this line and generate discrete responses (being realisations of random variable U ) from the Ordered Probit model with µ and σ parameters fixed, we expect each such sample should have a constant and same expected value E(U ), but changing variance V (U ).The corresponding leftmost (red) line in Fig. 2b shows what E(U ) and V (U ) we actually observe when generating responses from the Ordered Probit model.As we can see, the samples generated do not have constant expected value.On the contrary, it changes in rather unexpected way, as we move along increasing values of σ.(The only exception to this rule is when µ = 3.)This property of Ordered Probit model parameters makes them counter-intuitive.Unfortunately, this is not the only limitation of Ordered Probit's parameterisation.Another one relates to how changes in µ correspond to changes in E(U ).Looking at Fig. 2e and Fig. 2f we see that the same range of µ values maps to different ranges of E(U ) values as the σ parameter changes.For example, let us compare the topmost pink curve (σ = 8) with the second topmost green one (2 < σ < 4).In Fig. 2e they both span the same range of µ values.However, in Fig. 2f, the pink curve corresponds to much narrower range of E(U ), when compared to the green curve.This leads us to another limitation of Ordered Probit's parameterisation.Although subjective responses we take into account here span the range from 1 to 5, the µ parameter takes values exceeding the 1-5 range.Practically speaking, although it is tempting to treat µ as an MOS-related measure, µ can and will exceed the 1-5 range (which the MOS never does).Thus, µ should not be intuitively interpreted as MOS counterpart for the Ordered Probit model.For completeness, we mention that the Ordered Probit model is able to describe the complete ghost-like area shown in Fig. 2b and Fig. 2f.However, this is only possible if we allow its parameters to change without bounds.In other words, when (µ, σ) ∈ (−∞, +∞)×(0, +∞).
The right hand side of Fig. 2 is GSD's counterpart of Ordered Probit plots on the left.Fig. 2c and Fig. 2g present GSD parameters space.Fig. 2d and Fig. 2h present the E(U ) and V (U ) space.(Note that there is an inverse relationship between ρ and V (U ).)As can be readily seen, the GSD does not suffer from problems inherent to the Ordered Probit model.In particular, keeping the ψ parameter constant and changing ρ parameter's value, we keep the E(U ) constant and vary V (U ) only.This means GSD's parameterisation allows for treating ψ as observed data's central tendency and ρ as a measure of their variability.Following same-coloured lines in Fig. 2c and Fig. 2d evidences how keeping ψ constant corresponds to constant E(U ).Please also notice that the same range of ψ values for different values of ρ, always corresponds to the same range of E(U ).We can take a curve of any colour from Fig. 2g and Fig. 2h, and see that it always spans the entire range of E(U ).Although the bumpy shape of multiple curves in Fig. 2h may seem counter-intuitive at first, it reflects an important property of ρ.The ρ parameter expresses what ratio of available variance for a given mean is present in observed data.Thus, to keep this ratio constant across different means, the curve has to follow the bottom part of the E(U ) and V (U ) space.Thanks to its properties, ρ = 0.5 means that we deal with data being at the midpoint between minimum and maximum possible variance.Finally, GSD parameters cover the entire space of E(U ) and V (U ), and do so staying in the well defined bounds.Specifically, (ψ, Practically speaking, ψ can be regarded as GSD's counterpart of the MOS.We should note here that both models share one limitation.Even when data variability related parameter (σ or ρ) stays constant and central tendency related parameter (µ or ψ) changes, V (U ) changes across different values of E(U ).Ideally, V (U ) should follow variability related parameter and stay constant across chaning E(U ).However, since we are dealing here with discrete, limited domain process (only values {1, 2, 3, 4, 5} can be observed), the mean is naturally coupled with variance.In other words, changes to the mean inherently influence variance.

B. Good Description of Typical MQA Experiments
Fig. 3a shows results of applying a bootstrapped G-test of goodness of fit to responses from typical MQA experiments, as modelled by the GSD or by the Ordered Probit model.If any of the two models truly reflects response distributions observed in real data, a related p-value histogram should resemble the uniform distribution (or any other nonincreasing distribution) in the region of low p-values (roughly between 0 and 0.2) [11].It is easy to see that the histogram for the Ordered Probit model does not resemble the uniform distribution.The most important indication of this fact is the height of the leftmost bar, which is significantly greater than that of the second leftmost bar.GSD's histogram does resemble the uniform distribution for the p-values region of interest.However, to decisively assess GSD's performance we need to resort to p-value P-P plot (cf.Fig. 4).Since all GSDrelated data points fall below the black diagonal line, we can safely say that results do not contradict the null hypothesis of the GSD truly reflecting response distributions observed in real data.In other words, the GSD well reflects response distributions observed in typical MQA experiments.The same is not true for the Ordered Probit model.This is indicated by all Ordered Probit related data points falling above the black diagonal line.Differently put, the Ordered Probit model is not properly reflecting response distributions observed in typical MQA experiments.
If we now also take into account MQA experiments that do not strictly follow international recommendations (let us call them broadly understood MQA experiments), we see that performance of the both models deteriorates (cf.Fig. 3b).The best indication of this is the height of the leftmost bar.On both histograms its height is significantly greater than the reference height corresponding to approximately 279 stimuli or 5% of all stimuli investigated.We do not show a related P-P plot since it simply reaffirms both models do not reflect response distributions observed in real data.
We also investigated how the GSD and Ordered Probit models would perform on a data set not related to MQA experiments.For this we chose two data sets popular in the movie recommendation systems research community: (i) MovieLens 1M [28] and (ii) Personality 2018 [29].Although the two data sets are outside of MQA, they use the 5-level Likert scale to collect subjective responses.Our hypothesis was that since the GSD performed well on MQA data using the 5-level Liker scale, then maybe it would perform well on these data sets as well.However, looking at Fig. 3c, we can clearly see that both the GSD and Ordered Probit models do not reflect response distributions observed in the data.In other words, neither the GSD nor Ordered Probit model well describe response distributions observed in data concerning movie recommendation systems.

C. Better Than Empirical Distribution
It is interesting to check whether the GSD brings any advantage if it comes to generalisability.We define generalisability as the ability of a model to capture large sample phenomena when observing only a subsample of the large sample.In particular, we would like to check whether the GSD better captures large sample's distribution shape in comparison to the empirical distribution of the subsample.Put differently, we would like to check whether the GSD is better suited for bootstrapping than is the empirical distribution.If this proves to be the case, then the GSD could be used for resampling.One important consequence of this would be a chance to build better machine learning (ML) models aimed at predicting subjectively perceived multimedia quality (which is a difficult, important and still open challenge).It is often the case in the field of Multimedia Quality Assessment (MQA) that only up to 30 responses per stimulus are available.If one wants to create an ML model this may prove insufficient and resampling must be applied to generate more responses per stimulus.Should the GSD prove to be better for bootstrapping than the empirical distribution (which is typically applied in this context), the GSD could be used to generate more reliable samples during resampling.
To test GSD's generalisability capabilities in practice we use data from four MQA studies: (i) MM2 [22], (ii) VQEG HDTV Phase I [21], (iii) NFLX (cf.Section II-C to learn more about this study) and (iv) ITERO [25].From these studies we extract responses for selected stimuli.More precisely, we select stimuli with at least 144 responses.This way we get 234 stimuli.The number of responses per stimulus spans from 144 to 228.There are only four unique numbers of responses per stimulus.Table II shows the four numbers of responses and the count of stimuli with a given number of responses.Furthermore, it shows from which study a given set of stimuli was taken.
The NFLX study contains responses given to stimuli displayed on one of the two display devices-tablet and TV.In principle, responses for different display devices shall be analysed separately.However, since the responses for the two devices are highly correlated and since the same visual content was presented to participants during the sessions with each device, we decide to include in this analysis the combined responses from the two display devices.
If it comes to the HDTV Phase I study we only focus on responses provided to the so called common set of stimuli.The stimuli from the common set were presented to participants in all the six experiments that were part of the HDTV Phase I study.Although the six experiments were conducted by different research teams and using different display devices, the experimenters declared that actions were taken to make the six experiments similar to each other.Specifically, all video stimuli were displayed with the same resolution and in a room conforming to guidelines of Rec.ITU-R BT.500-11.Following experimenters declaration we combine responses for the common set stimuli.That is, we treat the six experiments, with 24 participants each, as one large experiment with 144 participants.This way we end up with 24 stimuli (that many are in the common set), each having 144 responses.
The MM2 study is a set of ten experiments.Responses in the experiments were collected by six laboratory teams from four countries.Different subject pools and environments were used in each experiment.The common denominator of all the experiments was the same set of 60 audiovisual stimuli and very similar test procedure.According to the authors of [22] the experiments were highly repeatable.Thus, we combine responses from the ten experiments.This yields 213 responses (that many participants in total took part in the ten experiments) for each of the 60 audiovisual stimuli.
The ITERO study collected responses from 27 subjects, who rated the same set of 110 stimuli.The study was carried out by three research teams.The experiment design was atypical of how MQA experiments are usually carried out.Subjects were instructed to repeat the experiment ten times.In total, 110 stimuli were assigned 228 responses each (not all subjects repeated the experiment ten times).Although the subjects were allowed to repeat the experiment at their leisure and majority did not use the same display device, we combine the responses from the ten repetitions.In other words, we treat the responses as though they come from one large subjective experiment with 110 stimuli and 228 subjects (in which each subject rates the same set of 110 stimuli).
We use three small sample sizes, i.e., n = {12, 24, 50}.This way we can observe how the GSD performs (when compared to the empirical distribution) for different fractions of the large sample information available.Intuitively, we expect the empirical distribution's performance to improve as the small sample size increases.If the GSD proofs to perform differently than the empirical distribution we would observe how the increasing small sample size influences the difference between the two approaches.We emphasise here that the increasing small sample size always favours the empirical distribution.On the other hand, the performance of the GSD depends on how well it fits to the distribution of responses observed in the large sample.If the fit is good, the increasing small sample size also favours the GSD.If the fit is poor, the increasing small sample size does not necessarily improves GSD's performance.
Fig. 5a presents results of the analysis.They take the form of three histograms.These histograms visualise probability differences pGSD − pe for the three investigated small sample sizes (i.e., 12, 24 and 50).Now, greater probability mass to the right of 0 indicates that the GSD performs better than the empirical distribution.Greater probability mass to the left of 0 corresponds to the opposite situation, i.e., empirical distribution outperforms the GSD.To make the analysis easier, we show in the plot red hatched bars that indicate for how many stimuli the GSD outperforms the empirical distribution (the red hatched bar on the right) and for how many the empirical distribution performs better than the GSD (the red hatched bar on the left).Blue-coloured parts of the bars represent statistically insignificant probability differences.When assessing which approach performs better, these blue parts of the bars are discarded.
Clearly, the GSD outperforms empirical distribution for all three small sample sizes.The effect is most clearly visible for the small sample size of 12.As expected, as the size of a small sample grows, empirical distribution's performance improves.Nevertheless, even for as many as 50 observations per small sample (which rarely happens in typical MQA subjective experiments), the GSD still significantly outperforms the empirical distribution.
Results show that the GSD is a better choice than the empirical distribution (which is typically used in this context) when it comes to resampling of subjective responses from MQA studies.This opens up an opportunity to train better ML models for the MQA applications, without having to organise large subjective experiments (i.e., experiments with a large number of participants).This result is yet another indication of GSD's superiority over methods typically used for MQA data analysis.

D. Comparison With the State-of-the-Art Model
To evaluate GSD's performance against a state-of-the-art solution, we compare it with the fascinating model presented in [6].To make the description easier to comprehend, we refer to the model from [6] as Li2020 model.We selected the Li2020 model for the comparison, since as of the time of writing this, this is the most recent and, at the same time, the most popular model in the MQA community.It should be sufficient to say that the Li2020 model is undergoing the standardisation process.
The GSD operates only on responses given to a single stimulus.In other words, the GSD needs to know about only these subject responses that were assigned to a single stimulus of interest.The Li2020 model requires information regarding all responses of all subjects that scored the stimulus of interest.Differently put, even though we are interested in responses assigned to a single stimulus, to estimate model parameters, we need to know about all responses assigned by a given subject to all other stimuli in the experiment.Neither the bootstrapped G-test, nor the bootstrapping effectiveness test we use, satisfy Li2020 model's requirements.Both tests rely on the assumption that responses assigned to the single stimulus of interest are sufficient for the model.
Not to abandon the comparison between the GSD and Li2020 models completely, we simplify the Li2020 model.Specifically, we make it function in a similar manner to the GSD.Put differently, we make the Li2020 model only require responses assigned to a single stimulus of interest.This results in a model defined by a Gaussian probability density function (PDF) with its mean (µ) equal to sample mean (i.e., the MOS) and variance (σ 2 ) equal to the sample variance (s 2 ).(Note that "sample" means here a set of responses assigned to a single stimulus of interest.)In the following text, we refer to the modified Li2020 model as the Simplified Li2020 model or SLI for short.
After estimating Simplified Li2020 model's parameters, we end up with a continuous normal distribution N (MOS, s 2 ).Since real subjective responses take the form of discrete numbers ({1, 2, 3, 4, 5} in our case), we need to map from the continuous domain of the normal distribution to the 5-level scale of interest.For this, we proceed in the same manner as we do when fitting the Ordered Probit model to the data.Specifically, we apply equations ( 2), ( 3) and (4).
Although the Ordered Probit and Simplified Li2020 models look very similar, they are not identical.The key difference lies in model parameters estimation.The Simplified Li2020 model assumes observed subjective responses are realisations of a continuous random variable following the normal distribution.Importantly, realisations of such a random variable can take any value (from plus to minus infinity).Hence, observing values from the {1, 2, 3, 4, 5} set exclusively is, probabilistically speaking, very rare.The Simplified Li2020 model ignores this fact and fits the normal distribution to these data using sample mean and variance. 6Contrary to the Simplified Li2020 model, the Ordered Probit model does not assume observed subjective responses are realisations of a continuous random variable.More precisely, the continuous normal distribution present in the Ordered Probit model is treated as a latent trait of the data.This latent continuous distribution is always mapped to a discrete scale of interest first (cf.( 2), ( 3) and ( 4)), before fitting the model.Finally, although we describe here the Simplified Li2020 model, the same discussion applies to the original Li2020 model.Differently put, the original full model also assumes that observed subjective responses are realisations 6 This approach exemplifies what Liddell and Kruschke warn against in [7]. of a continuous normal random variable (even though these responses only take values from the {1, 2, 3, 4, 5} set).
1) G-test of Goodness-of-Fit: Let us first check how the SLI model performs when it comes to describing response distributions observed in typical MQA experiments.For this, we will use the same G-test-based procedure, as the one we applied to the GSD in Sec.III-B.Fig. 4 shows the GSD, SLI and Ordered Probit compared in terms of G-test results.Since only GSD data points fall below the black diagonal line, it is the only model that properly reflects response distributions observed in typical MQA experiments.What is more, the SLI model performs worse both from the GSD and from the Ordered Probit models.Performance inferior to the Ordered Probit model may be ascribed to SLI's lack of mapping to the 5-level scale, when estimating its parameters.In short, the SLI and Ordered Probit models do not properly describe response distributions observed in typical MQA experiments, whereas the GSD model does.
2) Bootstrapping: We now test whether the SLI model can beat the GSD if it comes to bootstrapping.For this, we apply the same procedure to the SLI model, as the one we applied to the GSD model in Sec.III-C.The result is given in Fig. 5b.As we can see, the SLI model performs better than the empirical probability mass function (EPMF) for small samples of size 12 and 24.However, it performs worse than the EPMF for small samples of size 50.When we compare the results of the SLI, with those of the GSD, we see that the latter outperforms the former in all cases.

IV. DISCUSSION
Section III-A shows that GSD's parameterisation is more interpretable and intuitive when compared to Ordered Probit's one.Importantly, Ordered Probit's parameterisation is not erroneous.Still, using it may lead to mistaken conclusions if used carelessly.Our results indicate that GSD's parameterisation should be preferred over Ordered Probit's one.This insight is relevant for the MQA research community since many practitioners decide to first try using continuous models (Ordered Probit being one of them) when they start working with subjective responses modelling.Arguably, their preference to choose continuous models comes from easier availability of methods operating on such models.We can also argue that continuous models get more attention during standard statistics and probability classes and thus naturally come to mind when thinking about data modelling.We want to protect MQA practitioners against potential mistakes arising from using continuous models to analyse discrete data.Our results indicate that discrete models, and the GSD in particular, are viable and better alternatives to continuous models when it comes to subjective responses analysis (with responses expressed on a discrete scale).
Section III-B reveals that the GSD well describes responses from typical MQA experiments.This property of the GSD indicates that the GSD can serve as a basis for building parametric methods for subjective responses analysis.Please note that parametric methods have greater power than nonparametric methods do.For example, a parametric hypothesis testing framework can detect a smaller effect size for a given sample size, when compared to a nonparametric framework.This increased power may prove essential when analysing responses from controlled subjective experiments.Since such experiments usually take place in a laboratory environment and require a direct involvement of a researcher, they can become resource intensive (both money-and time-wise).It is desirable (or sometimes necessary) to reduce sample size of such experiments. 7A parametric GSD-based data analysis framework would help address this problem.Due to its parametric nature it would be able to detect smaller differences between various test conditions for a given sample size, in comparison to other nonparametric methods.
Importantly, neither the GSD nor the Ordered Probit model properly describe response distributions observed in broadly understood or non-MQA subjective experiments (cf.Fig. 3b and Fig. 3c).This means the models are not globally applicable to modelling subjective responses expressed on the 5level Likert scale.Potentially, more complicated models (i.e., models with more than two parameters) are necessary to model phenomena present in responses from broadly understood or non-MQA experiments.
Sec. III-C makes it clear that the GSD outperforms the traditional approach (based on empirical distribution) when it comes to subjective responses bootstrapping.This result means that whenever there is a need to generate more results from a small real-life sample, the GSD should be preferred over empirical distribution to perform resampling.Such resampling may prove necessary when building an ML-based perceptual quality predictor.Building such a predictor requires a significant amount of data.Sample sizes sufficiently large may be difficult to generate through a controlled experiment.Thus, a small real-life sample can be collected through a controlled experiment.Then, the GSD-based bootstrapping can be used to enlarge the small real-life sample to a larger sample (of a size sufficient for building an ML-based perceptual quality predictor).Significantly, having such a mechanism at hand also addresses the issue of controlled experiments being money-and time-intensive.As previously, a small and not so expensive experiment may be organised to generate a small real-life sample of responses.This sample can then be enlarged using the GSD-based bootstrapping to achieve sample size that would otherwise require organising a larger and more expensive controlled experiment.We would like to remind the reader at this point that our results indicate that the mechanism described above applies to responses from typical MQA experiments exclusively.
Finally, Sec.III-D evidences that the GSD outperforms the state-of-the-art model, namely the Simplified Li2020 (SLI) model from [6].GSD's superiority is clear when it comes to goodness-of-fit testing for data from typical MQA experiments.Out of the three models tested (GSD, SLI and Ordered Probit), only the GSD properly describes response distributions observed in the data.If it comes to bootstrapping, the SLI, similarly to the GSD, outperforms the empirical distribution for small sample size of 12 and 24.However, GSD's improvement over the empirical distribution is larger for the two cases.Furthermore, only the GSD outperforms the empirical distribution for the small sample size of 50.

V. CONCLUSION
Our work substantiates the following four claims: 1) The GSD has interpretable parameters that clearly and intuitively describe response distribution shape (for responses gathered in MQA subjective experiments).
2) The GSD properly models response distributions observed in typical MQA subjective experiments.
3) The GSD is better suited for bootstrapping of responses from MQA subjective experiments in comparison to the traditional approach based on empirical distribution.4) The GSD outperforms the state-of-the-art model in terms of goodness-of-fit testing (on data from typical MQA experiments) and bootstrapping.The results indicate that the GSD-based bootstrapping of subjective responses from MQA experiments can be used to build new ML-based perceptual quality predictors, without having to organise large-scale controlled experiments.This gives a chance of building ML-based predictors cheaper than would be otherwise possible.
We hope that our discussion regarding interpretable GSD parameters and risks inherent to using continuous models to analyse discrete subjective responses, will convince the MQA research community to reconsider current best practices and recommendations.
There are two directions our future work may take.First, we would like to build a ML-based perceptual quality predictor.For this we plan to use the GSD-based bootstrapping.Second, we would like to propose a GSD-based parametric hypothesis testing framework for analysis of subjective responses from MQA experiments.
At last, we invite everyone to use the GSD to analyse subjective responses from their experiments and to make use of the tools presented in this paper.Our GitHub repository (https://github.com/Qub3k/subjective-exp-consistency-check)contains software tools that make it easier to start using the GSD.We hope the model and related tools will allow other MQA researchers and practitioners to analyse their data more efficiently and effectively.
where 0 log(0/(nq r k )) = 0. 6) Calculate bootstrap p-value using the following equation where I(x) is one if x is true or 0 if x is false.Naturally, the procedure can be applied to models other than the GSD.One has to replace the GSD estimator in steps 1) and 4), with an estimator appropriate for a model of interest.

APPENDIX C BOOTSTRAPPING EFFECTIVENESS TEST
This test indicates whether a given model better reflects the distribution shape of a large sample, when fitted to a subsample of this large sample.More specifically, the test checks how the model performs when compared to the empirical probability mass function (EPMF).Importantly, the procedure below assumes that we are operating on observations expressed on the 5-level ordinal scale (e.g., 5-level Likert scale).In the field of MQA this usually corresponds to the following five response categories: 1-Bad, 2-Poor, 3-Fair, 4-Good and 5-Excellent.
Let us denote by N the number of observations in the large sample (e.g., N = 200) and by n the number of observations in the subsample of this large sample (e.g., n = 24).Now, we denote by (N 1 , N 2 , N 3 , N 4 , N 5 ) the frequencies of each response category in the large sample.We denote by (p 1 , p 2 , p 3 , p 4 , p 5 ) the EPMF of the large sample.The test procedure is as follows: 1) Generate M C bootstrap samples (e.g., M C = 10 000) of size n from the EPMF of the large sample (p 1 , p 2 , p 3 , p 4 , p 5 ).2) For the r-th bootstrap sample (r = 1, 2, . . ., M C): a) Estimate response category probabilities using maximum likelihood estimation for the model of interest (e.g., the GSD model).Denote the estimated probabilities by (q 1 , q2 , q3 , q4 , q5 ).f) (Applies only when the GSD is compared to the EPMF.)Check if any of the two exceptions to the point above apply.i) If the bootstrap sample consists of observations in only one response category or in only two neighbouring response categories, we know for sure that the performance of the model and the empirical distribution is the same and so we set W r = 0 and move with the analysis to the subsequent bootstrap sample.ii) If the condition above is not met and for any response category k vk = 0 and, at the same time, N k = 0 then the model-based likelihood L m = 0 and the EPMF-based likelihood L e = 0.In such a situation W r = ∞.When implementing the test as a software programme we advise to assume that W r = 10 10 .
3) Calculate the estimator of p m − p e = P (W r > 0) − P (W r < 0), which is the difference between the probability that the model has greater likelihood than the EPMF and the probability that the EPMF has greater likelihood than the model.This can be formally described by the following.

A. Parameters Estimation Modification
In our analyses, we apply the above procedure using a modified probability estimators for the GSD, SLI and empirical distribution.We need such a modification, since if for a specific item we collected n answers, and none of them is category i, it does not mean that in a larger sample category i cannot be observed.It only means that with the number of answers under consideration, n in our case, the specific category, i in our example, was not observed.Therefore, concluding that the probability of observing category i is 0 should be modified to a positive value, smaller for larger n.Let us denote this value by (n).Significantly, one of the consequences of such modification is preventing any response category probability from being equal to 0.
It is an open question how to define (n).For the empirical distribution, we can simply add 0.5 to all response category counts (cf.To define ψ• (n) and ρ• (n) we introduce a limit for the maximum probability any two response categories can add up to (and call it p max ).Importantly, when assessing p max , we only take into account two most probable response categories.This can be formally written as follows:   The final algorithm for fitting the GSD to a sample is as follows.Find such ( ψ, ρ) that satisfy the following two criteria: 1) p max ≤ 1 − 1 n , where p max is given by ( 9) and n is the sample size, and 2) the likelihood function has the maximum value.An example of ψ and ρ ranges for different sample sizes is shown in Fig. 6.
For the SLI, it is sufficient to introduce a lower limit on σ.Specifically, if σ < c n , then σ = c n , where , with Q(p) being the quantile function of the standard normal distribution for probability p.This modification makes sure that the most probable response category gets at most 1/n of the probability mass.

Fig. 1 .
Fig. 1.Realisations (in the form of probability mass functions) of the GSD for a 5-point scale and various values of parameters ψ and ρ.Notice how the growing value of ρ corresponds to more responses accumulating close to the value of ψ.

Fig. 3 .
Fig. 3. p-Value histograms for the GSD (upper) and Ordered Probit (lower) models.p-Values come from the G-test of goodness-of-fit applied to stimuli from (a) typical Multimedia Quality Assessment (MQA) experiments, (b) typical and broadly understood MQA experiments and (c) non-MQA experiments.The thick vertical line marks the 0.05 significance level.

Fig. 4 .
Fig. 4. p-Value P-P plot for typical MQA experiments.p-Values come from the G-test of goodness-of-fit applied to the GSD, Ordered Probit and Simplified Li2020 (SLI) models, fitted to responses from typical MQA experiments.CDF stands for cumulative distribution function and ECDF for empirical cumulative distribution function.

Fig. 5 .
Fig. 5. Histograms depicting the distribution of probability differences pGSD − pe in (a) and pSLI − pe in (b).Three small sample sizes are considered: 12, 24 and 50.Blue-coloured parts of the bars represent statistically insignificant probability differences.Red bars with the hatching indicate the sum of probability differences to the right and to the left of zero (excluding insignificant results).
b) Denote by (v 1 , v2 , v3 , v4 , v5 ) the EPMF of the bootstrap sample.c) Find the likelihood L m of the estimated model for the large sample.In other words, calculate L m = 5 k=1,N k =0 qN k k .d) Find the likelihood L e of the bootstrap sample's EPMF for the large sample.In other words, calculate L e = 5 k=1,N k =0 vN k k .e) Find the natural logarithm of the ratio of the two likelihoods and denote it by W r W r = ln L m L e .Note that the above simplifies to W r = 5 k=1,N k =0 N k (ln qk − ln vk ) .
x) is one if x is true or 0 if x is false.4) Calculate .95confidence interval for p m − p e i.e., L = pm − pe − 1.96 pm + pe − (p m − pe ) 2 M C R = pm − pe + 1.96 pm + pe − (p m − pe ) 2 M C For L > 0 the model performs better.For R < 0 the EPMF performs better.If [L, R] contains zero there is no significant difference between the model and EPMF.

•
We indicate that the GSD well describes responses from typical MQA subjective experiments.•We show that the GSD outperforms empirical distribution when it comes to bootstrapping for responses from MQA subjective experiments.

TABLE I AN
OVERVIEW OF DATA SETS WE USE TO TEST THE GSD ON REAL DATA Exp. stands for experiments; av stands for audiovisual; mr stands for movie recommendation; bu stands for broadly understood.

TABLE II DISTRIBUTION
OF RESPONSES AMONG THE FOUR STUDIES USED IN THE BOOTSTRAP ANALYSIS.HDTV CORRESPONDS TO VQEG HDTV PHASE I STUDY.
Denote by (n 1 , n 2 , n 3 , n 4 , n 5 ) numbers of observed responses, i.e. n k is a number of answers k and By (p 1 , p 2 , p 3 , p 4 , p 5 ) denote unknown probabilities of the responses 1, . .., 5. We want to test H 0 : (p 1 , p 2 , p 3 , p 4 , p 5 ) are from the GSD against H 1 : (p 1 , p 2 , p 3 , p 4 , p 5 ) are not from the GSD.One should not use the chi-squared test in case of small numbers in selected cells, i.e. small n k for some k ∈ {1, . . ., 5}.