Misclassification Bias in Computational Social Science: A Simulation Approach for Assessing the Impact of Classification Errors on Social Indicators Research

A growing body of literature has examined the potential of machine learning algorithms in constructing social indicators based on the automatic classification of digital traces. However, as long as the classification algorithms’ predictions are not completely error-free, the estimate of the relative occurrence of a particular class may be affected by misclassification bias, thereby affecting the value of the calculated social indicator. Although a significant number of studies have investigated misclassification bias correction techniques, they commonly rely on a set of assumptions that are likely to be violated in practice, which calls into question the effectiveness of these methods. Thus, there is a knowledge gap with respect to assessing the impact of misclassification bias on a specific social indicator formula without strict reference to the number of classes. Moreover, given the error-prone nature of automatic classification algorithms, the quality of a predicted indicator can be assessed not only using regression quality metrics, as was done in the existing literature, but also using correlation metrics. In this paper, we propose a simulation approach for assessing the impact of misclassification bias on the calculated social indicators in terms of regression and correlation metrics. The proposed approach focuses on indicators calculated based on the distribution of classes and can process any number of classes. It allows selecting the most appropriate classification model for a particular social indicator, and vice versa. Moreover, it allows for an optimistic assessment of the level of correlation between the indicator calculated based on the results of the classification algorithm and the true underlying indicator.


I. INTRODUCTION
Many studies in the social sciences are presently examining the potential of machine learning (ML) algorithms [1], forming computational social science, the academic sub-discipline concerned with computational approaches to the social sciences. Digital trace data are of special interest in the context of ML-based analysis, as their huge volume makes manual analysis a significant challenge. According to Howison et al. [2], digital trace data are found (rather than produced for research), event-based (rather than summary data), and longitudinal (since events occur over a period of time) data that are both produced through and stored by an information system. These characteristics make digital traces an ideal source for building social indicators, defined by Ferriss [3] as statistical time series "used to monitor the social system, helping to identify changes and to guide intervention to alter the course of social change." A typical example of social indicators constructed using ML analysis is the estimation of subjective well-being (SWB) based on user-generated content from social media, by employing an ML model trained to classify the sentiment of posts [4], [5]. From a practical point of view, a classification algorithm is commonly used to assign digital traces to the classes of interest; then, based on the distribution of these classes, an indicator is calculated for the entire population [6].

The associate editor coordinating the review of this manuscript and approving it for publication was Claudia Raibulet.
However, as long as the ML algorithm's predictions are not completely error-free (we will hereinafter refer to this phenomenon as misclassification bias), the estimate of the relative occurrence of a particular class can be biased [7]-[9].
The key issue here is that classification that is optimal at the level of individual digital traces can still lead to biased estimates of the class proportions and, subsequently, to a biased estimate of a social indicator. Generally accepted success criteria for classification, such as accuracy and F-measure on a test dataset [10], are appropriate for individual-level classification but can be seriously misleading when characterizing document populations or dynamics within populations [11]. For example, Zunic et al. [4] conducted a survey on sentiment analysis in health and well-being studies and found that the average classification accuracy for sentiment analysis was around 80%. This can be considered an acceptable classification performance for sentiment analysis, but suppose that all the misclassified objects were in a particular direction for one or more classes. In that case, the statistical bias in using this method to estimate the aggregate quantities of interest could be as high as 20 percentage points. Moreover, because the target social indicator is then calculated from the obtained proportions, the deviation of the calculated indicator from the true indicator value may stay the same, grow, or shrink, adding another degree of uncertainty. As highlighted in the measurement error studies [12] by the US Department of Education, all data collection errors, including misclassification errors, affect the final value of the calculated indicator depending on the specific formula of interest. Researchers have repeatedly reported cases where, because of errors in the data, including incorrect classification, the research results did not fully correspond to the real state of affairs. For example, Wolff et al. [13] examined data error in health, education, and income statistics used to construct the Human Development Index and found that up to 34% of countries were misclassified.
By replicating prior studies, the authors showed that key estimated parameters varied by up to 100% due to data errors. Other papers also indicated errors in aggregate statistical data for suicide [14], disability [15], mortality [16], and life satisfaction [17]. Thus, despite the fact that many studies have examined methods of correcting classification bias [9], [11], we can conclude based on the previously mentioned cases that, in social indicators research, it is essential to analyze this bias together with the mathematical formula¹ used to calculate the indicator. Also, even though the influence of the classification bias on the classification results has been indicated in the literature, to the best of our knowledge, the impact of the misclassification bias on the calculated social indicators has not yet been assessed. This paper provides a simulation approach for estimating the impact of misclassification bias on the calculated social indicators. We consider only indicators calculated based on the distribution of classes, without any restrictions on the number of classes.

¹ As an example of a mathematical formula, we can consider the formula for educational attainment of 30-34 year olds described in the ETF Manual on the Use of Indicators [18]. Educational attainment refers to the highest educational level achieved by individuals expressed as a percentage of all persons in that age group. Mathematically, the formula is defined as follows:
$$\frac{\text{population 30-34 years old with tertiary education}}{\text{total population 30-34 years old}} \times 100\%.$$

The contributions of this study are fivefold:
• We propose a simulation approach for assessing the impact of misclassification bias on the calculated social indicators, which can be used for the following purposes:
  -- Selecting the most appropriate classification model for a particular social indicator, and vice versa.
  -- Assessing the level of correlation between the indicator calculated based on the results of the classification algorithm and the true underlying indicator.
• Within the proposed simulation approach, we also define a formal model of online social data for social indicators research, which can be further used by academics.
• Within the proposed simulation approach, we also propose a method for approximating the predicted indicator based on the algorithm's confusion matrix and the true indicator.
• Within the proposed simulation approach, we also propose a method for aggregation and interpretation of multiple correlation coefficients with p-values.
• We provide illustrative application examples of the proposed simulation approach and draw conclusions based on the simulation outcomes.

Considering that the proposed simulation approach relies on a series of assumptions (see assumptions 1, 2, and 3, defined below) that can be violated in practice, the outcomes of the approach for real-life studies should be considered as naive and optimistic. However, we believe that this study contributes to the body of knowledge on computational social sciences and lays the foundation for future research on the impact of misclassification bias on calculated social indicators.
This article is organized as follows. In Section II, we provide a brief overview of simulation modeling. In Section III, we describe the literature analysis and indicate the knowledge gap. In Section IV, we propose a model for social indicators research based on digital traces. In Section V, we propose a simulation approach for assessing the impact of misclassification bias on social indicators research. In Section VI, we provide an illustrative example of applying the proposed approach to synthetic and real-life classification algorithms. Finally, in Section VII, we draw conclusions and suggest future research directions.

II. BACKGROUND
Simulation modeling is a special kind of mathematical modeling in which the system under study is replaced by a model describing the real system with sufficient accuracy, with which further experiments are conducted to obtain information about this system. In other words, a simulation model attempts to approximate a system's behavior and development over time by running a model [19]. Simulation models tend to be simplified abstractions of the system being modeled, the purpose of which is to capture a certain level of detail necessary to achieve the objectives of the study [20]. Simulation modeling is commonly used when a real system cannot be engaged, when an analytical description cannot be formulated, or when creating an analytical model is fundamentally impossible. In a broader sense, computer simulation attempts to approximate the behavior of a system and its development over time by implementing and running a computer simulation model. By changing the conditions and variables in the implemented simulation model, researchers can make predictions about the behavior of the simulated system without having to implement the entire system. Computer simulations are commonly used when performing system emulation is challenging or when it is necessary to emulate a system as part of a more complex environment [20].

VOLUME 10, 2022
The literature has already described many examples of simulation models for systems of varying complexity [21]-[25], with which experiments are carried out to obtain information about these systems. In each study, the performance metrics of a simulation model were closely related to the system being simulated and the goals of the simulation. For example, Gunal and Pidd [21] simulated the Accident and Emergency (A&E) Department at UK Hospitals and defined a performance measure as the percentage of patients who stayed in A&E more than 4 hours. Memon et al. [23] simulated blockchain systems and defined performance measures as the number of transactions per block, mining time of each block, system throughput, memorypool count, waiting time in memorypool, number of unconfirmed transactions in the whole system, total number of transactions, and number of generated blocks. Chan and Zhang [24] simulated a supply chain and defined a performance measure as the retailer's total cost. Thus, when creating a simulation model, it is also necessary to determine the key objectives to be achieved and the metrics to be obtained.

III. RELATED WORK
Correct classification of individuals, values, and attributes is an essential element of any study. Misclassification occurs when an individual, a value, or an attribute is assigned to a category other than that to which it should be assigned. This erroneous classification can lead to incorrect associations being observed between the assigned categories and the outcomes of interest [26], thereby biasing inferences drawn from the data collected [27], often substantially [28], or decreasing the power of the study [29]. As highlighted by Kloos et al. [9], misclassification bias occurs in a broad range of applications, including epidemiology [30], political science [31], and official statistics [32]. The objective of these applications is to shift focus from minimizing loss functions at the level of individual predictions to the level of aggregated predictions. In the context of ML, this objective is studied under quantification learning. Quantification learning aims to provide an aggregate estimation for unseen data by applying a model trained using a training dataset with a different data distribution [33]. However, there are certain drawbacks associated with the use of quantification learning in real-life studies. Firstly, researchers note that since quantification learning is at an early stage of development, a more comprehensive theoretical analysis is required to better formulate both the behavior of these algorithms and the learning objective in general [33], [34]. Secondly, most of the efforts have focused on tackling binary quantification, whereas quantification for more than two classes remains under-explored [33]. This is crucial for real-life applications: in many cases, there are more than two classes of interest. Lastly, there is a lack of proper benchmark datasets for quantification [33].
Quantification studies require relatively large training and test datasets to train the model and obtain meaningful results and conclusions [9], [33]. In the application areas mentioned above, data annotation tends to be expensive, and therefore there is little ready-made annotated data. Thus, much work remains to be done by the scientific community before quantification learning can be freely applied to real-life research.
At the same time, a growing body of literature has investigated methods to reduce misclassification bias when aggregating categorical data from the level of individual predictions. For example, Hopkins et al. [11] explored new methods of automated content analysis designed to estimate the primary quantity of interest in many social science applications. As a part of the research, they also highlighted that misclassification bias may significantly affect the distribution of predicted classes. The authors proposed a classify-and-count method that gives approximately unbiased estimates of category proportions based on misclassification probabilities for each class, adjusting the distribution of predicted classes by the confusion matrix normalized over rows. However, as shown later by Kloos et al. [9], the classify-and-count estimator is still (strongly) biased, so its application does not provide an unbiased estimate. In their paper, the authors also studied five existing estimation techniques to reduce the misclassification bias of binary classification algorithms. These methods for misclassification bias correction are commonly based on the assumption that misclassifications are independent across objects and that their probabilities are the same for each object. This assumption is often violated in practice, as misclassification in ML models is not random but tied to particular groups of objects that the model finds difficult to separate from each other. This violation was partially confirmed in the study by Soroka et al. [35], which identified that different sentiment lexicons capture different underlying phenomena and highlighted "the importance of tailoring lexicons to domains to improve construct validity." As a consequence, the use of such methods can lead to the bias caused by misclassification being replaced by another bias caused by the application of the correction method.
As Armstrong [36] mentioned, these methods can be complicated to use and should be applied cautiously because "correction" can magnify confounding if it is present. Moreover, although the majority of these methods focus on misclassification bias correction for two classes, real-life studies are commonly interested in more than two classes.
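To make the idea behind confusion-matrix-based correction concrete, the following minimal sketch works through a binary example. It is an illustration only, not the exact estimator of [9] or [11]: it assumes that the misclassification probabilities estimated on the test set hold exactly for the analyzed data, and the function names and all numbers are hypothetical.

```python
def row_normalize(cm):
    """Turn confusion-matrix counts (rows = true class) into P(predicted | true)."""
    return [[c / sum(row) for c in row] for row in cm]


def predicted_proportions(true_props, probs):
    """Expected share of each predicted class: q_j = sum_i p_i * P(j | i)."""
    m = len(true_props)
    return [sum(true_props[i] * probs[i][j] for i in range(m)) for j in range(m)]


def adjusted_binary_proportion(q_pos, probs):
    """Solve q_pos = p * TPR + (1 - p) * FPR for the true positive share p."""
    tpr = probs[0][0]  # P(predicted positive | true positive)
    fpr = probs[1][0]  # P(predicted positive | true negative)
    return (q_pos - fpr) / (tpr - fpr)


# Hypothetical test-set confusion matrix: rows = true (pos, neg), cols = predicted.
cm = [[80, 20],
      [10, 90]]
probs = row_normalize(cm)

true_props = [0.10, 0.90]                     # assumed true class shares
q = predicted_proportions(true_props, probs)  # what simple counting would report
p_hat = adjusted_binary_proportion(q[0], probs)

print("share of positives by simple counting:", round(q[0], 4))
print("share of positives after adjustment:  ", round(p_hat, 4))
```

With these assumed numbers, simple counting reports about 17% positives although the true share is 10%. The adjustment recovers 10% only because the assumed error rates hold exactly, which is the assumption the text above describes as often violated in practice.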
However, there has thus far been little discussion about the impact of misclassification bias on the specific social indicator formula rather than class proportions. We argue that these objectives should be treated separately, since not all classes may be taken into account in the target formula, and misclassifications between certain classes may compensate for each other to some extent. An absolutely accurate assessment of the quality of a social indicator calculated based on the results of automatic classification (hereinafter referred to as predicted indicator) is possible only if there is access to the true underlying value of the indicator (hereinafter referred to as true indicator), obtained using a completely correct classification approach. Manual data labeling is now considered the benchmark in ML, and in almost all cases, researchers try to train their models to classify data just as accurately. Thus, classes obtained using manual annotation with high-quality guidelines² and a high inter-rater agreement can be considered as practically the only³ source of completely correct classification. However, since the automatic analysis of a huge amount of data is of great interest in the computational social sciences, annotating all the data manually is not only extremely difficult and time-consuming in practice but, first and foremost, extremely expensive. Moreover, even with enough annotated data, researchers may experience overfitting when training a model. Overfitting is a fundamental challenge in supervised ML that prevents models from generalizing well from the observed training data to unseen data in the test set [38]. Due to overfitting, a model tends to perform very well on the training set but poorly on the test set.
Core to model training are appropriate ways for data splitting or resampling: the final training parameters must be chosen only after evaluating a number of tuning parameters via data splitting or resampling [39]. In particular, cross-validation is one of the most widely used data resampling methods to estimate the true prediction error of models and to tune model parameters, thereby preventing model overfitting [40]. The cross-validation procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, this procedure is often referred to as k-fold cross-validation. Various other strategies for addressing overfitting can be found in the recent survey papers [38], [41], [42] on that topic.

It should also be noted that existing misclassification bias correction methods primarily use regression quality metrics, such as mean absolute error (MAE) or mean squared error (MSE), to assess their performance. However, given the error-prone nature of the existing algorithms for automatic classification, the research interest may lie not so much in the absolute agreement between the indicator calculated from the predictions and the true underlying one as in their correlation. In this case, it would not be entirely correct to draw conclusions about the absolute values of the indicator based on the results of the study, but it would be possible to analyze its changes over time. Thus, there is a knowledge gap regarding the assessment of the impact of misclassification bias on a specific social indicator formula⁴ without strict reference to the number of classes. Moreover, given the error-prone nature of automatic classification algorithms, the quality of a predicted indicator can be assessed not only using regression quality metrics, but also with correlation metrics.
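The difference between the two kinds of metrics can be illustrated with a small numerical sketch. The series below are invented for illustration: the predicted indicator is assumed to underestimate the true indicator by a constant 0.10 in every time interval, and the helper names `mae` and `pearson_r` are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between two equally long series."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented true indicator and a predicted indicator with a constant shift.
true_indicator = [0.50, 0.55, 0.53, 0.60, 0.58, 0.65]
predicted_indicator = [v - 0.10 for v in true_indicator]

error = mae(true_indicator, predicted_indicator)
correlation = pearson_r(true_indicator, predicted_indicator)

print("MAE:", round(error, 4))
print("Pearson r:", round(correlation, 4))
```

In this constructed case the MAE is about 0.10, so the absolute level of the predicted indicator is unreliable, while the correlation is essentially 1, so its changes over time track the true indicator.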
Based on these findings, we propose a simulation approach for assessing the impact of misclassification error on a particular social indicator formula, given the classification performance of the algorithm and information about the data available for analysis.

IV. MODEL
In this Section, we propose a model for social indicators research based on digital traces. We applied classical set theory to develop our model, as recent literature [43], [44] articulated a series of its advantages in the case of computational social sciences.

A. CONCEPTUAL MODEL
The Online Social Data Model for Social Indicators Research consists of three elements: Digital Traces, Classification, and Indicators. The Digital Traces represent the source data found for the analysis, which are event-based and longitudinal and thus suitable for constructing social indicators. The Classification represents the automated approaches for digital trace object classification based on ML methods. The Indicators represent a methodology for calculating social indicators based on classification results and estimating their quality. When constructing the model, we assumed that the source digital traces for the analysis are representative of the general population, so no additional sampling methods need to be applied. As a consequence, all information about individuals can be omitted.
[...] overlapping time intervals such that $ti_i < ti_{i+1}$, [...] trace objects, and [...] is a partial function mapping time intervals to mutually disjoint non-empty subsets of digital traces created in that time interval. A subset of digital trace objects created in the time interval $ti_i$ will be hereinafter referred to as $X_{ti_i}$, i.e., $interval(ti_i) = X_{ti_i}$. The number of items in a subset $X_{ti_i}$ will be hereinafter referred to as $N_{ti_i} \in \mathbb{N}$.
The Classification of the Online Social Data Model for Social Indicators Research is defined as a [...]
• $QM = \{qm_1, qm_2, \ldots, qm_U\}$ is a set of $U$ target quality measures where each item represents a function $qm_i$: [...]
• [...] aggregated target quality measures where each $i$-th item represents an aggregation function suitable for $qm_i$ and defined as $aqm_i : (\mathbb{R}^L)^V \to \mathbb{R}^H$, where $L \in \mathbb{N}$ is the number of calculated target quality measures to be aggregated and $H \in \mathbb{N}$ is the size of the vector representing the aggregated target quality measure.
A particular target quality measure function $qm_i$ can be represented in a variety of ways depending on the needs of the particular social indicators research, for example, MAE for identifying the deviation of the approximated indicator from the true indicator, or the Pearson correlation coefficient (Pearson's r) for identifying the correlation between those indicators. Accordingly, $qm_i$ returns a vector of real numbers, since different quality measures may return different numbers of values (e.g., one value in the case of MSE and two values representing the confidence interval in the case of Pearson's r). The aggregated target quality measure $aqm_i$ returns a vector of $H$ real numbers because the aggregation approach may vary depending on the target quality measure $qm_i$ and the specifics of the research (e.g., macro-averaging can be applied for MAE, resulting in a one-dimensional vector, and the confidence interval can be calculated based on Fisher z-transformation for Pearson's r, resulting in a two-dimensional vector).
$TSI$ calculated based on the mapping function $f_T$ will be further referred to as $TSI_T$. $TSI$ calculated based on the algorithm $f_P$ will be referred to as $TSI_P$.
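The roles of a target quality measure and its aggregation can be sketched as plain callables. This is one possible reading of the definitions above, not the authors' implementation: a quality measure maps a pair of indicator series to a short vector of real numbers, and an aggregated measure maps the vectors collected over several simulation runs to a single vector. All names and numbers are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# qm: (true indicator series, predicted indicator series) -> vector of reals
QualityMeasure = Callable[[Sequence[float], Sequence[float]], Vector]
# aqm: vectors from several simulation runs -> one aggregated vector
AggregatedMeasure = Callable[[Sequence[Vector]], Vector]


def mae_measure(true_series: Sequence[float], pred_series: Sequence[float]) -> Vector:
    """MAE returned as a one-element vector."""
    n = len(true_series)
    return [sum(abs(t - p) for t, p in zip(true_series, pred_series)) / n]


def mean_aggregate(results: Sequence[Vector]) -> Vector:
    """Average each vector component over all simulation runs."""
    runs = len(results)
    width = len(results[0])
    return [sum(r[k] for r in results) / runs for k in range(width)]


qm: QualityMeasure = mae_measure
aqm: AggregatedMeasure = mean_aggregate

# Two invented simulation runs with two time intervals each.
run_results = [
    qm([0.5, 0.6], [0.4, 0.6]),
    qm([0.5, 0.6], [0.5, 0.5]),
]
aggregated = aqm(run_results)
print(aggregated)
```

Keeping both kinds of functions behind the same vector-valued signature lets a correlation-based measure, which returns more than one number, be plugged in without changing the surrounding simulation loop.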

C. PROBLEM STATEMENT
In terms of the defined notation, the problem statement for the estimation of the impact of misclassification bias on the calculated social indicators can be defined in the following way. Given a trained classification model $f_P$ and its error matrix $CM$ on a test dataset, data for analysis $X$, an indicator calculation formula $I$, and formulas for the target quality metric $qm_i$ and aggregated target quality metric $aqm_i$, it is necessary to estimate the classification bias $AQ_m$.

V. SIMULATION APPROACH FOR ASSESSING THE IMPACT OF MISCLASSIFICATION BIAS ON SOCIAL INDICATORS RESEARCH
As mentioned earlier, the assessment of the impact of misclassification bias on calculated social indicators is possible only if there is access to the true value of the indicator, obtained using a completely correct classification approach. In our approach, we propose to simulate the true indicator, then, on its basis, approximate the results of the classification algorithm, and then calculate the quality metrics. Formally, the proposed approach consists of three steps.
1) Simulate the true indicator $TSI_T$ by simulating the true mapping function $f_T$.
2) Approximate the predicted indicator $TSI_P$ by approximating an algorithm $f_P$ based on the true mapping function $f_T$.
3) Calculate the quality $qm_i$ of the predicted indicator $TSI_P$ for multiple simulations, and then calculate the aggregated quality score $aqm_i$.

The proposed approach is based on the following assumptions.
Assumption 1: The training data for the classification model was labeled manually using high-quality guidelines, and the annotators demonstrated a high inter-rater agreement score. Consequently, we can consider that all digital traces in the training dataset were assigned with class labels that completely match the true underlying parameter being annotated.
Assumption 2: The classification model was trained on the training data representative of the digital traces available for analysis.
This assumption in combination with assumption 1 allows us to consider that class distribution in the digital traces available for analysis is equal to class distribution in the training dataset.
Assumption 3: (Mis)classifications are independent across objects, and the (mis)classification probabilities are the same for each object, conditional on their true class label. Consequently, we could use a confusion matrix for approximating the predicted index based on true index using the inverse classify-and-count approach [11] for misclassification bias correction.
Taking into account the simulation and approximation nature of the proposed approach, as well as the set of applied assumptions that can be violated in practice, the outcomes of the approach for real-life studies should be considered as naive and optimistic. In other words, we recommend considering the outcomes as an optimistic assessment representing the best case in real-life studies, unless it is possible to prove that all assumptions are fulfilled.

A. TRUE INDICATOR SIMULATION
Since both true and predicted indicators are calculated based on the number of objects mapped to specific classes for each time interval, the simulated data for each time interval are a vector with a dimension equal to the number of classes $M$, and it is defined as follows:
$$SCD_{ti_i} = \left( scd_{ti_i,y_1}, scd_{ti_i,y_2}, \ldots, scd_{ti_i,y_M} \right).$$
Also, the simulated data can be presented as a time series where each element $scd_{ti_i,y_j}$ represents the number of digital traces contained in time interval $ti_i$ and labeled as class $y_j$. Since the true indicator is unknown, we propose to synthetically generate the number of objects of each class for each analyzed time interval and calculate the true indicator $TSI_T$ based on the generated data. Considering that the class distribution in the digital traces available for analysis is equal to the class distribution in the training dataset (see assumption 2), we can expect the simulated data to satisfy the following condition: [...] where $cm_{y_i,y_j}$ is the number of objects with true class $y_i$ classified as $y_j$, as further defined in Eq. (4). At the same time, we do not expect the class distribution for a specified time interval to be equal to the class distribution in the training dataset, since according to assumption 2, the training dataset is representative of the whole set of data available for the analysis but not necessarily of a particular slice of these data. In essence, this means that we simulate the behavior of the mapping function $f_T$. For each time interval, the total number of generated objects should not exceed the number of objects contained in the data for analysis during the same time interval. $SCD_{ti_i}$ generated for the calculation of the true indicator will be hereinafter referred to as $SCD_{T,ti_i}$.
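A minimal sketch of this simulation step is given below. It is one possible generator, not the one used by the authors: it assumes that the overall class shares are known (for example, from the row sums of the confusion matrix), builds a pool of class labels with exactly those overall shares, and spreads the pool randomly over the time intervals, so that individual intervals deviate from the overall distribution only by sampling noise. The function name, the interval sizes, the class shares, and the example indicator (the percentage of class 0 per interval) are illustrative assumptions.

```python
import random


def simulate_true_counts(interval_sizes, class_shares, seed=0):
    """Return one vector of per-class counts for every time interval."""
    rng = random.Random(seed)
    total = sum(interval_sizes)

    # Overall number of objects per class, rounded by the largest-remainder rule
    # so that the counts add up to the total number of objects.
    raw = [share * total for share in class_shares]
    counts = [int(r) for r in raw]
    by_remainder = sorted(range(len(raw)), key=lambda j: raw[j] - counts[j], reverse=True)
    for j in by_remainder[: total - sum(counts)]:
        counts[j] += 1

    # Pool of class labels with exactly these totals, shuffled over all intervals.
    pool = [j for j, c in enumerate(counts) for _ in range(c)]
    rng.shuffle(pool)

    series, start = [], 0
    for n in interval_sizes:
        chunk = pool[start:start + n]
        series.append([chunk.count(j) for j in range(len(class_shares))])
        start += n
    return series


interval_sizes = [100, 120, 80]   # assumed number of objects per time interval
class_shares = [0.5, 0.3, 0.2]    # assumed overall class distribution (M = 3)

series = simulate_true_counts(interval_sizes, class_shares)
# Example indicator: percentage of class 0 among all objects of the interval.
indicator = [100 * row[0] / sum(row) for row in series]

for counts, value in zip(series, indicator):
    print(counts, round(value, 2))
```

A generator with stronger trends or seasonality in the per-interval shares could be substituted as long as the overall class distribution is preserved.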

B. PREDICTED INDICATOR APPROXIMATION
Once the true mapping function is defined and the true indicator is calculated, we must define an algorithm approximating the true mapping function (i.e., classification model) $f_P$. In other words, we need to correct misclassification bias. To begin with, we need estimates of the algorithm's (mis)classification probabilities. Following [45], we assume that misclassifications are independent across objects and that the (mis)classification probabilities are the same for each object, conditional on their true class label. The (mis)classification probabilities for each class are estimated via the confusion matrix normalized over true classes, which is calculated based on the confusion matrix $CM$. After that, we must adjust the true class distribution $SCD_{T,ti_i}$ by the (mis)classification probabilities to acquire the approximate predicted class distribution. A similar approach has been widely used in the literature [9], [11], but only for the inverse problem, namely correcting the classification bias in already predicted classes.
The confusion matrix can be presented as
$$CM = \begin{pmatrix} cm_{y_1,y_1} & cm_{y_1,y_2} & \cdots & cm_{y_1,y_M} \\ cm_{y_2,y_1} & cm_{y_2,y_2} & \cdots & cm_{y_2,y_M} \\ \vdots & \vdots & \ddots & \vdots \\ cm_{y_M,y_1} & cm_{y_M,y_2} & \cdots & cm_{y_M,y_M} \end{pmatrix}, \quad (4)$$
where each row of the matrix represents the instances in an actual class, and each column represents the instances in a predicted class. An asterisk refers to whole rows or columns in a matrix. For example, $cm_{y_i,y_*}$ refers to the $i$-th row of $CM$, and $cm_{y_*,y_j}$ refers to the $j$-th column of $CM$:
$$cm_{y_i,y_*} = \begin{pmatrix} cm_{y_i,y_1} & cm_{y_i,y_2} & \cdots & cm_{y_i,y_M} \end{pmatrix}, \quad (5)$$
$$cm_{y_*,y_j} = \begin{pmatrix} cm_{y_1,y_j} & cm_{y_2,y_j} & \cdots & cm_{y_M,y_j} \end{pmatrix}^{T}. \quad (6)$$
A confusion matrix normalized over true classes can be further calculated as follows:
$$CM_{nte} = \left[ \frac{cm_{y_i,y_j}}{\sum_{k=1}^{M} cm_{y_i,y_k}} \right]_{i,j = 1}^{M}. \quad (7)$$
Assuming that our model is unbiased toward a specific type of error (i.e., the probability of the model making an error is distributed randomly) and always follows a given confusion matrix $CM$, we can approximate the non-normalized confusion matrix of our model for the simulated data as follows: [...] Note that the normalized confusion matrix $CM_{nte}$ operates with $\mathbb{R}$, whereas the non-normalized confusion matrix operates with $\mathbb{N}$, so it is necessary to round the results of matrix multiplication and randomly adjust them to meet the target class distribution if necessary. The simulated distribution of predicted classes based on the simulated true class distribution $SCD_{T,ti_i}$ for a given time interval $ti_i$ is as follows: [...]

Finally, we can calculate $Y^{N_{ti_i}}$ (i.e., the set of $N_{ti_i}$ classified objects created in time interval $ti_i$ that is mapped to an indicator value) based on the obtained class distributions $SCD_{T,ti_i}$ and $SCD_{P,ti_i}$ for further calculation of the true indicator and predicted indicator, respectively. Since the order of the items in $Y^{N_{ti_i}}$ is not important, we can define the order of items in any way following our class distributions. After that, we can calculate $TSI_T$, $TSI_P$, and $qm_i$. By repeating the entire procedure multiple times,⁵ we can obtain multiple $qm_i$ and calculate $aqm_i$. However, whereas aggregation methods for metrics such as MAE and MSE are well defined (for example, a simple average), aggregating correlations tends to be a more challenging task. Note that in the case of correlation analysis of time series, it is important to check that these time series are stationary, and if they are not, then apply some technique to make them stationary (e.g., differencing). In the remainder of this section, we assume that the analyzed time series are stationary.

⁵ Determining the number of required simulation runs lies outside the scope of this paper. For more information on this topic, please refer to [46].
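The approximation step can be sketched as follows, under the stated assumption that the model always follows its confusion matrix. In this reading, the expected confusion matrix for one time interval is obtained by multiplying each simulated true class count by the corresponding row of the row-normalized confusion matrix, and the predicted class distribution is given by the column sums. The sketch rounds with the deterministic largest-remainder rule instead of the random adjustment mentioned in the text; the function name and the numbers are hypothetical.

```python
def approximate_predicted_counts(true_counts, cm):
    """Approximate predicted-class counts for one time interval."""
    m = len(cm)
    # Confusion matrix normalized over true classes (rows sum to 1).
    probs = [[c / sum(row) for c in row] for row in cm]

    # Expected, real-valued confusion matrix for the simulated interval.
    expected = [[true_counts[i] * probs[i][j] for j in range(m)] for i in range(m)]

    # Column sums give the expected number of objects per predicted class.
    raw = [sum(expected[i][j] for i in range(m)) for j in range(m)]

    # Round to integers while preserving the interval total (largest remainder).
    total = sum(true_counts)
    counts = [int(r) for r in raw]
    order = sorted(range(m), key=lambda j: raw[j] - counts[j], reverse=True)
    for j in order[: total - sum(counts)]:
        counts[j] += 1
    return counts


# Hypothetical test-set confusion matrix: rows = true class, cols = predicted class.
cm = [[80, 20],
      [10, 90]]
true_counts = [10, 90]   # simulated true class counts for one time interval

predicted_counts = approximate_predicted_counts(true_counts, cm)
print(predicted_counts)
```

Applying the function to every simulated time interval yields the predicted class distributions from which the predicted indicator can be computed with the same indicator formula as the true one.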
Below, we present a method for aggregating Pearson's or Spearman's correlation coefficients. This method consists of two parts: aggregation of correlation coefficients and aggregation of p-values.

1) AGGREGATION OF CORRELATION COEFFICIENTS
For combining Pearson or Spearman correlations, we propose using the aggregation method based on the Fisher z-transformation described in [47]. Let $p_i$ denote an estimate of the Pearson or Spearman correlation in study $i$ (in our setting, simulation run $i$) and $n_i$ denote the number of observations in study $i$. An estimate of the average study population correlation is defined as follows.
An estimate of the variance is defined as follows.
where $\mathrm{var}(p_i)$ is the variance of a particular correlation. The variance for the Pearson correlation is defined as follows.
Variance for Spearman correlation is defined as follows.
An approximate two-sided $100(1 - \alpha)\%$ confidence interval for the average study population correlation is where $z$ is the $1 - \alpha/2$ quantile of a standard normal distribution (i.e., the probit) corresponding to the target error rate $\alpha$. This method should be applied either to Spearman correlations only or to Pearson correlations only because the two types of correlation are not comparable [47].
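As a rough illustration of this aggregation, the sketch below assumes that the estimators take the form commonly used for meta-analytic intervals of bivariate correlations: a simple mean of the correlations, a variance of $(1 - p_i^2)^2 / (n_i - 3)$ for Pearson (multiplied by $1 + p_i^2/2$ for Spearman), and a confidence interval built on the Fisher z scale. These formulas are assumptions made for the example and should be checked against [47]; the function name `aggregate_correlations` is hypothetical.

```python
import math
from statistics import NormalDist


def aggregate_correlations(corrs, ns, alpha=0.05, kind="pearson"):
    """Average several correlations and return (mean, (lower, upper)).

    Assumed formulas (to be verified against the cited method):
      mean      = sum(p_i) / m
      var(mean) = sum(var(p_i)) / m**2
      CI        = tanh(atanh(mean) -/+ z * sqrt(var(mean)) / (1 - mean**2))
    Not defined when the mean correlation is exactly +1 or -1.
    """
    if kind not in ("pearson", "spearman"):
        raise ValueError("kind must be 'pearson' or 'spearman'")
    m = len(corrs)
    mean_r = sum(corrs) / m

    def var_r(r, n):
        base = (1 - r ** 2) ** 2 / (n - 3)
        # Spearman correlations get an extra inflation factor.
        return base if kind == "pearson" else (1 + r ** 2 / 2) * base

    var_mean = sum(var_r(r, n) for r, n in zip(corrs, ns)) / m ** 2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(var_mean) / (1 - mean_r ** 2)
    centre = math.atanh(mean_r)
    return mean_r, (math.tanh(centre - half_width), math.tanh(centre + half_width))


# Three simulation runs, each with 36 time intervals (illustrative values).
mean_r, (ci_low, ci_high) = aggregate_correlations([0.5, 0.6, 0.7], [36, 36, 36])
```

Working on the Fisher z scale keeps both bounds inside the interval (-1, 1), which a symmetric interval around the mean correlation would not guarantee.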

2) AGGREGATION OF P-VALUES
According to Heard and Rubin-Delanchy [48], there are six most fundamental or commonly used statistics for combining p-values: Fisher's method ( [53], where $\Phi$ is the standard normal cumulative distribution function, and Tippett's method ($S_T = \min(p_1, p_2, \ldots, p_n)$) [54]. Although each method is optimal in some setting, all of them can be considered strict methods: they tend to reject a hypothesis if even a very small fraction of the tests does not reach the specified level of statistical significance. Given that the approximation algorithm is not error-free and simulations can be repeated an infinite number of times, it is extremely likely that the confidence interval for some simulation iterations will not be statistically significant. Thus, to interpret the results of this work, it is necessary to formulate a softer approach to the aggregation of p-values that takes into account the error-prone nature of the classification algorithm. As a softer method of aggregation, we propose a probabilistic approach that focuses not on the absolute values of the calculated p-values but on whether they satisfy a predefined level of significance (for example, less than 0.05). In this case, the list of p-values can be converted into a list of 0s (not satisfying) and 1s (satisfying) and then treated as a sample from a binomial distribution.
$\hat{P} = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_n\} = \{p_1 < \alpha, p_2 < \alpha, \ldots, p_n < \alpha\}$. (15)
For the obtained binomial distribution $\hat{P}$, we can calculate a binomial proportion confidence interval and consider it as our soft approximation. We can further interpret it as a confidence interval for the probability of obtaining statistically significant results with a predefined confidence level at least as extreme as the observed results of a statistical hypothesis test.
The resulting formula is defined as follows: where $n = n(\hat{P})$ is the total number of experiments, $n_S = n(\{\hat{p}_i = 1 \mid \forall \hat{p}_i \in \hat{P}\})$ is the number of successes, $n_F = n(\{\hat{p}_i = 0 \mid \forall \hat{p}_i \in \hat{P}\}) = n - n_S$ is the number of failures, and $z$ is the $1 - \alpha/2$ quantile of a standard normal distribution (i.e., the probit) corresponding to the target error rate $\alpha$. Since $\hat{p}_i = 0$ is considered a failure and $\hat{p}_i = 1$ is considered a success, the bounds of the confidence interval of aggregated p-values $CI_p$ tend to 1 when $p_i < \alpha$ holds in all runs and to 0 when $p_i \geq \alpha$ holds in all runs. For example, if both bounds of $CI_p$ are greater than 0.95 (considering $\alpha = 0.05$), then the aggregated correlation $CI_{corr}$ is statistically significant.
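The soft aggregation of p-values can be sketched as follows. The specific binomial proportion interval is not recoverable from the text here, so this example assumes the Wilson score interval, which is expressed in terms of the number of successes, the number of failures, and $z$; other intervals (for example, the normal approximation) would fit the description equally well. The function name `significance_share_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def significance_share_ci(p_values, alpha=0.05):
    """Confidence interval for the share of runs with p < alpha.

    Assumption: the Wilson score interval is used as the binomial
    proportion confidence interval, with the same alpha for the
    significance threshold and for the interval's error rate.
    """
    flags = [1 if p < alpha else 0 for p in p_values]  # 1 = significant run
    n = len(flags)
    n_s = sum(flags)   # successes
    n_f = n - n_s      # failures
    z = NormalDist().inv_cdf(1 - alpha / 2)

    centre = (n_s + z ** 2 / 2) / (n + z ** 2)
    half_width = z / (n + z ** 2) * math.sqrt(n_s * n_f / n + z ** 2 / 4)
    return centre - half_width, centre + half_width


# 1,000 runs that are all significant: both bounds end up above 0.95.
ci_low, ci_high = significance_share_ci([0.001] * 1000)
is_significant = ci_low > 0.95 and ci_high > 0.95
```

Under this assumption, a handful of non-significant runs among many significant ones lowers the bounds only slightly, which is the intended "soft" behavior compared with the strict combination methods.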

3) INTERPRETATION OF RESULTS
If the upper and lower bounds of the aggregated p-values $CI_p$ are high (generally more than 0.95), then the correlation $CI_{corr}$ is statistically significant, so we can use the calculated Pearson's or Spearman's coefficient. Several authors have offered guidelines for the interpretation of a correlation coefficient, so we can use the most appropriate one for a particular study, for example, the guidelines by Zou et al. [55]. The strength of the correlation should be defined by the lower bound for a positive correlation and by the upper bound for a negative correlation. Depending on the strength of the correlation, we can draw the following conclusions.
• If $CI_{corr}$ is perfect, then we can confirm that there is no impact of the misclassification bias on the calculation of the indicator, allowing us to achieve a perfect level of correlation between predicted and true indicators.
• If $CI_{corr}$ is strong, then we can confirm that there is a weak impact of the misclassification bias on the calculation of the indicator, allowing us to achieve a strong level of correlation between predicted and true indicators.
• If $CI_{corr}$ is moderate, then we can confirm that there is a moderate impact of the misclassification bias on the calculation of the indicator, allowing us to achieve a moderate level of correlation between predicted and true indicators.
• If $CI_{corr}$ is weak, then we can confirm that there is a strong impact of the misclassification bias on the calculation of the indicator, allowing us to achieve only a weak level of correlation between predicted and true indicators.
• If $CI_{corr}$ is absent, then we can confirm that there is a complete impact of the misclassification bias on the calculation of the indicator, leaving no correlation between predicted and true indicators.
However, considering that assumption 3 is commonly not satisfied in practice (or it is extremely difficult to prove that it is satisfied in a given case), in real-life studies it would be more correct to treat the obtained conclusion as the best case, that is, as the lower estimate of the potential impact of the classification bias on social indicators research.
If the upper or the lower bound of the aggregated p-values is not high (generally less than 0.95), then the correlation is not statistically significant (it might have occurred by chance), and we should not rely on the Pearson's or Spearman's coefficient. In other words, we cannot confirm that there is a correlation between predicted and true indicators, and, consequently, we cannot recommend this algorithm for calculating the indicator based on the available data.

VI. ILLUSTRATIVE EXAMPLE: SUBJECTIVE WELL-BEING
Suppose that the research domain is SWB and the research aim is to construct an SWB indicator based on sentiment analysis of posts from social networks. The proposed simulation approach can then be employed to assess the impact of misclassification bias on the calculation of social indicators. Let us also assume that we are interested in three SWB indicators, calculated based on the sentiment classification of 1,000,000 posts distributed across 36 time intervals into three classes: negative, neutral, and positive. The class distribution is drawn from a real-life dataset of social media posts, RuSentiment [56]: 11.65% negative posts, 64.13% neutral posts, and 24.22% positive posts. The indicators of interest are defined as follows.
1) $SWB_{P2E}$ represents the share of expressed positive emotions relative to all emotions and is defined as
$SWB_{P2E} = \frac{POS}{POS + NEG + NEU}$,
where $POS$, $NEG$, and $NEU$ are the numbers of posts classified as positive, negative, and neutral, respectively. We treat this SWB indicator as an approximation of positive affect, which is defined by ''the extent to which a person feels enthusiastic, active, and alert'' [57].
2) $SWB_{P2PN}$ represents the share of expressed positive emotions relative to the sum of positive and negative emotions and is defined as
$SWB_{P2PN} = \frac{POS}{POS + NEG}$.
We also treat this SWB indicator as one of the possible approximations of positive affect, where the influence of neutral sentiment is not taken into account.
3) $SWB_{P-N2E}$ represents the difference between positive and negative emotions divided by the number of all emotions:
$SWB_{P-N2E} = \frac{POS - NEG}{POS + NEG + NEU}$.
We treat this SWB indicator as an approximation of life satisfaction, which takes into account the difference between positive and negative emotions in relation to the total number of emotions.
For the generation of synthetic time series, we applied the nonlinear autoregressive moving average model from the TimeSynth [58] library with random hyperparameters for each simulation run. We selected five different approaches with different classification algorithms. For each algorithm, we provided confusion matrices for classes $Y = \{negative, neutral, positive\}$, as required by the proposed simulation approach. We chose Pearson's correlation coefficient as the main metric and MAE and MSE as secondary metrics. Since the generated synthetic time series are not stationary, we differenced the series before calculating the correlation coefficient. For each calculated indicator, we ran 50,000$^{6}$ simulation iterations. We performed all the calculations on the supercomputer facilities at HSE University. The university HPC cluster ranks seventh in the TOP50 rating of the most powerful computers in the CIS and helps to solve problems in ML, population genomics, hydrodynamics, atomistic and continuous modeling in physics, generative probabilistic models, financial time series forecasting, and other current problems [59]. However, access to a supercomputer is not a prerequisite for applying this simulation approach. Calculations can be carried out either on a personal computer or in cloud services, for example, Google Colab. We used CPU nodes of the supercomputer for faster development iterations.
VOLUME 10, 2022
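The three indicators can be computed directly from class counts. The sketch below follows the verbal definitions given above (positive share of all posts, positive share of positive plus negative posts, and positive minus negative over all posts); the function names are hypothetical, and the counts are simply the RuSentiment shares applied to 1,000,000 posts.

```python
def swb_p2e(pos, neg, neu):
    """Share of positive posts among all posts."""
    return pos / (pos + neg + neu)


def swb_p2pn(pos, neg):
    """Share of positive posts among positive and negative posts."""
    return pos / (pos + neg)


def swb_pn2e(pos, neg, neu):
    """Difference between positive and negative posts over all posts."""
    return (pos - neg) / (pos + neg + neu)


# RuSentiment class shares applied to 1,000,000 posts.
neg, neu, pos = 116_500, 641_300, 242_200

p2e = swb_p2e(pos, neg, neu)    # 0.2422
p2pn = swb_p2pn(pos, neg)       # approximately 0.6752
pn2e = swb_pn2e(pos, neg, neu)  # 0.1257
```

Because the second indicator ignores neutral posts, its denominator is much smaller than that of the other two, so the same absolute number of misclassified posts shifts it more strongly.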

A. RANDOM CLASSIFICATION ALGORITHM
The random algorithm randomly assigns sentiment classes to posts, thereby representing the worst classification quality. The normalized confusion matrix for this algorithm is defined as follows.
According to the simulation results, the predicted indicator is constant (see Fig. 1), so Pearson's correlation coefficient (see Table 1) is undefined for all indicators because the predicted series has zero variance. Because the correlation is undefined, we cannot confirm that there is a correlation between predicted and true indicators. Consequently, we cannot recommend using this algorithm for calculating the indicator based on the available data.
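The constant predicted indicator can be reproduced with a few lines of Python. This sketch assumes that the random algorithm corresponds to a uniform normalized confusion matrix (every true class is spread evenly over the three predicted classes) and works with expected shares rather than rounded counts; the true class shares per interval are made up for illustration.

```python
import numpy as np

# Assumed confusion matrix of the random algorithm: rows = true classes,
# columns = predicted classes, every entry equal to 1/3.
uniform_cm = np.full((3, 3), 1 / 3)

# Made-up true class shares (negative, neutral, positive) for four intervals.
true_shares = np.array([[0.10, 0.65, 0.25],
                        [0.15, 0.60, 0.25],
                        [0.12, 0.58, 0.30],
                        [0.08, 0.70, 0.22]])

# Expected predicted shares: each row is multiplied by the confusion matrix.
pred_shares = true_shares @ uniform_cm

# Share of positive posts among all posts, true versus predicted.
true_p2e = true_shares[:, 2] / true_shares.sum(axis=1)
pred_p2e = pred_shares[:, 2] / pred_shares.sum(axis=1)

# The predicted indicator does not move, whatever the true indicator does.
pred_is_constant = bool(np.ptp(pred_p2e) < 1e-12)
true_varies = bool(np.ptp(true_p2e) > 0.01)
```

With a constant predicted series, the denominator of Pearson's coefficient (the product of the two standard deviations) is zero, which is why the coefficient is undefined rather than merely small.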

B. POOR CLASSIFICATION ALGORITHM
The poor algorithm classifies all objects with a high level of errors. The normalized confusion matrix for this algorithm is defined as follows. According to simulation results, the aggregated p-values are lower than 0.95 (see Table 1), so the correlation is not statistically significant and we should not rely upon the correlation coefficient. In other words, we cannot confirm that there is a correlation between predicted and true indicators. Consequently, we cannot recommend using this algorithm for calculating the indicator based on available data (see Fig. 1).

C. BASIC CLASSIFICATION ALGORITHM
The basic classification algorithm is a multinomial logistic regression (MLR) method, a common baseline approach for sentiment analysis tasks. As a real-life example, we selected an MLR model presented by Ismail et al. [60]. According to their paper, the normalized confusion matrix for their classification model is defined as follows.
According to the simulation results, the aggregated p-values are higher than 0.95 (see Table 1), so we can draw the following conclusions for the given assumptions, data, algorithm, and conditions under consideration.
• We can confirm that there is a negligible impact of the misclassification bias on the calculation of $SWB_{P2E}$, allowing us to achieve an almost perfect level of correlation between the predicted and true indicators.
• We can confirm that there is a moderate impact of the misclassification bias on the calculation of $SWB_{P2PN}$, allowing us to achieve a moderate level of correlation between the predicted and true indicators.
• We can confirm that there is a weak impact of the misclassification bias on the calculation of $SWB_{P-N2E}$, allowing us to achieve a strong level of correlation between the predicted and true indicators.
However, considering the particulars of assumption 3 mentioned above, in real-life studies it would be more correct to treat the obtained conclusions as best-case estimates.

D. ADVANCED CLASSIFICATION ALGORITHM
The advanced classification algorithm is based on one of the most recent advances in natural language processing, a pre-trained language model. As a real-life example, we selected one of our previously developed models, ruBert-FiT-RuReviews [61], which achieved state-of-the-art results on the RuReviews dataset [62]. According to the paper, the normalized confusion matrix for this classification model is defined as follows.
According to the simulation results, the aggregated p-values are higher than 0.95 (see Table 1), so we can draw the following conclusions for the given assumptions, data, algorithm, and conditions under consideration.
• We can confirm that there is a negligible impact of the misclassification bias on the calculation of $SWB_{P2E}$ and $SWB_{P-N2E}$, allowing us to achieve an almost perfect level of correlation between the predicted and true indicators.
• We can confirm that there is a weak impact of the misclassification bias on the calculation of $SWB_{P2PN}$, allowing us to achieve a strong level of correlation between the predicted and true indicators.
However, considering the particulars of assumption 3 mentioned above, in real-life studies it would be more correct to treat the obtained conclusions as best-case estimates.

E. PERFECT CLASSIFICATION ALGORITHM
The perfect classification algorithm correctly classifies all objects. The normalized confusion matrix for this algorithm is therefore the identity matrix:
$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
According to the simulation results, the aggregated p-values are higher than 0.95 (see Table 1), so we can draw the following conclusions for the given assumptions, data, algorithm, and conditions under consideration. We can confirm that there is no impact of the misclassification bias on the calculation of $SWB_{P2E}$, $SWB_{P2PN}$, and $SWB_{P-N2E}$, allowing us to achieve a perfect level of correlation between predicted and true indicators. However, considering the particulars of assumption 3 mentioned above, in real-life studies it would be more correct to treat the obtained conclusions as best-case estimates.

VII. CONCLUSION
In this paper, we proposed a simulation approach for assessing the impact of misclassification bias on the calculated social indicators. We considered only (1) indicators calculated based on the distribution of classes and (2) the case of multiclass classification. As mentioned earlier, the contributions of this study are five-fold. Firstly, we proposed a simulation approach for assessing the impact of classification bias on the calculated social indicators. This approach can be used for selecting the most appropriate classification model for a particular social indicator and vice versa, as well as for assessing the level of correlation between the true and predicted indicators. Secondly, we defined a formal model of online social data for social indicators research, which can be used further by academics. Thirdly, we proposed a method for approximating the predicted indicator based on the algorithm's confusion matrix and the true indicator. Fourthly, we proposed a method for the aggregation and interpretation of multiple correlation coefficients with p-values. Lastly, we provided illustrative examples of applying the proposed simulation approach and drawing conclusions from the simulation outcomes. Considering that the assumptions used in our model can be violated in practice, the outcomes of the approach for real-life studies should be considered naive and optimistic. However, we believe that this study contributes to the body of knowledge of computational social science and lays the foundation for future research on the impact of misclassification bias on calculated social indicators.
We recommend the following directions for future research on this topic.
• Based on assumption 2, we expected in Eq. (3) that the class distribution in a representative training dataset is equal to the distribution in the data for analysis. For a more accurate assessment, it is possible to consider the class distribution in the training dataset not as fixed ratios but as an interval of possible values. For example, we can calculate a binomial proportion confidence interval for each class share and simulate the true indicator based on these intervals.
• Depending on the particular formula of a social indicator, different types of errors may affect the resulting value of the indicator differently. For example, mutually correcting errors (i.e., misclassification errors that cancel each other out during the indicator calculation) can negatively affect individual-level classification quality and at the same time have no impact on the calculated indicator. A more detailed study of different types of errors and their influence on the calculated indicator may allow researchers to develop a more comprehensive strategy for training classification models.
• Based on assumption 1, we expect that all objects in the training dataset were assigned class labels that completely match the true underlying parameter being annotated. However, given that annotators quite often disagree with each other, especially when working with subjective concepts, it is logical to suppose that a model does not need 100% accuracy on the test subset to match the annotation quality of a human. Thus, further research can focus on what quality of classification on a given dataset can be considered equivalent to that of a human, given the inter-rater agreement metrics for that dataset.
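The notion of mutually correcting errors raised in the second direction above can be illustrated with a toy example. The labels below are made up, and the helper `positive_share` is a hypothetical stand-in for an indicator of the "positive share of all posts" type.

```python
from collections import Counter

# Made-up labels: one positive post is predicted as neutral and one
# neutral post is predicted as positive, so the two errors cancel out.
true_labels = ["pos", "neu", "neg", "pos", "neu", "neu"]
pred_labels = ["neu", "pos", "neg", "pos", "neu", "neu"]

# Individual-level quality drops because two of six posts are misclassified.
accuracy = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)


def positive_share(labels):
    """Share of positive posts among all posts."""
    return Counter(labels)["pos"] / len(labels)


# Aggregate-level quality is unaffected: the class counts are identical.
true_indicator = positive_share(true_labels)
pred_indicator = positive_share(pred_labels)
same_indicator = true_indicator == pred_indicator
```

Here the accuracy is 4/6 while the indicator is reproduced exactly, which shows why individual-level metrics alone may be a poor guide when the goal is an aggregate indicator.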