A Robust Consistency Model of Crowd Workers in Text Labeling Tasks

Crowdsourcing is a popular human-based model for acquiring labeled data. Despite its ability to generate huge amounts of labeled data at moderate cost, it is susceptible to low-quality labels, which can result from unintentional or intentional errors by the crowd workers. Consistency is an important attribute of reliability. It is a practical metric that evaluates a crowd worker's reliability based on the worker's ability to agree with themselves by yielding the same output when repeatedly given a particular input. Consistency has not yet been sufficiently explored in the literature. In this work, we propose a novel consistency model based on the pairwise comparison method. We apply this model to unpaid workers. We measure the workers' consistency on tasks of labeling political text-based claims and study the effects of different duplicate task characteristics on their consistency. Our results show that the proposed model outperforms the current state-of-the-art models in terms of accuracy.


I. INTRODUCTION
Crowdsourcing has an open, collaborative nature and draws on the high availability of ordinary Internet users (crowd workers) [1]. This enables crowdsourcing to provide economical microlabeling solutions [2]. For example, text labeling in computational linguistics costs $1 million per million labels, compared to $380k-$430k when leveraging a crowdsourcing platform [3]. Therefore, many researchers resort to crowdsourcing for labeling, and their research incorporates crowdsourcing in applications such as responding to the Covid-19 pandemic [4] and disasters [5], detecting fake news [6], and deep learning [7], [8]. One major issue in crowdsourcing is quality control [9], [10]. This issue is rooted in the human-based nature of crowdsourcing [11]-[13]. Reliability is one quality concern that examines the crowd workers' trustworthiness. Crowd workers can be unintentionally ill-qualified [13], or they may give incorrect answers intentionally to increase their income. The identification of reliable workers is hence a key issue in any crowdsourcing system. This identification is commonly achieved by evaluating worker output against a gold standard [14]-[16] and by using consensus methods such as majority voting [17]-[20]. Other reliability measurements include worker-based ones that mainly depend on monitoring worker behavior indicators such as interaction events [21], [22], eye tracking [23], or time-based activities [24]. Additionally, reliability can be estimated by measuring the worker's effort in a task [25].
The associate editor coordinating the review of this manuscript and approving it for publication was Zhenyu Zhou.
Consistency analysis is one of these worker-centric reliability measurements. It examines the workers' ability to agree with themselves by giving the same result when repeatedly given the same task. Research on consistency-based (intra-annotator reliability) evaluation, where workers are evaluated on the consistency of their own answers, is ongoing. Such research will open new directions in evaluating crowdsourcing workers and enable further investigations of various factors. These factors include the workers' consistency across their answers, the effects of repeated or difficult tasks on workers' consistency, and the relationship between the accuracy and consistency of workers. Moreover, consistency-based quality control can be compared to traditional approaches for quality control. Despite the interest in consistency as a reliability measurement in various fields such as healthcare, pervasive computing, and machine learning [26], consistency-based quality control in crowdsourcing has not yet been investigated extensively. Inter-annotator reliability, i.e., consistency between workers, is used to evaluate workers by comparing their results with the results from their peers [27]. Targeting intra-annotator consistency, [28] used time limitations and the workers' errors to evaluate the workers in different types of tasks. The authors in [29] studied worker consistency over the long term. Reference [30] used pattern recognition to estimate the consistency of data annotators based on their annotations of similar images. The consistency of relevance judgments is explored in [31], which examined different factors that affect the judgment, such as the distance between the duplicated documents and the topic of the documents.
Other work [32] explored the consistency of participants in three replicated surveys by asking about personal information and motivation. The consistency in [26] is measured using the absolute errors of workers counting objects in duplicate images. Another work applied an inconsistency score measure by duplicating a random set of questionnaire questions [33] and used the weighted Euclidean distance to measure the consistency.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The contributions of this article can be summarized as follows:
• The main contribution is a novel reliability model for crowdsourcing based on the consistency of workers. Its novelty lies in applying the pairwise comparison method instead of traditional distance calculations, and in using multiple duplicates of text labeling tasks rather than a single task duplicate. Moreover, in contrast to previous work, we study a different pool of workers, namely workers with intrinsic motivation rather than paid workers. Furthermore, in terms of performance, our model achieves an average accuracy that outperforms other competing methods [26], [33].
• Other contributions are the results and findings that reveal the consistency level of the unpaid workers and illustrate the effect of different task factors on worker consistency, together with the dataset, which is the first available dataset of consistency in crowdsourcing.
This article is organized into eight sections. Section 2 reviews related work. Section 3 describes the problem formulation. The proposed model is presented in Section 4. Section 5 presents the experimental results. Section 6 analyzes the performance of the proposed model. Section 7 discusses future work. The last section is the conclusion.

II. RELATED WORK
A. WORKER RELIABILITY
Traditional methods for measuring the reliability of crowdsourcing workers include those based on simple human evaluations. Workers are evaluated by ordinary workers acting as independent validators, chosen to assess the answers of other crowdsourcing workers [34]. A more common approach is the majority vote, where multiple workers work on the same tasks and the correct answer is taken as the one with the majority vote [29], [31], [36]. The use of gold standards (ground truths) is another popular approach, where high-quality answers or labels are already known. The reliability of the workers can be evaluated by comparing their answers with the gold standard. These datasets can be created by injecting a few ground truth labels from experts into rich crowd labeling [38] or by gathering a set of experts [14]. The datasets can also be generated automatically from just a few gold unit seeds [15]. Moreover, a real-time system can evaluate crowd workers' reliability using a collected reference set [16].
A recent approach is worker behavior analysis, where the workers' quality is measured by tracing the behavior of the workers as they perform their tasks. The authors in [21] proposed a task fingerprinting approach based on recording sequential logs of interface events that capture what the workers did and when. Similar work was performed in [39], which presented an approach called "Application Layer Monitoring" and studied three time aspects (completion, working phases, and consideration). Other works analyzed the behavior of workers at different times [40] or in terms of personality traits [41].

B. CONSISTENCY BASED MODELS
Consistency-based research in crowdsourcing is still limited. Peer consistency, for example, is used as an alternative to the gold standard for evaluating workers by comparing their results with those of their peers, using a bonus as a motivator [27]. Focusing more on consistency, [28] evaluated the consistency of workers in different types of tasks with time limitations and compared the number of errors made by the workers. Focusing on intra-annotator consistency and the time taken by the workers to complete tasks, [29] found that workers gave consistent answers over long-term settings. The study in [32] found that 30% of participants in a survey were inconsistent when they took the same survey twice. The work in [30] used galaxy image annotation data to study annotator consistency. The authors compared the workers' labels for the same image on a binary scale and recommended their method for enhancing the quality of training data used as input for supervised machine learning algorithms.
Investigating how accurately workers judge the relevance of duplicated documents, [31] found a high level of inconsistency. They studied possible sources of error such as the length of the documents' topics and the distance between documents. They found that a smaller distance leads to higher consistency and assumed that extremely long topics reduce worker inconsistency.
Another study [26] explored consistency as a reliability measurement in crowdsourcing by using the absolute errors of workers counting objects in duplicated images. They studied the effects of different factors on task consistency. They found that, generally, the difficulty of the task decreased the consistency, image transformation had no significant effect on consistency, and increasing the offset between duplicate images decreased the consistency. Other work [33] applied an inconsistency score measure to a psychometric questionnaire. They duplicated a set of questionnaire questions and calculated a weighted Euclidean distance between the workers' duplicated answers, using a Likert scale ranging from 1 to 7. Their method detects only 31% of the invalid responses.
The pairwise comparison method has been extensively considered in various other domains such as operations research, economics, and engineering. Its main application is as a multi-criteria decision-making tool. It supports evaluating decision makers and ranking alternatives [42]. To the best of our knowledge, this is the first work that employs pairwise comparison to measure crowd workers' consistency.
Our work differs from prior work in several dimensions. We implemented a more advanced consistency measurement (the pairwise matrix), studied a text labeling task, used multiple duplicates of the same tasks, and targeted workers with intrinsic motivation rather than paid crowd workers.

III. PROBLEM DEFINITION
The consistency reliability measurement of crowdsourcing workers should be definable. We define the problem of measuring the worker consistency in this section.
The set of workers who participate in the labeling is formally defined by a vector $W = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the total number of workers. These workers process a set of statements formally defined by a vector $S = \{s_1, s_2, \ldots, s_n\}$, where $n$ is the number of statements. Each of these statements has three duplicates. Each statement duplicate $SD$ has a set of characteristics $SD = \{p, d, r\}$, where $p$ is the placement, $d$ the difficulty, and $r$ the rephrasing. The total number of statements is $n \times \mathrm{size}(SD)$. These statements are queued randomly to the workers, who are asked to label the statements. The labels can be binary or, in our case, fall inside a set $L = \{l_1, l_2, \ldots, l_n\}$, where $n$ is the number of labels. Each statement $s$ should have a label $l$ given by a worker $w$.
If we assume that the label is binary, $\{0, 1\}$, with just a single duplicate, then the worker consistency is measured as follows: while worker $w_i$ is still labeling statements, we check whether his/her label of a statement, $l(w_i, s_j)$, matches his/her label of the duplicate, $l(w_i, sd_j)$. If the labels match, then worker $w_i$ is consistent for this statement. Otherwise, he/she is inconsistent. This comparison is repeated until the worker finishes all the statements.
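As an illustration of the binary, single-duplicate case described above, the following minimal Python sketch compares each label with the label of its duplicate. The function name and the sample labels are hypothetical and are not taken from the study data.

```python
def binary_consistency(labels, duplicate_labels):
    """Return one flag per statement: True if the worker gave the
    duplicate the same binary label as the original statement."""
    if len(labels) != len(duplicate_labels):
        raise ValueError("each statement needs exactly one duplicate label")
    return [a == b for a, b in zip(labels, duplicate_labels)]


# Hypothetical labels of one worker for four statements and their duplicates.
original = [1, 0, 1, 1]
duplicate = [1, 1, 1, 0]

flags = binary_consistency(original, duplicate)
share_consistent = sum(flags) / len(flags)
```

Here `flags` marks the statements on which the worker was consistent, and `share_consistent` is the fraction of such statements.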
In our case, there is a scale of labels and three duplicates. The measurement is as follows: while the workers are still labeling statements, for each statement $s_j$ labeled by worker $w_i$, the Pairwise Matrix is calculated using the pairwise errors (differences) between the labels of the statement and its duplicates, $l(w_i, s_j)$, $l(w_i, sd_{pj})$, $l(w_i, sd_{dj})$, and $l(w_i, sd_{rj})$. This gives six differences. These differences and their reciprocals are written as a matrix. The ConsistencyRatio for worker $w_i$ is then calculated and compared with the ConsistencyThreshold. If the consistency ratio is less than the threshold, then worker $w_i$ is consistent; otherwise, he/she is inconsistent. This is repeated for all the workers.
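The construction of the pairwise matrix from the four labels of a statement can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the first two ratings (−3 and 3) follow the worked example given later in the paper, and the remaining two ratings are assumed for illustration. A zero difference is stored as 1, as the paper specifies for perfect agreement.

```python
from fractions import Fraction


def pairwise_matrix(labels):
    """Build a reciprocal matrix from the absolute label differences.

    labels: numeric labels of the original statement and its duplicates.
    The upper triangle holds |l_i - l_j| (or 1 when the labels agree),
    the lower triangle holds the reciprocals, and the diagonal is 1.
    """
    n = len(labels)
    matrix = [[Fraction(1)] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = abs(labels[i] - labels[j])
            value = Fraction(diff) if diff != 0 else Fraction(1)
            matrix[i][j] = value
            matrix[j][i] = 1 / value
    return matrix


# Original, placement, difficulty, and rephrased labels (last two assumed).
labels = [-3, 3, 5, 7]
S = pairwise_matrix(labels)
number_of_differences = len(labels) * (len(labels) - 1) // 2
```

Using `Fraction` keeps the reciprocals exact, which makes the reciprocal relation easy to check before any floating-point step.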
Several factors affect the consistency of the workers, and we ask a few research questions about them. The first factor is the type of the workers. In our study, the workers were unpaid volunteers. We thus ask Research Question 1 [RQ1]: Will unpaid workers achieve higher consistency results?
Other factors are related to the three duplicates, which represent different characteristics of the task performed by the worker. In contrast to [26], we used a text-based task with three defined factors. The first factor is the placement, which is the position of the statement in the queue. The second factor is the difficulty, where we provided less information about the claim and no clear judging rule. The third factor is the phrasing, where we changed the claim by rephrasing the statement. We ask Research Question 2 [RQ2]: What is the effect of each of the three factors on the workers' consistency?

IV. PROPOSED MODEL
The proposed model comprises a few main components that work in an algorithmic manner, as shown in Figure 1. The first component is data collection, where we scraped a fact-checking service and stored the scraped data in a database. The second component is the design of the tasks that will be delivered to the workers later. The third component is our consistency algorithm. The input to this algorithm is the workers' labels, which are already stored in the database. The output of the algorithm is a set of matrices from which the consistency ratios are calculated and then used as input for analysis. These components are described in more detail in the following.

A. DATA COLLECTION
Fake news is a recent phenomenon in social media and requires more fact-checking efforts to counter it [43]. Political data, such as statements from politicians, is one type of data that is susceptible to fake news [44]. We scraped PolitiFact [45]-[47], which is a platform that provides a fact-checking service called Truth-O-Meter, presenting truth ratings of claims from politicians based on investigations by journalists. We randomly selected eighteen statements from three politicians (Barack Obama, Hillary Clinton, and Donald Trump) related to different topics such as personnel matters, taxes, healthcare, and the military. These statements, along with three duplicates each, made up a total of seventy-two statements. The total set of labeled statements comprised 792 statements collected from 11 unpaid volunteer workers. We followed the PolitiFact scale for the truth of each statement. This is a six-level scale representing the degree of truth, namely Extremely false, False, Mostly false, Half true, Mostly true, and True. We selected this scale to serve as a ground truth for later accuracy measurements and for ease of adaptation to our pairwise method design. For the original 18 statements, we tried to balance the categories: there are 3 extremely false, 4 false, 3 mostly false, 3 half true, 3 mostly true, and 2 true statements. Regarding the total of 792 labeled statements, since each of the 18 core statements appears four times (the original and its three duplicates), each of the 11 workers was asked to label 72 statements, categorized as 12 extremely false, 16 false, 12 mostly false, 12 half true, 12 mostly true, and 8 true statements. To share the dataset with the scientific community, we make it publicly available at: https://github.com/fattoh/Politi_Stat.
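The dataset sizes quoted above follow from simple arithmetic, which the short sketch below reproduces. The category counts are taken from the text; the variable names are hypothetical.

```python
# Ground-truth categories of the 18 core statements, as listed in the text.
core_counts = {
    "Extremely false": 3,
    "False": 4,
    "Mostly false": 3,
    "Half true": 3,
    "Mostly true": 3,
    "True": 2,
}

copies_per_statement = 4  # the original statement plus its three duplicates
workers = 11

# Each worker sees every core statement in all four versions.
per_worker_counts = {
    label: count * copies_per_statement for label, count in core_counts.items()
}
statements_per_worker = sum(per_worker_counts.values())
total_labeled_statements = statements_per_worker * workers
```

The totals come out to 72 statements per worker and 792 labeled statements overall, matching the figures reported in the text.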

B. TASK DESIGN
One approach to fact-checking is to fact-check individual claims [43]. Crowdsourcing tasks can be used to classify such claims or statements [44].
The task in the experiment began with a set of guidelines. The presence of such instructions increases the reliability of the workers [48]. An illustrative example was provided as part of these guidelines, since this is a recommended practice [49]. After reading the guidelines, the workers could proceed to label the statements. The statements were shown sequentially, each with a judgment rule that was the same as the rule in the PolitiFact service. The rule gave a summary of facts, statistics, or research studies about the statement to provide the worker with evidence to support their labeling. The last page contained a set of questionnaire questions to collect feedback on the task and experiment.
We created three duplicates for each of the eighteen original statements. The order of the original statements was manually seeded, and the duplicates were then randomly distributed. The original statement (SO) was the raw statement with the judgment ruling. The first duplicate (SD1) was the same as the SO but with a position offset that determined the distance between SO and SD1. This offset was determined randomly. The second duplicate (SD2) was a more difficult task: we replaced the judgment rule with some inconclusive clues about the statement by editing some paragraphs from the statement discussion on the PolitiFact service. The third duplicate (SD3) was a rephrased statement. An example of a statement with its duplicates is shown in Table 1.
The PolitiFact scale is converted to numbers as follows: Extremely false = −7, False = −5, Mostly false = −3, Half true = 3, Mostly true = 5, True = 7. This scale follows the reference scale (the PolitiFact scale) and uses small values according to [50]. We also chose this assignment to ensure larger distances at the scale extremes. This design allows such tasks to be implemented on traditional crowdsourcing platforms such as Amazon Mechanical Turk (AMT). The participants in this experiment were unpaid volunteers. They are PhD candidates in the College of Computer Science at King Saud University. These students were selected from the pool of graduate students who have a reasonable background in crowdsourcing, having performed crowdsourcing tasks before. They were motivated through social interaction in academia as a community-based intrinsic motivation [51]. Regarding bias, a worker's background could affect the judged truthfulness [52]; we expected the workers to be unbiased given their low interest in politics, as they reported in the post-questionnaire.
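The numeric conversion of the PolitiFact scale described above can be expressed as a small lookup table. The sketch below is illustrative only; the names `SCALE` and `label_distance` are hypothetical.

```python
# Numeric values assigned to the six PolitiFact ratings in the text.
SCALE = {
    "Extremely false": -7,
    "False": -5,
    "Mostly false": -3,
    "Half true": 3,
    "Mostly true": 5,
    "True": 7,
}


def label_distance(first, second):
    """Absolute distance between two ratings on the numeric scale."""
    return abs(SCALE[first] - SCALE[second])


adjacent_false = label_distance("False", "Mostly false")
across_midpoint = label_distance("Mostly false", "Half true")
extremes = label_distance("Extremely false", "True")
```

With this assignment, neighboring ratings on the same side of the scale are 2 apart, crossing from "Mostly false" to "Half true" costs 6, and the two extremes are 14 apart.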

C. WORKER INTERFACE
We built our own task website [53] to perform several experiments, as shown in Fig. 2. The experiment mainly studied the consistency of unpaid workers labeling a set of US politicians' claims and how different factors affect the consistency. The website consists of: (i) a set of guidelines to help the worker with the labeling, as shown in Fig. 2(a); (ii) the tasks, comprising the claim statement and the guiding rule, with the labels given as radio buttons, as shown in Fig. 2(b); and (iii) a final post-questionnaire about the difficulties encountered and general comments.

D. PAIR-WISE CONSISTENCY ALGORITHM
Since worker consistency has a more pronounced impact on the annotation task than any other element, it was necessary to inject some random elements into each task before the annotation process was started. Our thorough examination of this research problem revealed that the most impactful factors are the difficulty of the statement, followed by the offset of injected statements, and finally the rephrasing of the statements. The ranking process was further complicated by the fact that some of the statements were qualitative and hence could not be evaluated with fully automated methods. To counter this, a single instance of external knowledge importation was made in which a person with the relevant expertise generated the ground truth matrix that described the impact of every considered statement.
The pairwise comparison method was chosen to evaluate the worker consistency. After the evaluations were made for each pair of statements, the outcome was recorded into the matrix. Because by definition, the difference cannot be taken between each statement and itself, the diagonal dimension of the matrix was populated exclusively with '1' values. Otherwise, the direct comparison indicates if a certain statement was rated by the worker to be more true or false than the statement compared against. For example, if the workers evaluate a statement by giving it the value of S, then this indicates that the statement contributes to only 1/S of the predicted value that the second statement can provide. The entire matrix was populated with paired values obtained in this fashion, allowing for precise understanding of the relative comparison for each statement.
Once this matrix was fully filled, a set of priority vectors was determined through the following mathematical procedure: the maximum combined value of the entire set was estimated based on the matrix eigenvectors, after which the matrix was normalized by dividing each field by the summed value, and the priorities were formulated as vectors, as exemplified in Algorithm 1. The process was cyclical, and direct comparisons were made until all possible pairings of the statements had been exhausted. The algorithm assumes the perspective of an unbiased worker who makes rational evaluations based strictly on the outcomes of the pairings. Because such an annotator would have to make precisely defined choices and aim not to contradict himself/herself, a certain number of statements can be reordered, paraphrased, or rewritten and added to the dataset to measure his/her consistency easily. This analogy allows us to formulate the ground truth matrix in such a way that its consistency is ensured at a high level, although some contradictions may still occur due to various unintended events. For this reason, an earlier developed metric called the consistency ratio (CR) was introduced into the model and used to optimize the matrix. This variable was calculated starting from a simpler measure known as the consistency index (CI), which was compared with a random index (RI) to find the appropriate ratio. RI values were obtained from a simulation of random pairwise matrices [54]. This approach was instrumental for the identification of the eigenvector with maximal value. In effect, this resulted in the creation of a symmetrical matrix that was guaranteed to have the maximum consistency under the circumstances. A realistic limit of CR = 0.1 was adopted according to [54], [55], and any variation that resulted in a value above this limit (CR > 0.1) was eliminated from consideration. The exact process of obtaining CR values from the available data is presented in Algorithm 1.
Example 1: An illustrative example of Algorithm 1. Consider the pairwise consistency matrix $S$, with entries $S_{ij}$, for worker $w$ and statement $s$. As discussed above, the diagonal elements of the matrix must be 1, and the matrix must satisfy the reciprocal relation $S_{ij} = 1/S_{ji}$. To determine the pairwise consistency values in the matrix, we calculate the absolute difference (i.e., ignoring the sign) between the ratings of the two compared statements. The difference is stored in the matrix as $S_{ij}$, and its reciprocal location is set to $S_{ji} = 1/S_{ij}$. If the difference is zero, the value in the matrix is set to 1 to reflect perfect consistency.
For example, consider a worker who rates the original statement $S_1$ as 'Mostly False', which corresponds to −3 on our scale, and rates its offset duplicate $S_2$ as 'Half True', which corresponds to 3. Then $S_{12} = |-3 - 3| = 6$ and $S_{21} = 1/6$. Filling the remaining entries in the same way gives the matrix implied by the column sums reported next:
$$S = \begin{pmatrix} 1 & 6 & 8 & 10 \\ 1/6 & 1 & 2 & 4 \\ 1/8 & 1/2 & 1 & 2 \\ 1/10 & 1/4 & 1/2 & 1 \end{pmatrix}$$
The column sums according to (2) are 1.39, 7.75, 11.50, and 17, respectively. The normalized matrix $T$ is calculated by dividing each element of $S$ by the sum of its column, $T_{ij} = S_{ij} / \sum_{k} S_{kj}$. The priority vector $W$ is then formed from $T$, and we store $l = 4$, the length of this vector (line 7), for later use (line 12). After that, we compute the consistency vector $\mu_T$ by multiplying the pairwise matrix $S$ by the vector $W$ and dividing, row by row, by the weighted sum vector $w$ (lines 8, 9, 11), according to (5). $\lambda_{max}$ is calculated using $\arg\max \mu_T$ (line 10); following (6), $\lambda_{max} = 4.074$, which is close to $l$.
Finally, we compute $C_{index}$ and $W_{cons}$ as:
$$C_{index} = \frac{\lambda_{max} - l}{l - 1} \quad (7)$$
$$W_{cons} = \frac{C_{index}}{RI} \quad (8)$$
where $W_{cons}$ is the consistency ratio CR that determines, by comparison with the threshold, whether the worker is consistent on this statement. Accordingly (lines 13, 14), (7) and (8) give $C_{index} = (4.074 - 4)/(4 - 1) = 0.0247$ and $W_{cons} = 0.0247/0.9 = 0.0274$, where 0.9 is the random index for our case, corresponding to a 4 × 4 matrix ($n = 4$) in [54]. For this statement (matrix $S$), the worker has $W_{cons} = 0.0274$, which is ≤ 0.1, our threshold for every statement ($\beta$ = realistic limit of CR). This indicates that the worker is consistent for this statement (lines 15-18). For the total consistency of this worker, the threshold $\beta$ is the average $W_{cons}$ of all workers over all 72 statements.
The time complexity of the proposed algorithm is not a major concern. The summation of the columns of matrix $S$ (lines 2, 3) is $O(n^2)$. Obtaining the matrix $T$ by normalizing (averaging) $S$ (line 4) is also $O(n^2)$. The multiplication of a matrix by the vector $W$ (line 5) is $O(n^3)$. Finally, computing the eigenvector $\mu_T$ (lines 7-9) is $O(n^4)$, since it requires $l \cdot n \cdot n \cdot l$ operations and $n \approx l$. The running time is therefore $2n^2 + n^3 + n^4$, and consequently the time complexity is $O(n^4)$.
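The arithmetic of Example 1 can be checked with the short sketch below. It rests on assumptions that are not stated explicitly in the text: the entries of S other than S_12 = 6 are inferred from the reported column sums, the priority vector is taken as the row mean of the normalized matrix, and λ_max is approximated by the mean of the ratios (S·W)_i / W_i, which is a common approximation in pairwise comparison methods. The function name is hypothetical. Under these assumptions the sketch gives values close to the reported λ_max = 4.074 and W_cons = 0.0274.

```python
def consistency_ratio(S, random_index):
    """Return (column sums, lambda_max, consistency index, consistency ratio).

    Assumes row-mean priorities and a mean-of-ratios estimate of lambda_max.
    """
    n = len(S)
    # Column sums of the pairwise matrix.
    col_sums = [sum(S[i][j] for i in range(n)) for j in range(n)]
    # Normalize each column so that it sums to 1.
    T = [[S[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    # Priority vector: mean of each row of the normalized matrix.
    W = [sum(row) / n for row in T]
    # Ratio of (S @ W)_i to W_i for each row.
    ratios = [sum(S[i][j] * W[j] for j in range(n)) / W[i] for i in range(n)]
    lambda_max = sum(ratios) / n
    c_index = (lambda_max - n) / (n - 1)
    return col_sums, lambda_max, c_index, c_index / random_index


# Matrix consistent with the column sums 1.39, 7.75, 11.50, and 17.
S = [
    [1, 6, 8, 10],
    [1 / 6, 1, 2, 4],
    [1 / 8, 1 / 2, 1, 2],
    [1 / 10, 1 / 4, 1 / 2, 1],
]

col_sums, lambda_max, c_index, w_cons = consistency_ratio(S, random_index=0.9)
is_consistent = w_cons <= 0.1  # threshold used in the paper
```

Small differences in the last digit relative to the reported values come from rounding intermediate results in the text.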
To study the effects of placement/offset, difficulty, and rephrasing, we used the pairwise comparison method in Algorithm 1 with n = 3. For each factor investigated, we excluded the factor to study the effect of its exclusion. For example, to study the effects of difficulty, we excluded the difficult statements from the matrix, computed the consistency index and ratio, and then compared them with the consistency index and ratio of the complete 4 × 4 matrix. This procedure was repeated with the offset and rephrased statements.
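The exclusion procedure described above can be sketched by deleting the row and column of one duplicate and recomputing the ratio on the remaining 3 × 3 matrix. Several details below are assumptions for illustration: the matrix is the one inferred for Example 1, the index order (original, placement, difficulty, rephrasing) follows the order in which the duplicates are introduced, the random index 0.58 for a 3 × 3 matrix is the commonly tabulated value (the paper only states 0.9 for the 4 × 4 case), and the ratio uses the same row-mean approximation as a common simplification.

```python
def consistency_ratio(S, random_index):
    """Approximate consistency ratio using row-mean priorities."""
    n = len(S)
    col_sums = [sum(row[j] for row in S) for j in range(n)]
    W = [sum(S[i][j] / col_sums[j] for j in range(n)) / n for i in range(n)]
    ratios = [sum(S[i][j] * W[j] for j in range(n)) / W[i] for i in range(n)]
    lambda_max = sum(ratios) / n
    return (lambda_max - n) / (n - 1) / random_index


def without_factor(S, k):
    """Drop row k and column k, i.e. exclude one duplicate from the matrix."""
    return [
        [value for j, value in enumerate(row) if j != k]
        for i, row in enumerate(S)
        if i != k
    ]


S = [
    [1, 6, 8, 10],
    [1 / 6, 1, 2, 4],
    [1 / 8, 1 / 2, 1, 2],
    [1 / 10, 1 / 4, 1 / 2, 1],
]
FACTOR_INDEX = {"placement": 1, "difficulty": 2, "rephrasing": 3}  # assumed order
RANDOM_INDEX = {3: 0.58, 4: 0.9}  # 0.58 is an assumed, commonly tabulated value

full_ratio = consistency_ratio(S, RANDOM_INDEX[4])
reduced_ratios = {
    name: consistency_ratio(without_factor(S, k), RANDOM_INDEX[3])
    for name, k in FACTOR_INDEX.items()
}
```

Comparing each entry of `reduced_ratios` with `full_ratio` mirrors the comparison the paper makes between the reduced matrices and the complete 4 × 4 matrix.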

V. EXPERIMENTAL RESULTS
A. EXPERIMENTS SETTING
All experiments were implemented on a PC with an Intel Core i7-3770 CPU @ 3.40 GHz and 12 GB of memory. The development tools were the Python 3.8 language, the Django web framework, and an SQLite database.

B. GENERAL OBSERVATIONS
To evaluate the overall performance of the workers, we compared the average performance of their labeling against the ground truth. We used the Mean Absolute Error (MAE) and a consensus-based measurement to measure worker accuracy. For the MAE measurement, we calculated the mean absolute error/distance between the ground truth and the label of each statement (original and duplicates) as follows:
$$MAE(s) = \frac{1}{n} \sum_{i=1}^{n} |GT(s) - l(s_i)| \quad (9)$$
where GT is the ground truth, $l(s)$ is the statement label given by a worker, and $n$ is the number of the statement's duplicates, $n = 4$. Then, for each worker, we calculated an accuracy score as the mean of the MAE across all 18 core statements via
$$\overline{MAE}(w) = \frac{1}{m} \sum_{j=1}^{m} MAE(s_j) \quad (10)$$
where $m$ is the number of core statements. Consensus-based measurements range from simple majority voting by the workers up to complicated statistical and machine-learning models [37]. These methods are mainly helpful in cases where the ground truth is absent [56]. We used the consensus measure proposed by [57].
We scored the workers based on the absolute difference between the worker's label of a statement and the median label of all other workers for the same statement:
$$\text{Consensus Score}(w, s) = |l(w, s) - \mathrm{median}(l(w_1, s), l(w_2, s), \ldots, l(w_m, s))| \quad (11)$$
where $m$ is the number of workers. Subsequently, the worker's score is the mean of all statements' scores in (11):
$$\text{Consensus Score}(w) = \mathrm{mean}(\text{Consensus Score}(w, s_1), \ldots, \text{Consensus Score}(w, s_n)) \quad (12)$$
where $n$ is the total number of statements.
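The two accuracy measures described above can be sketched in a few lines of Python. This is a minimal illustration using hypothetical labels on the paper's numeric scale; the function names are not from the paper, and the consensus score follows the prose description (median of the other workers' labels).

```python
from statistics import mean, median


def statement_mae(ground_truth, labels):
    """Mean absolute error between the ground truth and a worker's labels
    for one core statement (original plus duplicates)."""
    return mean(abs(ground_truth - label) for label in labels)


def worker_accuracy(per_statement_mae):
    """Mean of the per-statement MAE values for one worker."""
    return mean(per_statement_mae)


def consensus_score(worker_label, other_labels):
    """Absolute difference between a worker's label and the median label
    given by the other workers to the same statement."""
    return abs(worker_label - median(other_labels))


# Hypothetical data: ground truth 5, four labels by one worker.
mae_example = statement_mae(5, [5, 3, 5, 7])
accuracy_example = worker_accuracy([mae_example, 2.0])
consensus_example = consensus_score(3, [5, 5, 7])
```

A worker-level consensus score would then be the mean of `consensus_score` over all statements, in the same way that `worker_accuracy` averages the per-statement MAE.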
As a further measurement, we establish a consistency baseline measure Cb [26], where Cb is the absolute error/difference between the worker's label (on the scale) of the original statement and the worker's label (on the scale) of the duplicate statement. In this work, we defined three baseline measures for each worker, one per duplicate type:
$$Cb_p = |l(w, s) - l(w, sd_p)| \quad (13)$$
$$Cb_d = |l(w, s) - l(w, sd_d)| \quad (14)$$
$$Cb_r = |l(w, s) - l(w, sd_r)| \quad (15)$$
The baseline consistency measure is the sum of these three measurements:
$$Cb_{total} = Cb_p + Cb_d + Cb_r \quad (16)$$
We calculated the mean baseline consistency and the mean consistency for the three duplicates for each worker. We used the mean of the MAE and the consensus score of the workers for all the statements, as shown in Fig. 3. We observed an approximate uniformity of performance in the mean MAE, consensus score, and ground truth across all 72 statements. This gives a general indication of the quality of the workers' labeling. There was no random labeling and there were no extreme differences in labeling that affected the average performance. It was expected that unpaid workers would label more honestly, consistent with [2], [58]. We also observed that there were no extreme judgments of Extremely false or True, and the workers always labeled away from the extreme judgments. This could be explained by their diminished confidence, as they were not 100% sure about the truth of each statement.
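The baseline measure Cb defined above reduces to three absolute differences and their sum. The sketch below is a minimal illustration with hypothetical labels; the function name is an assumption.

```python
def baseline_consistency(original, placement, difficulty, rephrased):
    """Return (Cb_p, Cb_d, Cb_r, Cb_total) for one statement of one worker.

    Each argument is the worker's numeric label for the original statement
    or for one of its three duplicates.
    """
    cb_p = abs(original - placement)
    cb_d = abs(original - difficulty)
    cb_r = abs(original - rephrased)
    return cb_p, cb_d, cb_r, cb_p + cb_d + cb_r


# Hypothetical labels on the numeric scale used in the paper.
cb_p, cb_d, cb_r, cb_total = baseline_consistency(-3, 3, 5, 7)
```

A perfectly consistent worker would obtain a total of 0 for every statement, and larger totals indicate larger disagreement with their own original label.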

C. CONSISTENCY SCORES
With respect to RQ1, we found that, as expected, unpaid workers achieved very high consistency scores. From Fig. 4, we observed that all workers attained very low consistency ratios compared to the realistic limit of 0.10. Even the worker with the worst score, worker 8, had a score of around 0.0183, which is still far below the limit. This means that all workers achieved high consistency.
Regarding inter-annotator consistency, i.e., the consistency between workers, since we have more than two workers, we used Fleiss' kappa, which is a measure of agreement among multiple workers:
$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \quad (17)$$
where the numerator is the degree of agreement actually achieved above chance and the denominator is the degree of agreement attainable above chance. Our result is κ = 0.1, which indicates slight agreement. This could be explained by the tendency of Fleiss' kappa to underestimate agreement when assessing high levels of inter-rater agreement, as argued in [59]. The high levels of agreement in our results are shown in Figure 3, where the average consensus score is near the ground truth for most statements. Moreover, when we reduced the scale to binary [True, False] by merging the categories of the scale, κ rose to 0.27, which indicates fair agreement.
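Fleiss' kappa in (17) can be computed from a table of category counts per statement. The sketch below is a minimal pure-Python illustration with hypothetical counts, assuming that every statement is rated by the same number of workers; the function name is not from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    workers who assigned statement i to category j."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every statement must have the same number of ratings")

    # Proportion of all ratings that fall in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each statement.
    P_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    P_bar = sum(P_i) / n_subjects
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)


# Hypothetical binary tables with three raters and two statements.
kappa_perfect = fleiss_kappa([[3, 0], [0, 3]])
kappa_split = fleiss_kappa([[2, 1], [1, 2]])
```

Merging the six scale categories into a binary True/False table, as done in the text, only changes how the `counts` rows are built; the computation itself is unchanged.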

D. EFFECTS OF THE FACTORS
In this section, we present the results related to RQ2. The effects of the three factors on consistency were explored by comparing the overall mean consistency ratio against the means obtained when each factor was absent. As the consistency ratios of all workers were skewed toward zero, we normalized the data for all the ratios. We found the mean consistency ratio for all duplicates from all the workers to be 0.241. The mean consistency ratio of the pairwise matrix without the placement duplicates was 0.246, which is very slightly larger than the overall mean consistency ratio. This indicates that the absence of placement duplicates did not have any noticeable negative effect on the consistency of the workers.
The consistency across the duplicated statements highlights the honest labeling by the unpaid workers.
Regarding the difficulty, the mean consistency ratio for all workers without the difficult statements was 0.224, which was less than the mean overall ratio. This indicates that the absence of difficult duplicates increased the consistency of the workers. This was expected because the difficulty of the task could have led to differing impressions about the truth of the statement, and consequently the labeling.
Regarding rephrasing, the mean consistency ratio for all workers in the absence of the rephrased statements was 0.207. This indicates that rephrasing affected consistency in the same direction as difficulty. This was unexpected. We suspect that the rephrasing of the statements led to a disruption in the judgement, and hence the labeling, of a worker. All of these results are shown in Fig. 5.
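The comparison in this section is simple arithmetic on the reported means, which the following Python sketch reproduces. The variable and function names are assumptions made for illustration; only the four mean values come from the text.

```python
# Mean consistency ratios reported in this section.
overall_mean = 0.241
mean_without = {"placement": 0.246, "difficulty": 0.224, "rephrasing": 0.207}


def shift_when_absent(overall: float, without_factor: float) -> float:
    """Change in the mean consistency ratio once a factor's duplicates are removed."""
    return without_factor - overall


shifts = {
    name: round(shift_when_absent(overall_mean, value), 3)
    for name, value in mean_without.items()
}

for name, shift in shifts.items():
    # Absolute shift and shift relative to the overall mean.
    print(f"{name}: {shift:+.3f} ({shift / overall_mean:+.1%})")
```

A negative shift means that the ratio dropped when the factor was removed, which is the pattern reported for difficulty and rephrasing, while the shift for placement is slightly positive.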

E. RELATIONSHIP BETWEEN CONSISTENCY AND ACCURACY
To investigate the relationship between the workers' consistency and accuracy, we used the Pearson correlation coefficient r. We tested the correlation between the accuracy measure (the mean MAE (10)) and the consistency measure (the mean of W_cons (8)) of the workers. We found a correlation coefficient of r = 0.57 with p < 0.07. This indicates a marginally significant positive relationship between accuracy and consistency. The positive correlation was expected in our experiment given the high accuracy and consistency scores achieved by the unpaid workers. Moreover, we tested the correlation between the mean MAE (10) and the consistency differences score, which serves as the consistency baseline measure C_b (16). We found r = 0.54 with p = 0.088. This agrees with the previous result: a worker with larger differences in their ratings (i.e., less consistency) was likely to have larger errors with respect to the ground truth.
We also studied the statistical relationship between the mean consensus score (12) and the mean MAE (10) of the workers, using the r coefficient to estimate it. We found r = −0.67 with p < 0.05. This negative correlation indicates that, as expected, workers with high consensus scores have fewer errors with respect to the ground truth.
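The correlation tests in this section can be sketched in Python as follows. This is an illustration under assumptions: the function names and the per-worker values are hypothetical and are not the study's data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equally long samples with at least 3 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r: float, n: int) -> float:
    """t value for testing r != 0, with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical per-worker values, for illustration only.
mean_mae = [0.9, 1.4, 1.1, 1.8, 0.7]
mean_consensus = [0.8, 0.5, 0.7, 0.3, 0.9]

r = pearson_r(mean_mae, mean_consensus)
print(f"r = {r:.2f}, t = {t_statistic(r, len(mean_mae)):.2f}")
```

The p-value follows from the t distribution with n − 2 degrees of freedom. A statistics library such as SciPy (`scipy.stats.pearsonr`) returns the coefficient and the p-value together.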

VI. PERFORMANCE ANALYSIS
A. RELIABILITY ANALYSIS
We measured the reliability of our experiment with respect to the selected scale through the internal consistency of the scale. We used Cronbach's alpha [60]:
$$\alpha = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K} \mathrm{var}(Y_i)}{\mathrm{var}(X)}\right)$$
where K is the number of core statements, which is 18, var(Y_i) is the variance of the workers' labels for statement i, and var(X) is the variance of the total labeling. The α in our experiment was 0.76, which indicates good internal consistency.
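As a rough illustration of this reliability measure, the following Python sketch computes Cronbach's alpha from a worker-by-statement label matrix. The function name, the matrix layout, and the toy labels are assumptions; the study's labels are not reproduced here.

```python
from statistics import variance  # sample variance
from typing import Sequence


def cronbach_alpha(labels: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha (hypothetical helper).

    labels[w][i] is the label that worker w gave to statement i.
    """
    k = len(labels[0])  # number of statements (K)
    if k < 2 or len(labels) < 2:
        raise ValueError("need at least 2 statements and 2 workers")

    # var(Y_i): variance of the workers' labels for each statement.
    item_variances = [variance([row[i] for row in labels]) for i in range(k)]

    # var(X): variance of each worker's total over all statements.
    total_variance = variance([sum(row) for row in labels])

    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data (assumed): 3 workers labeling 3 statements on an ordinal scale.
toy_labels = [[1, 2, 3], [2, 3, 4], [4, 4, 5]]
print(round(cronbach_alpha(toy_labels), 3))
```

The sample variance is used for both terms. Using the population variance for both would give the same alpha, because the normalizing factor cancels in the ratio.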

B. ACCURACY ANALYSIS
To evaluate the performance of our model, given the unavailability of consistency benchmark datasets and the lack of works studying consistency, we used the methods of [26] and [33] as baselines for comparison on our dataset. Williams et al. [26] introduced a method to calculate the consistency of 402 crowdsourcing workers. They created a dataset of 30 images, and the task was to count the number of objects in each image. In each task, a worker counted the objects in 10 images, two of which served as a consistency probe (the same image with a modification). Naderi et al. [33] presented a survey containing 74 items, which was conducted with a total of 256 participants. They measured consistency using some randomly selected items that were asked twice in the questionnaire. We compared our method, which uses pairwise differences, against [26] and [33]. Williams et al. calculated the absolute difference between a worker's outputs for the original task and its duplicate, where the output is the number of objects counted in an image. Similarly, Naderi et al. calculated the differences between a worker's answers to the same questionnaire item using the weighted Euclidean distance. Their weights were calculated using the responses of all workers, which corresponds to the consensus score (12) in our methodology.
For the accuracy comparison, we first calculated, for each worker and across all statements, the difference between the original statement and the duplicated one. We used the pairwise difference in our case, the absolute difference in the case of Williams et al., and the weighted Euclidean distance in the case of Naderi et al. Then, we calculated the average difference in each case and used it as a threshold. After that, the accuracy of each method for each statement was determined based on the threshold. Finally, the average accuracy of each worker for each method was calculated from his/her accuracy over all statements. Fig. 6 shows the accuracy of all workers for each method. It illustrates that our method achieves higher accuracy than the baselines for almost all the workers. On average, our model achieved 73% accuracy, which surpassed the 61% of Williams et al. [26] and the 67% of Naderi et al. [33].
FIGURE 6. Accuracy comparison, our method vs. [26], [33].
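The first steps of this comparison can be sketched in Python for the absolute-difference baseline. This is only an illustration: the function names, the toy labels, and the rule that a difference at or below the threshold counts as consistent are assumptions, since the text does not spell out how accuracy is derived from the threshold.

```python
from typing import List, Sequence, Tuple


def absolute_differences(
    original: Sequence[float], duplicate: Sequence[float]
) -> List[float]:
    """Per-statement absolute difference between a worker's label for the
    original statement and the label for its duplicate."""
    if len(original) != len(duplicate):
        raise ValueError("each original label needs exactly one duplicate label")
    return [abs(a - b) for a, b in zip(original, duplicate)]


def flag_by_mean_threshold(diffs: Sequence[float]) -> Tuple[float, List[bool]]:
    """Use the average difference as the threshold.

    Assumption: a pair counts as consistent when its difference does not
    exceed the threshold.
    """
    threshold = sum(diffs) / len(diffs)
    return threshold, [d <= threshold for d in diffs]


# Toy labels (assumed) of one worker for four statements and their duplicates.
original_labels = [1, 3, 5, 2]
duplicate_labels = [1, 4, 2, 2]

diffs = absolute_differences(original_labels, duplicate_labels)
threshold, consistent = flag_by_mean_threshold(diffs)
print(diffs, threshold, consistent)
```

The same thresholding step would apply to the other two difference measures once their per-statement differences are available.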

VII. DISCUSSION AND FUTURE WORK
Measuring workers' reliability in crowdsourcing is a major challenge. Studying the level of consistency in their answers sheds light on their performance, and consequently their reliability. In this work, we studied the consistency of unpaid workers using a pair-wise comparison method to measure their internal consistency in rating the truthfulness of textual political statements. The effects of three different characteristics were examined in our experiment, namely the placement of the statements, the difficulty of the task, and the rephrasing of the statements.
Generally, unpaid workers perform repeated tasks in a consistent manner. This is expected because workers who are intrinsically motivated do well in crowdsourcing [2], [58], for example in citizen science. An important result of this study is the consistency score of the workers. Our model obtained more accurate results than the baselines [26], [33]. This can be attributed to the mathematical robustness of the pairwise comparison method compared to the limited approach of calculating the absolute differences of errors or weighted distances.
We compared our results for the effects of each characteristic with the corresponding results from previous works, which differ from this work in terms of the measurements used, the pool of workers (unpaid vs. paid), and the type of task (text rating vs. counting objects in images). In our experiment, placement did not affect the consistency ratio. This could be because the completion time of the tasks included long breaks, as reflected in the post-questionnaire responses. Hence, placement-related effects such as fatigue [61] would have no impact. The results for the effects of difficulty are similar to those in prior work: task difficulty affected the workers' consistency negatively. This is expected because inconsistent results occur even in the absence of difficulty. Our results are consistent with the relationship between task difficulty and reliability found by [62].
Finally, rephrasing had the same effect as difficulty in our experiment. This differs from previous works. A possible explanation is the confusion caused by the modified texts, which was absent in the comparisons between image-transformation duplicates in previous works.
An additional observation is that because our task was about political statements, the workers' reliability could be vulnerable to the bias effect [44], [52]. This is true to a large extent for the workers' accuracy but not their consistency. We mitigated the bias effects by omitting the name of the politician who made the statement. Furthermore, the workers did not have a major interest in political affairs of the US. This matched our expectation, and was further confirmed by their answers on the post-questionnaire.
We expected other effects such as recognition. This was clarified by the answers given by the workers in the questionnaire which indicated that they suspected that some statements were duplicated. The tasks could therefore be susceptible to recognition which might result in workers changing their earlier answers. We mitigated these effects by disallowing the workers from going back to earlier tasks to ensure that the workers moved forward in the tasks, even when they were suspicious of the similarity of the statements.
Regarding the limitations of this study, we plan to extend our work to more crowdsourcing settings. Implementing our consistency measurement for paid workers would be interesting future work, since crowdsourcing platforms such as AMT have an abundance of paid workers. Another promising direction is to record different performance characteristics such as workers' time per task, hover time, out-of-focus time, scrolling, and answer switching. Such extensions will be the cornerstone for modeling and developing machine learning algorithms that predict worker consistency. The correlation between accuracy and consistency can also be investigated further because workers can be consistent but not accurate. Other effects such as learning and fatigue can be studied. This, together with studying paid workers, will enrich the research on crowdsourcing and facilitate consistency measurement for more types of workers, such as spammers and Sybils.

VIII. CONCLUSION
In this study, we propose a new model for measuring the consistency of unpaid workers in crowdsourcing. Our experiment studied how workers labeled the truthfulness of duplicate political claims. We assessed their consistency and studied the effects of different characteristics. Our results show that the volunteer workers achieved high consistency scores. The accuracy of our model outperformed the state-of-the-art methods. Future work includes implementing our model for paid workers in a featured crowdsourcing platform. Another future work is to extend this consistency study to include worker features. This will help in the development of models for machine learning techniques and for predicting worker consistency and reliability.
He has authored several papers in the refereed IEEE/ACM/Springer journals and conferences. His research interests include social media analysis, data analytics and mining, social computing, information credibility, and cyber security. He is a Student Member of ACM.
MUHAMMAD IMRAN (Member, IEEE) received the Ph.D. degree in information technology from the Universiti Teknologi PETRONAS, Malaysia, in 2011. He is currently an Associate Professor with the College of Applied Computer Science, King Saud University, Saudi Arabia. His research was financially supported by several grants. He has completed a number of international collaborative research projects with reputable universities. He has published more than 250 research papers in peer-reviewed and well-recognized international conferences and journals. Many of his research articles are among the highly cited and most downloaded. His research interests include the Internet of Things, mobile and wireless networks, big data analytics, cloud computing, and information security. He has been consecutively awarded with the Outstanding Associate Editor of IEEE ACCESS, in 2018 and 2019, besides many others. He served as the Editor-in-Chief for the EAI Endorsed Transactions on Pervasive Health and Technology. He also serves as an Associate Editor for top ranked international journals, such as the IEEE Communications Magazine, the IEEE NETWORK, Future Generation Computing Systems, and IEEE ACCESS. He has served or is serving as a Guest Editor for about two dozen special issues in journals, such as the IEEE Communications Magazine, the IEEE Wireless Communications Magazine, Future Generation Computing Systems, IEEE ACCESS, and Computer Networks. He has been involved in about 100 peer-reviewed international conferences and workshops in various capacities, such as the Chair, the Co-Chair, and the Technical Program Committee Member.
VOLUME 8, 2020