What Quality Control Mechanisms Do We Need for High-Quality Crowd Work?

Crowdsourcing and human computation have become a mainstay for many application areas that seek to leverage the crowd in the development of high-quality datasets, annotations, and solutions to problems beyond the reach of current AI. One of the major challenges in the domain is ensuring high-quality and diligent work. In response, the literature offers a large number of quality control mechanisms, each claiming (sometimes domain-specific) benefits and advantages when deployed in large-scale human computation projects. This creates a complex design space for practitioners: it is not always clear which mechanism(s) to use for maximal quality control. In this article, we argue that this decision is perhaps overinflated, and that provided there is "some kind" of quality control that is clearly made known to crowd workers, this is sufficient for "high-quality" solutions. To evidence this, and to provide a basis for discussion, we undertake two experiments that explore the relationship between task design, task complexity, quality control, and solution quality. We do this with tasks from natural language processing and image recognition, of varying complexity. We illustrate that minimal quality control is enough to repel constantly underperforming contributors, and that this holds across tasks of varying complexity and format. Our key takeaway: quality control is necessary, but how it is implemented seemingly is not.

(seeking to ensure high-quality solutions). This article therefore recognizes that there is a wealth of choice for researchers and requesters [21], but that this choice adds complexity to the design and implementation of crowd work. We note that several researchers have highlighted shortcomings in the literature (see Related Work) that we seek to either address or provide more experimental context for. We do so by proposing the following research questions (RQ): […] In RQ1, the working hypothesis is that the format of the quality control mechanism is critical for achieving specific notions of quality. Building on this, RQ2 attempts to disentangle questions surrounding when to inject quality control mechanisms into the task. We seek to experimentally address the impact of differing quality control methods on contributors' response quality (RQ1). Also, reflecting on [22], we explore the role of task and interface design in the quality and accuracy of the tasks (RQ2).

Aligned to the two research questions, we present two factorial-design experimental studies (see Study Design). The first, a 3 × 5 design, explores 3 different task complexities in language processing using 5 quality control methods. It emphasises the impact of quality control mechanisms vs. task complexity when considering response quality. The second, a 3 × 2 × 2 design, explores 3 quality control treatments with 2 information-highlighting approaches and 2 task orderings within a simple image recognition task. It emphasises how aspects of task design impact response quality.

In undertaking these experiments, we make the following observations (see Results).
1) Consistently underperforming workers were repelled by the simple announcement of a quality control mechanism, regardless of what that mechanism was, or in fact whether one was actually present at all.

2) There was no statistically significant difference between the quality control mechanisms applied.

3) Subtle considerations in the task design (e.g. making key text bold, and the order in which tasks are performed) are more impactful on quality than the effect associated with a quality control mechanism.

Considering these observations, we argue that the wealth of choice for researchers and requesters in achieving quality control [21] only adds complexity to the design and implementation of crowd work. Instead, we provide a set of recommendations for practitioners based on the following contributions (see Discussion and Conclusions):

1. Quality control mechanisms: we highlight that in some cases, the presence of a quality control measure alone is sufficient to ensure high(er)-quality solutions. This is key for crowd requesters: as Rzeszotarski and Kittur [21] note, requesters must make difficult trade-offs depending on the quality control method they use, yet our results illustrate that this may be an over-emphasized issue, as we could not observe discernible differences between increasingly more sophisticated measures (RQ1 and experiment 1). Similarly, as Difallah et al. [23] suggest, discouraging low-quality work (or "cheaters") is better than controlling the quality of results.

2. Task design: via RQ2 and experiment 2, we provide insights into task design aspects and their relationship with observable differences in quality. Newell and Ruths [24] state that intertask effects could create a systematic bias (if left unchecked), and they note the importance of task design.

Quality control for crowd platforms is a highly-studied phenomenon, as quality is a major attribute of the crowd […]. Users were first asked verifiable, quantitative questions and then to rate the article. They also provided 4-6 keywords as a summary for the article. The results of the subsequent experiment demonstrated a significant positive correlation between the workers' ratings and the Wikipedia admin ratings. The combined findings indicate the utility of combining objective and subjective tasking in micro-task markets [40].

A recognised attribute of crowdsourcing platforms is that the platforms neither identify workers nor guarantee the quality of the work, which can contribute to the unreliability of the system [23], [41]. In their work, Difallah et al. categorized "cheaters" a priori and a posteriori and discussed anti-adversarial techniques for countering them. They suggest sophisticated task formulation as a suitable obstacle for cheaters. A requester's main goal is receiving high-quality, completed work; thus, discouraging "cheaters" from attempting a task in the first place is more compatible with this goal than controlling the quality of completed tasks. However, more sophisticated or complicated task structuring increases the burden on the requester. They propose traditional anti-spamming techniques such as CAPTCHA as sufficient barriers to "cheaters". Several common approaches for quality control exist; these are discussed next.

Pre-selection mechanisms have two main branches, differentiated as "up-front task design" and "post-hoc result analysis" [42], to control work quality in a crowdsourcing context. Researchers have utilized various techniques to apply pre-selection methods. Crowdsourcing platforms generally provide a mechanism for requesters to pre-select contributors based upon specific task requirements or requester preferences. Geiger et al. [43] typify pre-selection as "a means of ensuring a minimum ex-ante quality level of contributions." Otherwise stated, a requester uses a pre-selection process, such as a test, as a risk-mitigation technique against poor-quality solutions. Namely, requesters screen potential contributors based upon the demonstration of certain knowledge, skills, or attributes via a platform-specific process.

Pre-selection is typically performed via multiple-choice tests, which Oleson et al. [12] examined and subsequently criticized due to a faulty key assumption: that if a contributor passes the test, they will then perform the task well even in the absence of direct or tangible incentives to do so. Likewise, contributors who fail the test may be banned from the task, though not necessarily for the right reasons. Gadiraju et al. [44] found that identifying workers' behavioural traces can help with classifying workers into different types, which then significantly improves the quality of the work produced. This improvement was more pronounced in high-complexity tasks.

Self-assessment as a pre-selection technique has produced promising results, providing a strong indicator of workers' competence and potential performance [45]. This method is simple to implement and has been found to perform well. Because this design requires additional unremunerated effort and the demonstration of credentials in advance, pre-selection via qualification tests also likely acts as a barrier to "spammers" [45]. This is, however, a double-edged sword, as diligent contributors may also not select the task due to the increased unremunerated effort or missing credentials on their part. Answers to qualification tests or generic credentials may also be shared amongst users, which reduces the effectiveness of the QA method [46], [47]. The authors of [51] propose proxying trustworthiness based on prior experience. It is reported that worker-requester trust has a positive impact on the reliability of crowd work [19]. The authors suggest that one way of enhancing worker-requester trust is to flag and scrutinize workers with sub-optimal responses rather than rejecting their work and not paying them.

To provide a basis for comparing and estimating contributor […] to assess solution quality and contributor attributes. In their approach, Oleson et al. [12] inject known solutions into the task as subtasks, and contributors receive instant feedback on the accuracy of their performance. The presence and quality of these subtasks enable the accuracy of a given contributor to be estimated in-task. As it is in-task, it also helps to improve the quality of workers' solutions by providing an explanation of why a solution is incorrect. The approach, however, is inappropriate for tasks that rely on forms of subjectivity, as the design requires a finite set of definite answers. Nevertheless, such a mechanism also provides a basis to train contributors, enabling self-evaluation of performance through feedback. The latter facilitates an integral element in the definition of competence: the evaluation of self-efficacy.
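To make the mechanic concrete, the following is a minimal sketch of gold-standard injection in the spirit of Oleson et al. [12], not their actual implementation: known-answer subtasks are mixed into the task stream, and a contributor's accuracy is estimated from those subtasks alone. All function and field names (e.g. the "gold" flag and "explanation" text) are illustrative assumptions.

    import random

    def build_task_stream(real_tasks, gold_tasks, gold_ratio=0.2):
        """Mix known-answer ('gold') subtasks into the stream of real tasks."""
        n_gold = min(len(gold_tasks), max(1, int(len(real_tasks) * gold_ratio)))
        stream = real_tasks + random.sample(gold_tasks, n_gold)
        random.shuffle(stream)
        return stream

    def record_response(task, response, stats):
        """Score gold subtasks in-task and return instant feedback."""
        if not task.get("gold"):
            return "Response recorded"
        correct = (response == task["answer"])
        stats["seen"] += 1
        stats["correct"] += int(correct)
        if correct:
            return "Correct!"
        # Explaining *why* the answer is wrong is what trains the contributor.
        return "Incorrect: " + task["explanation"]

    def estimated_accuracy(stats):
        """In-task accuracy estimate, based on gold subtasks only."""
        return stats["correct"] / stats["seen"] if stats["seen"] else None

The estimate sharpens as a contributor encounters more gold subtasks, which is what makes the in-task accuracy estimate possible in the first place.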

Completing meaningful tasks leads to motivation in the workplace [52]. Meaningful in this context implies that the worker is both doing work with purpose and receiving acknowledgement for accomplishments [53]. Chandler and Kapelner [26] transferred these findings into the crowd environment, showing an interdependency between how a task is framed and the resulting work output. A motivating task rationale, expressing a purpose and a higher goal, led to a significantly higher willingness to participate and a greater quantity of output.

Quality control is a dimension of Quinn and Bederson's human computation classification [54]. They caution that even motivated users might cheat or sabotage the system. We argue that the rationale behind subpar performance is that the motivation typically studied is extrinsic rather than intrinsic [55]. Ke et al. [56] investigated the role of intrinsic motivation in the adoption of Enterprise Systems among employees through the lens of self-determination theory. The authors investigated whether inducing intrinsic motivation results in better and smoother adoption of Enterprise Systems in an organization. Their findings suggest that individuals' intrinsic motivation should be enhanced to encourage them to adopt or explore new systems.

Ryan and Deci [57] define extrinsic motivation as "the performance of an activity in order to attain some separable outcome", or the performance of an activity to avoid punishment. Zhao et al. [58] studied the role of extrinsic motivation in having individuals share their knowledge on Q&A sites. They argue that while extrinsic motivation, when used as a reward, can help increase participation and knowledge sharing, it might also interact with intrinsic motivation, impacting self-esteem and self-actualization. It is unknown to what degree this interaction between extrinsic and intrinsic motivation impacts quality control in the crowd, which weighs towards the punishment end of extrinsic motivation.

There is a significant amount of work on both assessing and trying to ensure the quality of crowd work. These approaches typically reside prior to the undertaking of a task (e.g. qualification tests) or in-task (e.g. gold standards, redundant task scheduling). Choosing the "right" measure for a given task, however, is challenging, as many researchers have proposed many different quality control / assurance measures [21].

Contributors were prompted to access the task on our own web page. This allows for confounding-variable control, personalised feedback, and performing our own quality control.

The website created a unique code that contributors used to receive their payment through the CrowdFlower interface after completing the task. The user interface (Figure 1 and Figure 2) was identical for all conditions across both experiments. The same interface was used for collecting human judges' quality ratings in experiment 1.

FIGURE 2. Crowdsourcing interface for the second experiment, here illustrating that the worker should recognize and then multiply the two images. Shown is a non-bold verb, control-group view.
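The one-time completion code could be produced as in the following sketch. This is our own illustration under stated assumptions; the article does not describe its code-generation scheme, and issue_completion_code and redeem are hypothetical names.

    import secrets

    issued = {}  # completion code -> participant key

    def issue_completion_code(participant_key: str) -> str:
        """Generate an unguessable one-time code that the contributor
        pastes into the platform interface to claim payment."""
        code = secrets.token_hex(8)
        issued[code] = participant_key
        return code

    def redeem(code: str) -> bool:
        """Requester-side check: each code pays out exactly once."""
        return issued.pop(code, None) is not None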

We used between-group designs where each task had its own population (groups had no overlap in populations). To ensure this, we used IP tracking and browser fingerprinting so that contributors could not contribute to more than one condition, as well as corresponding constraints specified via the CrowdFlower and Figure Eight platforms.

On Figure Eight, 60% of the workers are male, and most workers are between 18 and 34 years of age, aligning with recent assessments of crowd labour participants [58]. For this study, we did not collect demographic information, as it did not serve the aim of the experiments. Only hashes of IPs and browser fingerprints were stored to maintain participant privacy. Table 1 and Table 2 show the distribution of contributors across conditions.
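Storing salted hashes rather than raw identifiers could look like the following sketch; it illustrates the stated privacy measure but is not the study's actual code, and the salt handling and names are assumptions.

    import hashlib

    SALT = b"per-study secret"  # assumption: one fixed secret salt per study

    def participant_key(ip: str, fingerprint: str) -> str:
        """Salted hash of IP + browser fingerprint; raw values are never stored."""
        digest = hashlib.sha256(SALT + ip.encode() + b"|" + fingerprint.encode())
        return digest.hexdigest()

    seen = set()  # participants already assigned to a condition

    def admit(ip: str, fingerprint: str) -> bool:
        """Admit a contributor only if they have not joined another condition."""
        key = participant_key(ip, fingerprint)
        if key in seen:
            return False
        seen.add(key)
        return True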

Our first experiment investigates three tasks of varying complexity using a three (task complexities) by five (quality control methods) factorial, between-group design. Following the experimental design of [1], the effort for completing each task is as high as or higher than that for cheating, which should disincentivize constant underperformance. We hypothesize the levels of complexity to be semantic similarity (least complex), question answering (more complex), and text translation (most complex).

Each task is repeated five times, once with each of five different quality control methods: none, fake, intro, auto, and wizard. First, in the (none) level we performed no quality control. For the (fake) level, we announced very prominently in the task description that we use introductory quizzes to check contributors' qualifications, yet contributors did not actually undertake a test. The third level (intro) announced an introductory quiz and required contributors to complete the quiz with 80% accuracy, which is akin to many qualification tests (cf. 'Qualification Tests').

In the fourth level (auto), we added a basic machine learning (ML) system to estimate the quality of responses. The ML system uses a three-level scale: good, acceptable, unacceptable. This estimate was reported to contributors, making it akin to […]

Judges were not informed about the details of the experiment but had experience in crowdsourcing. Judges saw the initial request and the answer, and additionally had a slider to rate the response quality (Figure 1) which was not shown on the contributor interface. Responses from all conditions were randomly selected, and judges were not informed of which condition a response came from. They were asked to judge performance based on the description of the task as shown to the contributors.

We measure and report the agreement between judges using Krippendorff's Alpha [68]. [66] and [67] illustrate that in a scenario of ten equi-distributed classes with a target Alpha value of 0.8 or higher, a sample size of 293 is sufficient to judge this Alpha level with a p-value < 0.05. As we collected more than 1000 samples, our expected p-value is < 0.005 for an Alpha level of 0.8, which according to Krippendorff is substantial. As illustrated below, the provided description was adequate, as the observed agreement between judges was substantial with a p-value < 0.05.

We calculated the average perceived response quality for each contributor as our quality measurement. We consider contributors with 40% unacceptable responses, i.e., an average perceived response quality below 0.6, as constantly underperforming. The value of 0.6 was chosen with regard to the ability to recover high-quality answers from noisy input. A commonly used method for recovering high-quality responses from noisy human input data is Expectation Maximization: as [65] showed, with five raters with an average consistent performance of 0.6 or above, a final Cohen's Kappa of 0.9 can be achieved.

Additionally, we measure the correlation between our ML system's predictions and our human judges. As our data violates the assumptions of the Pearson product-moment correlation, we use Spearman's ρ. Ground truth data was acquired from the human judgement data: we selected only the samples on which judges achieved full agreement and selected 30 samples per class. The classifier showed a Cohen's Kappa of > 0.75 on unbalanced test sets, resulting in accuracy levels of 0.8-0.92 on class-balanced test sets. These results are consistent with [61].
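For illustration, inter-judge agreement of this kind can be computed with the open-source krippendorff Python package (our tooling choice; the article does not name its own software). The ratings below are made-up values on the ten-class scale; np.nan marks items a judge did not rate.

    import numpy as np
    import krippendorff  # pip install krippendorff

    # Rows are judges, columns are rated responses (classes 1-10).
    ratings = np.array([
        [8, 6, 9,  np.nan, 3],
        [7, 6, 10, 4,      2],
        [8, 5, 9,  4,      np.nan],
    ], dtype=float)

    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="ordinal")
    print("Krippendorff's alpha: %.3f" % alpha)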

In line with [37] and [38], instruction clarity and contributor satisfaction were tested using the built-in metrics provided by CrowdFlower for all three tasks. Upon completion of a task, contributors could opt into a satisfaction survey. Contributors scored the task on a 0-5 scale for overall satisfaction, instruction clarity, fairness of test questions, payment, and ease of job. Results of these surveys are reported with each task.

[…] feedback was enabled in the second half of the task, with it disabled in the first half (final feedback group).

Upon completing a micro-task, workers received standardized feedback responses: "Response recorded" when feedback was disabled. When feedback was enabled, a correct solution would reveal "Your answer is fine", and an incorrect solution "Other workers have disagreed with your response", where the latter response aims to indicate that the answer was not known a priori. To further increase the potential effects of quality-control-based feedback, workers were not able to edit their answer once it was committed to the system, thus encouraging later solutions to be cognizant of any feedback received.

We classified responses as either correct or incorrect, resulting in a dichotomous quality representation. We refrained from notions of partial correctness in this experiment as, firstly, this is captured in the first experiment, and secondly, it is difficult to define a meaningful representation of partial quality without additional contextual information, such as whether the worker misread the image, performed the wrong arithmetic operation, inadvertently struck the wrong key, pressed enter too early, etc., vs. having insufficient interest in providing a valid answer. Yet two aspects are consistent among these examples: (un)intentional human error, and due care and attention to detail, which the provision of feedback will highlight to the worker. Many of these scenarios can also be accommodated in the analysis of the experimental data.

We considered three independent variables: the feedback scenario (control, initial feedback, and final feedback), whether the instruction verb (add/multiply) is bold or not, and whether the worker started with addition or multiplication, as well as one dependent variable: mean response quality.
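The feedback rules above reduce to a small dispatch; the following is a sketch under our own naming assumptions (the message strings are the ones used in the experiment).

    def feedback(feedback_enabled: bool, correct: bool) -> str:
        """Standardized feedback shown after each committed micro-task;
        answers cannot be edited once committed."""
        if not feedback_enabled:
            return "Response recorded"
        if correct:
            return "Your answer is fine"
        # Deliberately phrased as peer disagreement: the correct answer is
        # not revealed, signalling it was not known a priori.
        return "Other workers have disagreed with your response"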

In all conditions of the first experiment, contributors were shown three examples of correctly solved tasks and a description of the task; in the second, only a task description was shown. Table 1 shows the distribution of our contributors by level of quality control method and task complexity for the first experiment. Table 2 shows the distribution of our contributors by treatment group, whether the key instruction verb was bold, and whether the task started with addition or multiplication.

A. EXPERIMENT 1: WORD-BASED SEMANTIC SIMILARITY
Humans are better than algorithms at rating the semantic similarity between two words [6]. Semantic similarity plays […]

[…] annotators with access to the internet, we designed a set of 50 questions such that using the question as a search string would not reveal the correct answer right away. We randomly selected 10 questions to serve as test questions for the conditions with an introductory test (Intro, Auto, Wizard). We designed sets of possible answers to these 10 test questions […] Responses within a margin of one standard deviation were considered acceptable.

Each contributor could answer up to 80 questions. We col[…]

Text translation is a demanding task even for humans, as in-depth knowledge of two different domains, the target and the source language, is required. Various approaches exist; applying crowdsourcing to translation via targeted paraphrasing [80] and iterative collaboration between monolingual users [81] are two examples. Other common approaches utilize mono- or bilingual speakers to proofread and correct machine translation results [82]. For our experiment, we used a popular Wikipedia article in German and one in Vietnamese. Native speakers of German and Vietnamese prepared a set of sentences from the respective article, taking its first 150 sentences. Headlines, incomplete sentences, and sentences that contained words in a strong dialect were removed. We requested translations for the remaining sentences from contributors via CrowdFlower. As the target language was English, we used the same quality-prediction method for conditions that included a pre-test as for the question answering task. Each contributor could translate up to 100 sentences. We collected 2,119 translations for the Vietnamese set and 2,002 translations for the German set (4,121 in total) from 90 contributors (46 on average). We collected 825 sentences on average in each control condition.

(Table 2 shows the break-down across the 12 conditions.)

Before we can contextualize the results, we must first establish that task complexity indeed influences response quality and that we measure response quality reliably.

TABLE 3. Results of the self-assessment; it is not possible to calculate SD as CrowdFlower only offers aggregated data. From left to right, the columns refer to overall satisfaction, instruction clarity, test question fairness, payment, and ease of job.

FIGURE 3. Task complexity affects response quality. The most complex task, text translation (right), has a significantly lower average response quality than the more simplistic semantic similarity task (left) and the question answering task (middle). The figure shows a violin plot combining a boxplot and a kernel density plot. Thick dark lines indicate the 1st and 3rd quartiles, the red lines the population means.

We analyse effects for each level of the task complexity factor, assuming that the average response quality deteriorates with higher-complexity tasks. As seen in Table 3 […]

TABLE 4. Inter-rater agreement on perceived response quality. The results are homogeneous for all three tasks and indicate substantial agreement between our judges.

TABLE 5. ANOVA results of main and interaction effects. The first row shows the effect of the quality control method, the second the effect of the task, and the third their interaction effect.

C. QUALITY CONTROL AND TASK COMPLEXITY INTERACT
As we have different numbers of contributors in our conditions, we also verified that our conditions have equal variance in the dependent variable prior to executing an analysis of variance (ANOVA). As the distributions do not differ significantly from normal distributions, we use Bartlett's test for homoscedasticity (equal variance) [89]. We found that the variance does not differ significantly between our conditions, χ²(4) = 2.764, p = 0.598. As our data holds no evidence that it violates the assumptions of the ANOVA, we analyse main and interaction effects with a two-way ANOVA to compare the effects of quality control and task complexity on the dependent variable, perceived response quality. Table 5 shows these results.
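The same checks can be reproduced with SciPy and statsmodels (the article does not state its analysis software, and the data layout and column names here are assumptions): Bartlett's test across the five quality control conditions, followed by the two-way ANOVA.

    import pandas as pd
    from scipy import stats
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Assumed layout: one row per contributor with columns 'quality'
    # (mean perceived response quality), 'qc' (quality control level),
    # and 'task' (task complexity level).
    df = pd.read_csv("contributors.csv")

    # Bartlett's test for homoscedasticity across the QC conditions.
    groups = [g["quality"].values for _, g in df.groupby("qc")]
    chi2, p = stats.bartlett(*groups)
    print("Bartlett: chi2 = %.3f, p = %.3f" % (chi2, p))

    # Two-way ANOVA: main effects of QC method and task complexity,
    # plus their interaction (cf. Table 5).
    model = ols("quality ~ C(qc) * C(task)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))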

From the ANOVA results, we conclude that task complexity as well as the quality control method used have a significant influence on the perceived response quality.

TABLE 7. Results of Welch two-sample t-tests with Holm correction. Line 1 compares the semantic level to the question level of the task complexity factor; line 2 compares semantic to translation, and line 3 question to translation.

Table 6 presents differences between levels of the control factor. […] Table 7; other levels do not differ significantly. Table 8 shows means and standard deviations for all levels of our two factors. Figure 4 further illustrates that the finding is constant across all tested tasks.

We also investigated the proportion of constantly underperforming contributors (contributors below a quality level of 0.6). We found that across all no-quality-control conditions we had a substantial number of contributors (N = 22) with an average response quality below 0.6. In all other conditions combined, we found 11 contributors under this threshold. The proportion of underperforming contributors in the none conditions is 0.42. Compared to the other conditions, with a proportion of only 0.05, this value is extremely high [68].

In the auto level of the quality control factor, an ML system predicted the response quality of contributors based on two features (the number of characters typed and the time needed to complete a request). To estimate the quality of this prediction, we calculated the correlation between our ML system's predictions and the average perceived quality. The ML system rated responses on a scale with three ordered values (unacceptable (1); acceptable (2); good (3)). As this scale is ordinal and violates the assumptions of Pearson's product-moment correlation, we analysed the correlation using Spearman's ρ. We found a substantial correlation between the predictions and the average perceived quality of our human judges, ρ(937020) = 0.71, p < 0.001. The correlation between the two human judges, in comparison, is ρ(463061) = 0.85, p < 0.001. In contrast, the human raters who replaced the ML system in our wizard condition achieved a correlation of ρ(705574) = 0.78, p < 0.001.
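Such a rank correlation is one call with SciPy; the arrays below are placeholders, not our data.

    from scipy import stats

    # ML predictions on the ordinal 3-point scale (1=unacceptable,
    # 2=acceptable, 3=good) and the judges' average perceived quality
    # for the same responses.
    ml_scores = [3, 2, 1, 3, 2, 2, 1, 3]
    judge_scores = [0.9, 0.6, 0.2, 0.8, 0.55, 0.6, 0.3, 0.95]

    rho, p = stats.spearmanr(ml_scores, judge_scores)
    print("Spearman rho = %.2f, p = %.3g" % (rho, p))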

In our second experiment, we investigated three main effects: 1) quality control through the treatment variable (QA Treatment), 2) the task itself, either addition or multiplication (add/multiply), and 3) increased attention through bolding the action words in the task description (Bold). We also investigated possible interaction effects between the significant effects. The QA Treatment variable does not show a significant overall impact on the data set (see Table 9). The Task variable, which encoded whether the user executed addition or multiplication, had a significant effect, as did the bolding of the verbs (add/multiply) in the task description. Finally, showing the addition task before the multiplication task (Addition first) also influenced the quality outcome. Table 9 shows an analysis of variance testing for potentially interesting effects and interactions.

As in the first experiment, the second experiment (image recognition) again shows only minimal, non-significant quality differences between the three quality control conditions (QA Treatment). Table 10 shows the results of our linear model for the three conditions.
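A sketch of such a linear model with statsmodels, mirroring the structure of Table 10 (control group as intercept, the two QA treatments as dummies); the column names are assumptions.

    import pandas as pd
    from statsmodels.formula.api import ols

    # Assumed layout: one row per contributor with their mean response
    # quality and the three factors of experiment 2.
    df2 = pd.read_csv("experiment2.csv")

    model = ols("quality ~ C(qa_treatment) + bold + addition_first",
                data=df2).fit()
    print(model.summary())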

A strong contributor to response quality was the task itself. Contributor performance was significantly lower when completing the addition task compared to the multiplication task. The reasons for this effect will be discussed in detail in the conclusion section, but the primary reason was the (provoked) misunderstanding of the task description [37], [38]. The addition task has a ~7% higher error rate than the multiplication task (see Figure 5 and Table 11, which illustrate that addition tasks are incorrect more often, but multiplication […]).

TABLE 9. The QA Treatment variable does not show a significant overall impact on the data set. The Task variable, encoding whether the user executed addition or multiplication, had a significant effect, as did the bolding of the verbs (add/multiply) in the task description. Finally, showing the addition task before the multiplication task (Addition first) also influenced the quality outcome.

TABLE 10. Feedback disabled (control group/intercept); QA Treatment 1: automated feedback enabled only in the first half of the task; QA Treatment 2: automated feedback enabled only in the second half of the task.
TABLE 11. The addition task shows significantly lower response quality. The reason is a misinterpretation of the term "add" in the task description: contributors were putting the two numbers in sequence instead of adding them, so a 2 and a 0 would be interpreted as 20 rather than 2.
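This misreading can be detected mechanically. The following sketch (ours, not from the article) classifies a response as correct, a concatenation error of the kind described above, or some other error.

    def classify_response(a: int, b: int, response: int, op: str) -> str:
        """Classify an answer to an 'add'/'multiply' task over operands a, b."""
        expected = a + b if op == "add" else a * b
        if response == expected:
            return "correct"
        # The observed misconception: '2 and 0' read as the sequence '20'.
        if response in (int(str(a) + str(b)), int(str(b) + str(a))):
            return "concatenation_error"
        return "other_error"

    # e.g. classify_response(2, 0, 20, "add") -> 'concatenation_error'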
Whether the contributor was asked to complete the addition or the multiplication task first had a significant impact on the overall response quality of a contributor. The misconception from the addition task seems to carry over to the multiplication task in some cases. This observation is related to another significant effect in the data: if the addition task is shown as the first task group, the negative effect from the wording is carried through to the multiplication task. The overall quality is reduced by ~4% when the addition task is shown first. This can also be seen in Figure 6 (top), where, when addition is shown first, the success rate of addition tasks is lower. Conversely, this effect is not present in multiplication tasks.

FIGURE 6. Violin plots illustrating task success rates according to whether addition or multiplication tasks were first (top) and whether the keywords "Add" and "Multiply" were bold in the instructions or not (bottom). In both plots, the left two violins illustrate the general distribution for Add and Mult, and the black circle the mean.

Setting the verbs (add/multiply) in bold in the task description increased response quality for both task types (see Figure 6 (bottom)). It also increased the response quality equally across both task orders. The carried-over negative effect of the addition task was mitigated by the bolding of the verbs. Bolding increases the average performance by >6%. The increase in quality is consistent across all other variables and can be observed with almost the same effect size in all QA
Treatment conditions and across all other factors. Table 13 illustrates these results.

[…] quality control mechanism vs. the complexity of the task to be performed. We saw that more complicated tasks (text translation) were not in need of more complex (e.g., human-based or machine-learning-based) quality control mechanisms.

In fact, we observed no statistically significant improvement in response quality across the quality control mechanisms applied. We did, however, observe a structural difference in […]

Our second experiment sought to build on and refine these observations. We contrasted the effects of quality control methods with the effects of subtle changes in the task description and the ordering of tasks. We again observed that the presence of a feedback-based quality control mechanism increases output quality in the image recognition and reasoning tasks. We also observed the effects of very subtle changes in the task description (bolding parts of the description (add/multiply)) and of the ordering of the two different tasks (addition first vs. multiplication first). We found that these subtle changes in task description and presentation have more impact on response quality (7% and 4%) than the absence of a control-based QA Treatment (2% and 1.5%). This illustrates that it is more effective to interact with constantly underperforming contributors to understand the reasons for their actions rather than treating them as mere computational elements; otherwise said, to support their intrinsic motivation rather than enforce extrinsic motivation. The goal of quality assurance measures should foremost be to understand possible misconceptions rather than to control contributors. The effects introduced by poor task design and task descriptions outweigh the impact of so-called "cheaters". In contrast, interacting with these underperforming contributors can enhance quality and satisfaction on both sides.

Returning to RQ1, we observed only minimal, non-significant differences between different quality control mechanisms in terms of response quality. We observed this in both experiments, which capture a wide range of crowd tasks (in NLP and image recognition/processing). Similarly, we observed that the number of underperforming workers increases in the absence of any quality control announcement. This is not surprising; however, in the case where quality control was announced but not performed, there were also only minimal, non-significant differences compared to technically advanced mechanisms of quality control.

For RQ2, we can (perhaps not surprisingly) note that task complexity has an impact on response quality. Yet it is surprising that increasing the level of sophistication of the quality control mechanism for more complex tasks is less impactful than the increase in task complexity itself. We also observed that contributor performance was tightly linked to simple design aspects of the task: making key words bold, task ordering, and a small (yet still significant) effect of when in the task quality control feedback occurs; we observed a slightly reduced error rate when feedback was provided earlier in the task.

From these findings, we propose the following suggestions on how practitioners and the research domain can apply quality control to reduce underperforming contributors, with the goal of increasing response quality:
1) Mention quality control: The mention of a required (qualification) test or a similarly appropriate mechanism (i.e., the fake level in experiment 1) is sufficient to deter "poor" contributors. Using this alone, we observed an increase of more than 25% in response quality.

[…]
In terms of training, we argue that the most basic way to promote this in-task learning is the interaction between con[…] impact on response quality than even sophisticated control-based QA methods. Yet even so, we also know that it is harder to achieve high response quality in high-complexity tasks.

Thus, our suggestion is that instead of investing in complex, resource-demanding mechanisms for quality control (this is not a dismissal of research into quality control mechanisms), we should rather seek to develop approaches that improve contributor training and skill development to globally improve quality [1], [8], [17], [18].