Reliable data collection in participatory trials to assess digital healthcare apps


Abstract
The number of digital healthcare mobile apps on the market is increasing exponentially owing to the development of mobile networks and the widespread use of smartphones. However, only a few of these apps have undergone adequate validation. As with many mobile apps, healthcare apps are generally considered safe to use, which makes it easy for developers and end-users to exchange them in the marketplace. The existing platforms are not suitable for collecting reliable data to evaluate the effectiveness of the apps. Moreover, these platforms reflect only the perspectives of developers and experts, not of end-users. For instance, data collection methods typical of clinical trials are not appropriate for participant-driven assessment of healthcare apps because of their complexity and high cost. Thus, we identified a need for a participant-driven data collection platform for end-users that is interpretable, systematic, and sustainable, as a first step toward validating the effectiveness of the apps. To collect reliable data in the participatory trial format, we defined distinct stages for data preparation, storage, and sharing. Interpretable data preparation consists of a protocol database system and a semantic feature retrieval method that allow a protocol to be created without professional knowledge. The calculation of a reliability weight for the collected data belongs to the systematic data storage stage. For sustainable data collection, we integrated the weight method with a future reward distribution function. We validated the methods through statistical tests conducted on 718 human participants. The validation results show significant differences between the methods in the comparative experiments and demonstrate that choosing the right method is essential for reliable data collection. Furthermore, we created a web-based system for our pilot platform to collect reliable data in an integrated pipeline. We compared the platform features with those of existing clinical and pragmatic trial data collection platforms. In conclusion, we show that the method and platform support reliable data collection, forging a path to effectiveness validation of digital healthcare apps.
The increased interest in digital health has led researchers to study the effectiveness of current apps. As a practical example, Pokémon GO, a game app based on augmented reality, has positive effects on social interaction: 43.2% of people spent more time with their family [52]. In contrast, a 6-week online intervention experiment revealed that Headspace, a healthcare app for mindfulness through guided meditation, has relatively small effects on mindfulness [71]. The outcome contradicts the results of previous randomized controlled trials for the app [44]. Because of such contradictions, there is a need for a reliable mechanism to identify validated digital health apps [62].

Currently, there is no proper platform that evaluates the effectiveness of an app using a systematic and objective validation method in digital health [62]. In particular, mobile health apps have a very fast development cycle and raise few safety issues, which tends to reduce the burden of regulation. As a result, these apps are traded directly between developers and end-users. However, the few existing platforms are not suitable for evaluating the effectiveness of the apps in this case. For example, review platforms and guidelines are emerging to solve the problem, but [...] patients. At the last step of the project, the TLS clinical trial team designs the protocol from the collected data of the participants. However, in this case too, the creation of the protocol is driven by an expert team rather than by common users. Accordingly, the methods for data collection in clinical trials are not appropriate for participant-driven assessment of healthcare apps (whose number is growing exponentially) because of their complexity and high cost [24,94].

A pragmatic trial is a study that measures the real-world effectiveness of an intervention in broad groups [28]. The data collection platform of a pragmatic trial includes not only EDC but also specialized research tools and general surveys. These data collection platforms can be web-based survey applications, mailed questionnaires, or specific healthcare apps developed from research kits [8,12,15]. Therefore, while the platforms still have the issue of complexity, there is also the possibility of collecting less reliable data. For instance, PatientsLikeMe is a platform that shares experience data so that participants can understand possible treatments for particular disease conditions based on the experiences of others [101]. However, PatientsLikeMe does not provide an environment for the public to lead the study preparation, and the platform has no feature to evaluate the reliability of collected data. Another example is Amazon Mechanical Turk (MTurk). MTurk is a crowdsourcing marketplace for the recruitment of research participants and a platform for conducting surveys [73]. However, the platform does not provide any standardized method at the data preparation stage. In other words, the platform needs clinical experts to prepare data collection procedures and systems based on their knowledge. MTurk provides a feature to approve or reject an individual worker, but the feature relies on subjective judgement, which limits its objectivity. We found that this platform has no suitable method to measure the reliability of the data collected from the participants.

The participatory trial platform allows a public user to create and conduct a participant-driven study to measure the effectiveness of products within daily life. The core factors of the platform are simplicity, reliability, and sustainability. Based on this definition, we identified comparable platforms that have an alternative, similar, or essential feature. For example, Google Play and Apple App Store are alternative platforms because they keep user ratings and reviews in the public domain [3,35,66]. Both platforms have free-text app reviews and scaled rating functions as a data storage feature. However, free-text reviews require natural language processing and do not amount to structured data collection [61]. In other words, the platforms have no simple data preparation method to create a systematic data collection protocol. In addition, the platforms carry a risk of transfer bias that could affect new reviews and ratings, because they expose previously collected data to new participants [72]. Another limitation of the platforms is that they do not offer any features to evaluate data reliability. RankedHealth and NHS Apps Library represent similar platforms [1,36,90]. RankedHealth has a procedure to minimize the influence of individual reviewers for data reliability. NHS Apps Library publishes safe and secure apps through questions designed by experts from technical and policy backgrounds. However, these platforms have a limitation: the consensus of the experts focuses on evaluating the technical performance of an app. Expert assessment does not validate the effectiveness of the user experience of the app. Thus, they are not appropriate for participant-driven studies.

Finally, none of the platforms we mentioned addresses drop-out prevention or the software characteristics of digital healthcare apps. A study with daily collection of pain data using digital measures showed an average self-reporting completion rate of 71.5% (261/365 days). Recent studies are attempting to develop incentivized programs or in-game rewards to increase self-reporting rates, because sustainable data sharing for collecting large amounts of data is a crucial function of participatory trial platforms [43,49,92]. Furthermore, unlike drugs, functional foods, and mechanical devices, which are difficult to modify after market launch, healthcare apps can be updated as software [26]. The app features may require iterative evaluation following an upgrade. In summary, a new platform for digital healthcare apps should have a sustainable data collection function.

Here, we propose a participant-driven reliable data collection method for participatory trial platforms as a first stage toward understanding the effectiveness of healthcare apps. The method consists of three steps: understandable data preparation, systematic data storage, and sustainable data sharing. We utilize a clinical trial protocol database and a semantic relatedness method for the participatory trial protocol to prepare data collection. We develop a data reliability weight formula so that data are collected systematically. We propose a future reward distribution function connected to the data reliability weight for sustainable collection of data. In the results section, we describe the following experiments to validate the reliable data collection method: a comparison of the simplicity of data preparation, data reliability weight validation, and observation of the future reward distribution effect. We report testing on a total of 718 human participants for the experiments. The Institutional Review Board (IRB) of KAIST approved an IRB exemption for the experiments.

Moreover, we have developed a web-based pilot platform accessible to the public with real-world data as a crowdsourcing tool based on the citizen science concept. The pilot platform systematically integrates all proposed methods. We conduct case studies on the pilot platform to validate efficient recruitment. To demonstrate the advantages of the proposed platform, we compare its functionality to that of existing platforms.

2 Definition of Participatory Trial

A participatory trial is an expanded form of a human-involved trial in which the public tests the effectiveness of products within their daily life. The concept follows crowdsourcing and citizen science in the aspect of data-driven science [40,53]. The [...] Table 1 [67]. We have defined a participatory trial to improve on current trial definitions in modern society.

Within this manuscript, we also define the following terms:

• Platform: A digital platform, that is, an environmental software system associated with each functional part of application programs.

• Data preparation: Determining the data collection structure; this has the same meaning as designing the protocol.

• Data storage: Evaluating collected data to store them reliably and to determine the value of the data.

• Data sharing: Integrating collected data to share with others and to provide benefit to the data provider.

[...] interventions for over 70 years [9]. Focused on evaluating the efficacy of specific interventions, clinical trials often use controlled settings and strict criteria for participants and practitioners. The EDC system is increasingly recognized as a suitable method that has advantages in real-time data analysis, management, and privacy protection [99]. The system is most useful in trials with complicated contexts, such as international, multi-centered, and cluster-randomized settings [96]. However, because of their strictness and complexity, clinical trials still have certain limitations in the thorough evaluation of effectiveness or external validity [81].

Clinical trials are often time-consuming and challenging because they utilize rigorous methods. These characteristics of clinical trials have increased the need for pragmatic trials that seek to understand real-world evidence by assessing the effectiveness of the intervention in actual clinical practice settings [28]. As introduced in the "PRECIS" assessment by Thorpe [...] and research kits, are used for these kinds of trials [8,12,15]. Consequently, pragmatism is an emerging source of clinical evidence, from pediatric asthma and cardiovascular diseases to monetary incentives for smoking cessation [4,36,85]. However, it is subject to several challenges such as patient recruitment, insufficient data collection, and treatment variability [28,91]. In addition, inconsistency in data platforms is another major limitation of pragmatic trials [95].

We divided the collection of data for evaluating the effectiveness of the healthcare app into the stages of interpretable preparation, systematic storage, and sustainable sharing. We used the essential methods of each stage to collect highly reliable data continuously. In addition, from a participatory trial perspective, we have organized each method so that the public can easily participate and organize their own research, provide reliable information, and share it to induce continued interest (Figure 1).

At the data preparation stage, we determined what kinds of data should be collected, how to collect the data, and where to obtain the data. We had to treat the data as an interpretable resource for determining the effectiveness of a healthcare application. In the fields of clinical and pragmatic trials, researchers usually develop a protocol to facilitate the assessment of a changing symptom or disease [16,93]. Healthcare applications also focus on improving a symptom or treating a disease. A phenotype is an observable physical characteristic of an organism, including a symptom and disease. Therefore, we can effectively measure the change of a phenotype using an existing protocol that is semantically related to the phenotype.

We developed a protocol database system and a concept embedding model to calculate semantic relatedness for a protocol retrieval system for interpretable data preparation. The database contains 184,634 clinical trial protocols for context-dependent protocol-element selection [74]. We created the model to find a vector in a latent space based on distributed representations of a documented method. Furthermore, we calculated the semantic relatedness between the input phenotype and protocols with cosine similarity [55,75]. Finally, we integrated the methods into the proposed platform for the participatory trial, so that anyone can create a study to easily determine the effectiveness of a healthcare application. Figure 2 shows the overall procedure of the method.

The participatory trial platform uses crowdsourcing methods to collect data. Crowdsourcing refers to obtaining data and using it for scientific research through

Figure 1. Method overview. Panels: Data Preparation, Data Storage, Data Sharing. The method consists of three parts: data preparation, data storage, and data sharing. In data preparation, a creator can devise a new study for the evaluation of an app by searching with the protocol creation method. Next, in data storage, the participants conduct a study, and the platform stores all the participants' replies. At the same time, it collects statistics of the replies and calculates the data reliability weight of each reply. After the study is finished, in data sharing, the platform calculates a future reward for the participants and the creator depending on each person's reliability and distributes the reward to each of them.

Figure 2. Data preparation for data collection. Diagram labels: symptom name (e.g., Headache), semantic search (e.g., C0018681), protocols, protocol creation, created protocol, preparation for data collection. The creator retrieves previous protocols based on the name of the symptom for which the app's effectiveness is to be verified. Then the creator prepares to collect data based on the retrieved protocol and creates the participatory trial protocol for data storage.
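The retrieval step described for data preparation can be illustrated with a minimal Python sketch. It ranks stored protocols by cosine similarity to a phenotype vector. The three-dimensional vectors, the protocol identifiers, and the function names are hypothetical illustrations; the embedding model and the protocol database themselves are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction, treat as unrelated
    return dot / (norm_u * norm_v)


def rank_protocols(phenotype_vec, protocol_vecs, top_n=3):
    """Return the top_n (protocol_id, similarity) pairs, most similar first."""
    scored = [
        (protocol_id, cosine_similarity(phenotype_vec, vec))
        for protocol_id, vec in protocol_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy embedding vectors (assumed values, for illustration only).
headache = [0.9, 0.1, 0.0]
protocols = {
    "protocol_a": [0.8, 0.2, 0.1],
    "protocol_b": [0.0, 0.1, 0.9],
    "protocol_c": [0.5, 0.5, 0.0],
}
ranking = rank_protocols(headache, protocols, top_n=2)
```

In this toy setting, the protocol whose vector points in nearly the same direction as the phenotype vector is ranked first, which is the behavior a creator would rely on when searching by symptom name.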

[...] amount of data from the public at a low cost [48,51]. However, data collected through crowdsourcing are often criticized as less reliable than results obtained by systematic approaches [25,59]. Therefore, devising a method to measure data credibility is the main challenge of the data reliability issue.

The primary purpose of this method is reliable data collection and delivery for effectiveness evaluation, which is the next level on the path to validation. To achieve this purpose, we needed a method to guarantee the reliability of input data. Therefore, we proposed a formula to calculate the data reliability weight (DRW) through combined effort measures of input data. [...] providing accurate responses and correctly interpreting the data [42]. The recommended composition of IER measures is response time, long-string analysis, and individual reliability. Response time (RT) determines the effort based on execution time [104]. Long-string analysis (LSA) uses intentional long-string patterns to assess the effort of a participant [21]. Individual reliability (IR) divides a pair of data items into two halves and uses the correlation between the two halves to determine the normal response [83]. In addition, Huang et al. also presented a single self-report (SR) IER item, which has empirical usefulness [41]. The SR is a single item for detecting IER, and it uses the same 7-point Likert scale [65]. We developed the DRW calculation method based on the IER methods.
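As a rough illustration of two of these effort measures, the Python sketch below computes a response time and a long-string statistic. It assumes the common reading of long-string analysis as the longest run of identical consecutive answers; the function names and the sample answers are hypothetical, and the exact statistic used for DRW may differ.

```python
def longest_identical_run(responses):
    """Length of the longest run of identical consecutive responses."""
    longest = 0
    current = 0
    previous = object()  # sentinel that compares unequal to any response
    for value in responses:
        current = current + 1 if value == previous else 1
        longest = max(longest, current)
        previous = value
    return longest


def response_time_seconds(start_ts, end_ts):
    """Elapsed time between opening a form and submitting it."""
    return end_ts - start_ts


# Hypothetical 7-point Likert answers from two participants.
attentive = [3, 5, 2, 6, 4, 4, 1, 7]
careless = [4, 4, 4, 4, 4, 4, 2, 4]
```

A long run of the same answer, as in the second list, is the kind of pattern that long-string analysis is meant to flag as low effort.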

We calculated $DRW_{idx}(X)$ according to the type of electronic case report form (eCRF) to obtain DRW (1). $f_{RT}$, $f_{LSA}$, $f_{SR}$, and $f_{IR}$ are the indexes of DRW. The created protocol contains eCRFs, and an eCRF is a tool used to collect data from each participant. We provided four types of eCRF as described in Supplementary Table 1. We calculated DRW from the symptom and data type eCRFs. The symptom type [...]

We calculated $f_{RT}$ and $f_{LSA}$ for the symptom and data types of $X$. $f_{RT}$ calculates the difference between the start and end times when the participant inputs data to $X$. $f_{LSA}$ uses the length of the input strings. On the other hand, we calculated $f_{SR}$ and $f_{IR}$ only for the symptom type of $X$. We did not include the SR and IR items in the data type of $X$ to avoid distortion of the responses by the participant over the iterative procedures. To measure whether a participant remains attentive until the end of data entry, we placed an SR item at the end of the symptom type of $X$. The measured pattern of participants should be similar to the symptom type. To reflect the pattern for $f_{IR}$, we calculated a Pearson correlation coefficient [30]. $r_{IR}$ is the correlation value of $Z$ and $Z'$ (2).

[...] items belonging to $X$. The number of items in $Z$ and $Z'$ is the same. We generated $Z$ and $Z'$ so that the correlation is close to one. $z_{mk}$ and $z'_{mk}$ are the item values of $X_k$. $K$ is the number of participants, and $k$ denotes the $k$th participant.
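The individual reliability index rests on a Pearson correlation between two matched item sets. The Python sketch below shows that calculation for one participant. The lists `z` and `z_prime` stand in for the two item sets, and their values are made up for illustration; how the items are paired in a real eCRF is not specified here and is left as an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant answers carry no correlation information
    return cov / math.sqrt(var_x * var_y)


# Hypothetical answers of one participant to two matched item sets.
z = [1, 2, 3, 4, 5]
z_prime_consistent = [2, 3, 4, 5, 6]
z_prime_erratic = [5, 1, 4, 2, 3]

r_consistent = pearson_r(z, z_prime_consistent)
r_erratic = pearson_r(z, z_prime_erratic)
```

A participant who answers matched items consistently yields a correlation near one, while erratic answers pull the value toward zero or below.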
Each index has its own cutoff value, and we calculated the cutoff values for each prepared CRF in a study. In detail, we calculated the mean ($\mu$) and the standard deviation ($\sigma$) to remove outliers. We removed a value as an outlier if it was greater than $\mu + 3\sigma$ or smaller than $\mu - 3\sigma$. After removing the outliers, we calculated $\mu'$ and $\sigma'$. Then, we calculated the cumulative distribution function (CDF) for the values of the DRW indexes (3). One reason for using the CDF is to find a cutoff value that has the lowest area under the probability density function among the input values. The other reason is to represent random variables of real values whose distribution is unknown using the collected values [13]. Accordingly, we used the CDF values to find $\mu$ and $\sigma$ again to get a normal distribution. On this distribution, we obtained the cutoff value using the z-score for the p-value (default 0.05). $h(X')$ returns 0 if the normalized input value is smaller than the cutoff value, and 1 if the value is larger than the cutoff value (4). We calculated the DRW indexes with this binary decision function.
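The cutoff procedure can be sketched in Python under simplifying assumptions. The sketch removes values outside three standard deviations, recomputes the mean and standard deviation, and places a lower-tail cutoff at the z-score for p = 0.05. It skips the intermediate CDF-based normalization step, uses the sample standard deviation, and assumes that small values indicate low effort; all three choices are assumptions of this illustration, and the function names and data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def remove_outliers(values):
    """Drop values outside mean +/- 3 standard deviations (single pass)."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if mu - 3 * sigma <= v <= mu + 3 * sigma]


def cutoff_value(values, p=0.05):
    """Lower-tail cutoff from the outlier-free mean and standard deviation."""
    cleaned = remove_outliers(values)
    mu_prime = mean(cleaned)
    sigma_prime = stdev(cleaned)
    z = NormalDist().inv_cdf(p)  # negative for p < 0.5, about -1.645 at 0.05
    return mu_prime + z * sigma_prime


def binary_index(value, cutoff):
    """Binary decision: 0 below the cutoff, 1 otherwise."""
    return 0 if value < cutoff else 1


# Hypothetical response times in seconds for one CRF.
times = [28, 29, 30, 31, 32] * 4 + [400, 5]
cleaned_times = remove_outliers(times)
cutoff = cutoff_value(times)
```

With these toy values, the single extreme response time is discarded before the cutoff is computed, so it does not inflate the threshold that the remaining participants are judged against.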
We calculated the DRW of a participant once we had computed all the cutoff values of the DRW indexes for each CRF (5,6). $N$ is the number of CRFs assigned to a participant. $C_{DRW_i}$ is the number of DRW calculation items for each CRF type. Figure 3 displays an example calculation procedure for DRW.
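One plausible reading of this aggregation is sketched below in Python: each CRF contributes the fraction of its binary indexes that passed, and the participant's DRW is the average of these fractions over the assigned CRFs. This averaging scheme is an assumption made for illustration, not a confirmed reproduction of equations (5) and (6); the dictionary layout and function names are hypothetical.

```python
def crf_score(index_flags):
    """Fraction of DRW indexes passed on one CRF.

    len(index_flags) plays the role of the per-CRF item count.
    """
    return sum(index_flags.values()) / len(index_flags)


def participant_drw(crfs):
    """Average CRF score over the CRFs assigned to one participant."""
    return sum(crf_score(flags) for flags in crfs) / len(crfs)


# Hypothetical binary index results for one participant.
crfs = [
    {"RT": 1, "LSA": 1, "SR": 1, "IR": 0},  # symptom-type CRF: four indexes
    {"RT": 1, "LSA": 0},                    # data-type CRF: two indexes
]
drw = participant_drw(crfs)  # (0.75 + 0.5) / 2
```

Dividing by the number of indexes per CRF keeps symptom-type and data-type forms on the same 0 to 1 scale even though they carry different numbers of indexes.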

A reward can drive active participation, which is an intuitive fact. Recent scientific findings support the fact that a reward has an impact on data acquisition [29,49]. We provided financial future rewards for sustainable data collection as exchangeable cryptocurrency [50,76]. To give the reward, we developed a study result transfer function that delivers statistical information on the completed study to an external cryptocurrency system (Figure 4). The study participants periodically receive cryptocurrency as a reward for their data. The total amount of the reward depends on the external cryptocurrency system. However, the reward is an expected compensation after the end of the study, not an immediate profit. Therefore, we tried to induce sustainable participation through the expectation that active participation will earn high rewards [97]. Based on this expectation, we established a future reward distribution method to draw continuous management motivation from the creator and active involvement from participants.

We collect rewards for each completed study. $R_{total}$ is $R_{creator} + R_{participants} + R_{system}$, where each $R$ is a reward. $R_{creator}$ is the reward amount calculated from the compensation proportion of participants and the $\mu_{DRW}$ value of the study (7). If $\mu_{DRW}$ is low, this acts as a kind of penalty. We introduced this to encourage study creators to create a better plan and stimulate a study.

Here, $R_{participant}$ is obtained by subtracting $R_{creator}$ from $R_{total}$ (8). $K$ is the total number of participants. $DRW_k$ is the calculated DRW for the $k$th participant. The platform rewards participants for their efforts to input data. A participant receives a higher reward if he or she made a more-than-average effort. Otherwise, the participant gets a lower reward.

$R_{system}$ receives high compensation when $\mu_{DRW}$ is low. We added $R_{system}$ to recover the cost of wasted platform resources when careless participants generate unreliable data. We will use $R_{system}$ as a platform maintenance cost.
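The qualitative behavior of this reward split can be illustrated with a Python sketch. The formulas below are assumptions chosen only to match the properties stated in the text: the three parts sum to the total, the creator's share shrinks when the mean DRW is low, participants with higher DRW receive more, and the system's share grows when the mean DRW is low. The parameter `creator_share` and all names are hypothetical.

```python
def distribute_rewards(r_total, creator_share, drws):
    """Split r_total among creator, participants, and system (assumed scheme)."""
    if not drws:
        raise ValueError("at least one participant DRW is required")
    k = len(drws)
    mu_drw = sum(drws) / k
    # Creator: a fixed share scaled by the mean DRW, so low quality is penalized.
    r_creator = r_total * creator_share * mu_drw
    # Participants: an equal slice of the remainder, scaled by each DRW.
    pool = r_total - r_creator
    per_participant = [pool / k * d for d in drws]
    # System: whatever is not paid out, which grows as the mean DRW falls.
    r_system = r_total - r_creator - sum(per_participant)
    return {
        "creator": r_creator,
        "participants": per_participant,
        "system": r_system,
    }


# Hypothetical study: 100 reward units, 20% creator share, four participants.
result = distribute_rewards(100.0, 0.2, [1.0, 0.5, 0.75, 0.75])
```

In this toy run the mean DRW is 0.75, so the creator receives 15 units, the participants together receive 63.75 units in proportion to their DRW, and the remaining 21.25 units go to the system.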
The proposed platform is designed for a participatory trial. The characteristic of a participatory trial in terms of data collection is that people can become researchers and conduct experiments, systematically collecting data based on crowdsourcing and sharing the results with the public [47,53].

To achieve this goal, we have developed a platform that integrates all the methods presented above, so that each feature is systematically connected to the others. The [...] users can join the study. Participants of the study will be curious about the effectiveness of healthcare apps, so they are likely to enter data more seriously. Research data and basic information about the study are stored in the blockchain to prevent data falsification in vulnerable systems [76]. The study ends after the predefined duration of the research. Even if the number of participants is not enough to produce a sufficient amount of data, the study is closed at the pre-set endpoint. The system then [...]

Figure 5. Working logic overview of the platform. The platform integrates and connects methods for reliable data collection at all stages of study creation, study conduct, and study analysis. In the study creation stage, the creator can create, modify, and delete the study. When the creator activates the study, it is open to other participants, and they can participate. In the study conduct stage, the creator can collect the data from the participants and can monitor them. In addition, the creator and participants can monitor statistical information related to the study. In the study analysis stage, all data generated after the study is completed are disclosed to the study creator, the participants, and anonymous users. This makes data available to the experts and the public. The platform services allow users to prepare, collect, and distribute reliable data at each stage in the integrated environment.

Technically, we designed the pilot platform according to the interface segregation principle [103], which provides a user with a tool that uses only the services that the user expects. The platform consists of three layers: web interface, API engine, and database. The functions of each layer are based on Node.js, the Java programming language, and the PostgreSQL database. In addition, essential functions and the fundamental database schema related to the EDC system are designed using OpenClinica 3.12.2 [14]. We also inherited the EDC function to receive data in the Operational Data Model (ODM) XML format, which is compatible with the Clinical Data Interchange Standards Consortium [10,33]. Finally, we integrated the blockchain system to prevent data falsification for the crowdsourced data collection platform in participatory trials [76]. See the Supplement document for more information on the platform development.

[...] stage, DRW validation in the data storage stage, and the influence of expected rewards rather than immediate rewards in the data sharing stage. All participants filled in and signed an electronic consent form. In the quantitative measurements, the participants' data input was for the purpose of method validation. We did not collect any personally identifiable information during the tests. We used qualitative measures to demonstrate the benefits of a participatory trial, as compared with clinical and pragmatic trials.

6.1 Comparison to the simplicity of data preparation

We developed a protocol retrieval method based on the clinical trial protocol database and validated the method in our previous studies [74,75]. The methods are the core function of data preparation. In detail, we validated that the semantic filters of the database can provide more appropriate protocols than a keyword search [75]. We used clinical trial protocols and corresponding disease conditions extracted from clinicaltrials.gov as the gold standard set [107]. Our F-1 score was 0.515, and the score was [...] However, previous results were limited by providing tests solely for expert views.

Also, we obtained those results for a clinical trial, not a participatory trial. In contrast, the interpretable data preparation method we proposed is a publicly available system that applies the above technique from the perspective of a participatory trial. Therefore, we needed to further validate whether the proposed method is sufficiently simple for anyone to create an interpretable protocol to prepare data collection.

For the validation, we designed an experiment that collects Usefulness, Satisfaction, and Ease of Use (USE) scores from human participants to reflect their experience of the simplicity of the different methods [60]. We recruited human participants who represented general users using MTurk [73]. We used an expected mean of 1.1 and an expected standard deviation (SD) of 1.14 from the previous USE study [31]. We set a power of 95% and a one-sided level of significance of 5% to calculate the number of participants. The number of participants was 20 after adjusting the sample size for the t-distribution [23]. The participants of the experiment compared the data preparation function, which is related to protocol creation, between the proposed method and the existing expert method. We prepared the proposed data preparation method as a simulation program (PM), and the existing expert method used OpenClinica 3.14, an open-source EDC program (OC) [14]. Both programs recorded randomly generated participant IDs to identify test completion. After each use, the participants responded to the USE questionnaire, which consists of 30 items on a 7-point Likert scale for the comparison score and 6 short items for their opinion [60]. We reversed the 7-point Likert scale to make participants concentrate more on the questionnaire [11,41]. We restored the modified scores in the result calculation. In the experiment description, we provided only an elementary data preparation guide and no detailed manual. We also divided participants into two groups to prevent recall bias arising from the order of program usage [72]. We reversed the order of use for each group. Finally, we configured the hardware environment identically on the cloud-computing resources of Amazon Web Services so that only software differences would be observed [17].

After completing the experiment, we excluded participants who did not complete all USE items, gave the same value for every item, or skipped the test of either method system. Consequently, we obtained forty-two participants from the combined data set of the two groups (see Supplementary data). We computed descriptive statistics of both methods along four dimensions [31]. Accordingly, the 30 USE items broke down as follows: 8 usefulness (UU) items, 11 ease of use (UE) items, 4 ease of learning (UL) items, and 7 satisfaction (US) items. The statistics indicated that the proposed method scored higher than 5 in all USE dimensions (Figure 6 and Table 2). We also found that the scores showed negative skewness in all dimensions.
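The score handling described here, restoring a reversed 7-point scale and averaging by USE dimension, can be sketched in Python as follows. The sketch assumes that the 30 items arrive grouped in the order UU, UE, UL, US, which is a hypothetical layout; the item counts per dimension come from the text, while the function names and sample answers are illustrative.

```python
from statistics import mean

# Item counts per USE dimension as given in the text; the ordering is assumed.
DIMENSIONS = {"UU": 8, "UE": 11, "UL": 4, "US": 7}


def restore_reversed(score, scale_max=7):
    """Map a reversed Likert answer back to the usual direction (1 <-> 7)."""
    return scale_max + 1 - score


def dimension_means(raw_items):
    """Mean restored score per USE dimension for one participant."""
    if len(raw_items) != sum(DIMENSIONS.values()):
        raise ValueError("expected 30 USE items")
    restored = [restore_reversed(score) for score in raw_items]
    means = {}
    start = 0
    for name, count in DIMENSIONS.items():
        means[name] = mean(restored[start:start + count])
        start += count
    return means


# Hypothetical reversed answers from one participant.
raw = [2] * 8 + [3] * 11 + [1] * 4 + [2] * 7
use_means = dimension_means(raw)
```

Because the questionnaire was presented with a reversed scale, a raw answer of 2 corresponds to a restored score of 6, so the restoration must happen before any dimension mean is compared against the midpoint.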

As described in Table 2, the average USE score of PM was 5.548 (SD = 0.897), and that of OC was 4.3 (SD = 1.5). We conducted a statistical test to confirm whether the difference between the average scores was significant. To check the normality of the USE scores, we used the Shapiro-Wilk test [87]. The p-value was 0.082 for PM and 0.238 for OC, so both satisfied normality (p > 0.05). Therefore, we performed a paired-samples t-test, and the USE score was significantly higher for PM than for OC (t = 7.243, p ≤ 0.001). Because the four USE dimensions did not satisfy normality, we conducted the Wilcoxon signed-rank test on them [102]. The test showed significant differences between the two methods in all dimensions (Table 3). In addition, we found that the participants completed 100% of the tasks on PM, whereas on OC they generated only the basic metadata of the given tasks. Thus, we concluded that PM is more suitable for understandable data preparation in participatory trials.
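As an illustration of the paired comparison above, the following sketch computes the paired-samples t statistic using only the Python standard library. The five score pairs are invented for the example and are not study data, and `paired_t` is a hypothetical helper name. The normality check and the p-value would normally come from a statistics package and are not shown.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, degrees of freedom) for a paired-samples t-test."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    # Each participant contributes one difference between the two conditions.
    diffs = [a - b for a, b in zip(first, second)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


# Hypothetical mean USE scores of five participants under PM and OC.
pm = [5.5, 6.0, 5.0, 6.5, 5.0]
oc = [4.5, 4.0, 4.5, 5.0, 4.0]
t, df = paired_t(pm, oc)
print(f"t = {t:.3f}, df = {df}")
```

Because each participant used both programs, the test operates on per-participant differences rather than on two independent samples, which is why the degrees of freedom equal the number of pairs minus one.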

Figure 6. Box-and-whisker plots of the USE dimensions for the two methods
The proposed data preparation method (PM) scored higher than the existing expert method (OC) in all dimensions of the test.

DRW is a weight calculated from a score that measures the effort a user spends when entering data. To assess whether this score is measured well, we examined the correlation between DRW and the Human Intelligence Task Approval Rate (HITAR) of MTurk. The HITAR is a score that reflects how well a worker performs the tasks assigned by requesters on MTurk. We therefore assumed that a higher HITAR would lead to a higher DRW. To verify this assumption, we performed the following tests. We prepared the first test to confirm that the DRW indexes correlate with HITAR.

DRW indexes include response time (RT), long string analysis (LSA), individual reliability (IR), and a self-report (SR) [41,42]. We intended to build DRW from empirical indexes so that it can be calculated simply in an eCRF. Because DRW is a subset of IER, we designed a questionnaire to calculate DRW based on research on detecting IER [41]. The questionnaire contained 60 items of the International Personality Item Pool-Neuroticism, Extraversion, and Openness (IPIP-NEO) to calculate IR [45]. We also added 8 items of the infrequency IER scale to increase the concentration of participants at the beginning of the test and an SR item for the DRW score [65,77]. We placed a self-input HITAR item box at the start of the questionnaire as the LSA item. Thus, the questionnaire included 70 items.

To calculate the total sample size, we used an expected mean of the paired differences of 0.13, based on the result of our pilot DRW test with a total of 100 participants on MTurk. We set a power of 95%, a one-sided significance level of 5%, and equal group sizes for the sample size calculation [23]. The calculated sample size was 153 for each group, and the total size was 306. We recruited the participants from among MTurk workers. We divided all participants into two groups, with HITAR levels above 95 and below 90. Participants over 95 HITAR needed a record of more than 500 tasks before [...] because the initially granted HITAR is 100 and points are deducted when a task is not performed well. Participants read the task description posted on MTurk. All settings of the test were identical except for the HITAR levels. The endpoint of the test was set to the recruitment completion time of one of the groups. [...] indexes and DRW, groups with more than 95 HITAR scored higher than the other group (Figure 7 (a)).
In more detail, we performed a statistical analysis to understand the difference between the two groups. [...] through DRW calculation method. We checked that only passes a range of values with approximately 0.5 or more. We verified the statistical significance of the difference between the mean values of the two groups using the independent-samples t-test, with Levene's test to account for unequal variances [58]. The p-values indicate that the results are statistically significant. Only SR had similar values in the two groups. We presumed that, although the mean difference in SR was small, SR would have only a small effect on the calculation of DRW. Furthermore, we calculated the effect size, an indicator of how large the difference between the two groups is [64]. Because the two groups had the same sample size, we calculated the effect size using Cohen's d, and the result was close to a large effect (0.8) according to the conventional interpretation of Cohen's d [18].

Next, we examined the correlation between DRW values and HITAR. The obtained HITAR values were highly skewed (-2.138) because the average HITAR was relatively high at 83.91. To improve the data analysis and interpretation, we established a binning strategy: we divided the data into groups of equal frequency and smoothed the values in each group to their average. Specifically, the data processing was as follows. We sorted the data of 340 participants in descending order of HITAR. We grouped them in bins of 10 and averaged the HITAR, DRW, IR, SR, LSA, and RT values. Based on the groups, we ranked HITAR and took the reverse order. Thirty-four participants did not enter HITAR values correctly, and the data of 6 participants remained ungrouped. Accordingly, we removed these 40 records and placed the remaining 300 into 30 groups.
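The equal-frequency binning described above can be sketched as follows. This is a minimal illustration under stated assumptions: each record is a dictionary of numeric fields, invalid HITAR entries have already been removed, and the function name `equal_frequency_bins` and the generated values are hypothetical, not study data.

```python
from statistics import mean


def equal_frequency_bins(records, key, bin_size=10):
    """Sort by `key` (descending), cut into full bins, and average each bin.

    Records left over after the last full bin are returned separately,
    mirroring the ungrouped remainder described in the text.
    """
    ordered = sorted(records, key=lambda r: r[key], reverse=True)
    n_full = len(ordered) // bin_size * bin_size
    bins = []
    for start in range(0, n_full, bin_size):
        chunk = ordered[start:start + bin_size]
        # Smooth every field of the bin to its average value.
        bins.append({field: mean(r[field] for r in chunk) for field in chunk[0]})
    return bins, ordered[n_full:]


# Hypothetical data: 306 valid records with made-up HITAR and DRW values.
records = [{"HITAR": 100 - i * 0.1, "DRW": 0.9 - i * 0.001} for i in range(306)]
bins, leftover = equal_frequency_bins(records, key="HITAR")
print(len(bins), len(leftover))
```

With 306 valid records and a bin size of 10, the sketch yields 30 bins and 6 ungrouped records, which matches the counts reported in the text.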
To check the effectiveness of the binning, we used the information value (IV), which expresses the overall predictive power [86]. The measured IV (0.465) was greater than 0.3, so we judged the binning to be good. Based on the binning, the collected data showed a noticeable correlation between DRW and HITAR (Figure 7 (b)-(g)). DRW and HITAR were also significantly correlated (r = 0.696, p < 0.001). In summary, we validated that HITAR correlates with DRW and that DRW can be used as a reliability measure.

Recent studies have confirmed that systems with immediate virtual rewards have a positive effect on continuous data collection [68,78]. In contrast, we evaluated the validity of a reward distribution that gives rewards to participants in the future rather than immediately. We assessed this validity as the correlation between rewards and DRW. Accordingly, we evaluated the following scenario, in which the future reward system of our platform increases the participation rate.

We created a simulation environment on MTurk to reduce the observation time required in the real world. A real-world study of the effects of reward distribution would require significant time to collect the data and cases. For example, PatientsLikeMe required approximately 5 years to collect enough cases to show the benefits to communities (which were related to the extent of site use) [101]. Thus, we conducted two tests in the simulation environment. One test informed participants about the future reward distribution (as on our platform); the other test did not contain this information. The remaining settings were identical. For the sample size calculation, we conducted a pilot test with 100 participants on MTurk. From the pilot test, we obtained an expected mean of the paired differences of 0.11 and an SD of 0.25.
Based on the result, we calculated the sample size as 168 per group to achieve a power of 95% at a one-sided significance level of 5% [23]. The total sample size was 336. Then, for a consistent comparison, we used the same questionnaire as in the data reliability weight validation, except for the HITAR item. However, we needed to consider the casual observation characteristics of MTurk workers since this experiment had no constraint on HITAR [5]. Therefore, we added five questions that prompted participants to enter as many words as possible, for the LSA index of DRW [42]. [...] workers to the test that contained future reward information (RI) and the other 168 to the test without future reward information (NRI). We designed the test to be completed within two hours, and we recorded the test start and end times for each worker as RT. Consequently, we successfully collected data from 336 workers (Supplementary data 3).

First, we analyzed the effect of RI on the self-reporting rate using a two-way ANOVA. Table 6 indicates that the reward condition interacts with consecutive LSA items and that the interaction is statistically significant at an alpha level of 0.05 (p = 0.008). In other words, participants in the RI condition continued to enter long text in consecutive LSA items, and we inferred that the RI condition improves the self-reporting rate compared with the NRI condition. Next, we found that the average DRW of RI was 0.605 and that of NRI was 0.472 (Figure 8). Table 5 shows the significant difference between the DRW of RI and NRI (p = 0.03). Figure 8 shows that DRW(SR), DRW(LSA), and DRW(RT) of RI also have higher average values than the corresponding DRW indexes of NRI, and Table 5 provides detailed statistical information. DRW index values in Table 5 [...]

We developed "Collaborative research for us" (CORUS, https://corus.kaist.edu) as a pilot data collection platform for participatory trials.
CORUS is designed based on the EDC systems used in existing clinical trials, but it makes designing a study easier and provides services that allow users to participate voluntarily and continuously in data collection for the study. In this section, we compare and analyze the features of CORUS in detail. For specific comparisons, we selected the programs mentioned or used in journals related to clinical research from 2012 to 2019, which we introduced in the introduction section. We summarize the comparison results in Table 7 and provide the details in the following subsections.

CORUS has a protocol creation feature for interpretable data preparation that makes it easy for the public to use without the help of experts. On the platform, the creator can use the clinical trial protocol database feature according to the symptom that is semantically related to the effectiveness of a healthcare app. Moreover, the creator can get feedback from the study participants through the community feature, which allows the creator to enhance the protocol in the future.

The protocol creation feature of CORUS provides understandable and straightforward data preparation for public users. The system features of CORUS help users lead a study without expert knowledge. Moreover, CORUS has the DRW feature, which automatically calculates reliability scores of the collected data for reliable data collection. The scores are objective values that do not include the subjective intention of the study creator. We also developed the scores so that they can be used to evaluate the effectiveness score of apps in a further study of the data storage phase.

We developed CORUS as a participatory trial platform. The data preparation features of CORUS prepare data collection so that the data can be analyzed immediately.

With these features, postprocessing of unformatted data is not necessary. The features also minimize flawed study design bias [72]. Besides DRW, the data falsification prevention feature prevents transfer bias by not exposing the collected data to the participants during a study. CORUS, a participant-driven platform, also supports additional analysis by experts. Standardized data distribution is an essential feature for validating the effectiveness of an app. Cryptocurrency is a distinctive participation reward feature of CORUS. We connected CORUS to an external blockchain to encourage participants to share their data continuously. Participants can earn a future reward based on their data. CORUS calculates each participant's portion of the total cryptocurrency of a study based on [...]

Table 7 (column headings only). Program name: Rave EDC [34], Transparency Life Sciences [57], Trial Chain [105], PatientsLikeMe [101], Amazon Mechanical Turk [73], Google Play Store / Apple App Store [3,35], NHS / Rankedhealth [36,90], CORUS. First feature row: Data preparing, Clinical protocol database.

[...] intentionally or unintentionally during a study, exists in all platforms [32]. Trialchain used blockchain to solve the problem [105]. We classified Trialchain as an EDC based on the explanation of the platform, and we therefore considered Trialchain difficult to use in a participatory trial. CORUS also uses blockchain technology. We developed a data falsification prevention feature based on the data immutability of blockchain technology [76]. The feature solves the problem of data falsification.

We conducted an operational test to observe the data collection capabilities of the platform under actual participatory trial conditions. The objective of the test was to confirm whether an app that enables blue light filters on smartphones would help to improve sleep disorders.
Although previous studies have shown a relation between blocking blue light and improvement of certain sleep disorders, to the best of our knowledge, it has not been clinically confirmed whether blue light filters on smartphones or other mobile devices have the same effect [27]. The test was designed to recruit 100 participants. Based on the collected data, we could not find evidence of the effectiveness of the app.

We designed the participatory trial project and posted it on the test platform. In the trial, participants were asked to apply the blue light filters (which blocked blue-wavelength light on their smartphones) and report changes in the quality of their sleep. When users selected a specific project on the platform, the creators' introduction to the project was shown. This made it convenient for users to participate in their preferred projects. Participants enrolled in the project could immediately participate in the trial and conveniently report data. The platform also encouraged participants to report their data daily by using features such as leaderboards and alarms.

In addition to technical validation, we also validated whether the system could effectively recruit participants to the trial. Participant recruitment was originally scheduled to take a month, but the actual recruitment was faster: in total, 100 participants were recruited in 21 days. We compared the rate of recruitment in this experiment with those reported in previous studies. To compare across trials, the recruitment rate was defined as the number of participants recruited per month (or participants per center per month, in the case of multicenter trials). In 151 traditional clinical trials supported by the United Kingdom's Health Technology Assessment program, the recruitment rate was 0.92 [98]. In 8 web-based or mobile app-based studies collected from literature databases, an average of 468 participants were recruited over 5 months (recruitment rate = 93.6) [54]. For the platform, the recruitment rate was 142.8. This shows that participant recruitment using the platform was considerably more effective than participant recruitment in traditional clinical trials; it was also competitive with other web-based or mobile app-based studies. At this stage, which was the beginning of the study period, the response rate also tended to increase over time. The response rate increased to 120% on recruitment days 17-18, when many new participants were enrolled, and it remained high for a significant period of time (Figure 9).

[...] [72,89]. At the stage of data preparation, we identified flawed study design, channelling bias, and selection bias. We tried to prevent flawed study design by using the existing clinical trial protocols. We minimized channelling bias, which leads to including only specific participants, by setting unconditional eligibility criteria under which anyone can participate in a study.
The eligibility criteria that a creator can edit, such as gender and age, remain a matter to consider for channelling bias. However, selection bias, which can distort the tendency of the data, is not addressed. This is a difficult problem to solve in the field of web-based data collection [7]. As a solution, we propose a social relationship distance estimation method to keep participants beyond a certain distance from one another. One example is a social network estimation method that calculates a closeness rank among participants and cuts off those at or below a certain distance [82]. At the stage of data storage and data sharing, we identified interviewer bias and transfer bias. Interviewer bias refers to the potential influence of the data collector on the participants. Our data storage method involves only mechanical data collection functions, so we assumed that interviewer bias does not occur. However, we do not have any preventive measure against interviewer bias caused by the study description or the CRF items that a creator manually generates. The solution is to clearly define the digital biomarkers that a creator uses to evaluate the effectiveness of a digital healthcare app [20]. Further research and consensus among experts are needed to enable standardized participatory trial protocols with digital biomarkers. Transfer bias in our method is caused by data exposure at the data storage stage and by the possible influence of the stored data on other participants. We prevented this source of bias by not exposing the stored data during a study. Finally, we did not consider the bias generated at the stage of data analysis because it is out of the scope of this article, which deals only with data collection methods.
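For reference, the recruitment rates quoted for the operational test can be rechecked with a few lines of arithmetic. The sketch below assumes a 30-day month, which the text does not state; under that assumption, 100 participants in 21 days gives about 142.86 per month, which agrees with the reported 142.8 up to rounding. The names `DAYS_PER_MONTH` and `monthly_rate` are hypothetical.

```python
DAYS_PER_MONTH = 30  # assumption: the month length is not stated in the text


def monthly_rate(participants, days):
    """Participants recruited per month, given a recruitment period in days."""
    return participants / days * DAYS_PER_MONTH


platform_rate = monthly_rate(100, 21)  # operational test on the platform
web_rate = 468 / 5                     # web or mobile app-based studies [54]
clinical_rate = 0.92                   # traditional clinical trials [98]

print(f"platform: {platform_rate:.2f} per month")
print(f"web/mobile studies: {web_rate:.1f} per month")
print(f"platform vs. clinical trials: {platform_rate / clinical_rate:.0f}x")
```

The comparison is descriptive only: the three rates come from different study designs, so the ratio indicates scale rather than a statistically tested difference.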

There are additional considerations that may improve our platform. First, the current platform cannot provide a quantitative comparison of the search results of the clinical trial protocols. The platform takes a symptom from the creator and provides all relevant clinical trials. However, when many clinical trials are returned as search results, additional analysis on the creator's side is required. Clinical trials involve various types of data: not only symptoms but also genes and chemical compounds. This information can be used to retrieve the clinical trial data required by the user. If an advanced platform that can quantitatively assess the relevance of all data involved in a clinical trial is developed in the future, it will reduce the additional analysis burden for the creator. Second, advanced DRW methods should be developed. For example, the basic DRW assesses the reliability of the input data based only on the length of the input text, as in long-string analysis (LSA). In other words, it cannot consider semantic errors in the input text. Therefore, an advanced method that considers semantic errors in the text is required. In addition, calculating the cutoff value is problematic when the SD of the collected data of a DRW index is 0. We avoided the problem with a predefined minimum cutoff value on the platform, but we need to improve this approach so that we can actively find cutoff values suited to the collected data. Third, we will devise an optimal strategy to reduce the cost of running the platform. Fourth, an additional study is required to verify the platform. We plan to conduct a small-scale pilot study to check whether the study results are the same as the verified effects.

All these limitations should be considered in further experiments or improved computational methods. Moreover, we suggest further studies because data collection methods are the foundation of the digital healthcare field, which is progressing rapidly. The suggested studies are the following: (1) developing a standard for a participatory trial protocol method that considers digital markers and confounding factors; (2) advancing protocol creation methods for the determination of specific parameters, such as the number of uses, effective report counts, study duration, or the number of participants; (3) using natural language question models to generate DRW index questions rather than human-managed templates; (4) optimizing the data reliability weight calculation with the importance value of each index; and (5) implementing an effectiveness score calculation method based on the collected data with DRW scores. With these improvements and further studies, our platform could be used as a valuable tool to assess the effectiveness of various apps in a cost-effective manner.

As the first stage of the verification workflow, we proposed a reliable data collection method and platform to assess the effectiveness of digital healthcare apps. We presented a participatory trial concept for a new type of data collection trial that relies on the voluntary participation of end-users of digital healthcare apps. Then, we described the essential methods of the reliable data collection platform based on the participatory trial concept. The interpretable data preparation methods consisted of a protocol database and a retrieval system to create a protocol without expert knowledge. We validated the simplicity of the methods by comparing USE scores between the proposed system and the existing system. We developed a DRW calculation method for systematic data storage. The DRW score proved to be a reliable measure through its correlation with HITAR. To achieve sustainable data collection and sharing, we developed a reward distribution method. We indirectly observed the effect of the reward using DRW. The reward increased DRW; that is, it increased the participants' effort toward sustainable data collection. Finally, we implemented a pilot platform, CORUS, which integrates all the methods and essential features for a participatory trial platform. We compared its features with existing platforms in the fields of clinical trials, pragmatic trials, and participatory trials. We assert that CORUS has all the necessary features to collect reliable data from the users of digital healthcare apps on the path to validating their effectiveness.

Code Availability

We provide all the source code at https://github.com/junseokpark/corus upon publication. The platform is open-source to promote development in the public domain. The repository describes the software requirements and includes a README.MD file.