Sentinels and Twins: Effective Integrity Assessment for Distributed Computation

Distributed computing supports large-scale and data-intensive computations with the cooperation of a multitude of parties, each responsible for a portion of the workload. Such parties are often not fully reliable and may return incorrect results. In this article, we address the problem of assessing the integrity of the computation results. We provide a comprehensive characterization of two techniques, sentinels and twins, evaluating their effectiveness and synergy. Sentinels are pre-computed tasks whose result is known a priori, and enable checking returned results against a ground truth. Twins are replicated tasks assigned to different workers, and enable cross-checking the results returned for the same task. The analysis considers many questions that arise in the design of a concrete integrity assessment strategy and identifies the parameters that have a critical impact on the overall protection. Our model enables tuning the integrity controls to achieve the best effectiveness. The model can be applied to a variety of scenarios and offers guidelines that can find extensive application.


INTRODUCTION
DISTRIBUTED computing has become the norm for the management of large computational problems, which can be decomposed into multiple sub-problems, each assigned to a different device. A demonstration of the importance of representing large-scale computations as a large number of independent tasks is the success of modern MapReduce architectures, like Apache Spark. A particularly important application of this paradigm occurs when machine learning is integrated with big data, arguably the topic in the IT domain currently receiving the greatest attention. A common feature of this integration is the need to extract knowledge from large collections of data, using an approach where a model is first built in a training phase and then applied over an extremely large number of instances in the prediction phase, where each instance is classified. The computational requirements of the prediction phase can be extreme. For instance, in environmental monitoring it is today possible to use a large number of sensors (e.g., microphones and cameras) collecting large volumes of data, which are analyzed to identify specific subjects (e.g., elephants in the African jungle [1] or snow leopards [2]). Hospitals want to apply image analysis techniques to large collections of medical images. In all these cases, there is the need to process millions or even billions of jobs, requiring the use of large infrastructures. The motivation for outsourcing computations to external parties (workers) can be both the need for high-performance computational capabilities and economic convenience.
A clear concern in such distributed outsourced scenarios is the lack of control over the jobs' computation and hence the uncertainty about the correctness of the results returned by the different workers in the system. While one may assume an overall proper behavior, the open nature of the system is clearly vulnerable to possible misbehavior by workers, which can be sloppy in their operation or, even worse, intentionally misbehave (to get rewards without employing the needed resources), and therefore be opportunistic in their responses.
The problem is well known and recognized by the research and industrial communities, which have devoted attention to the development of techniques to assess the integrity of the results of computations outsourced to external parties. A promising approach to assess integrity in contexts where computations are not fixed and predefined (and therefore authenticated data structures providing deterministic integrity guarantees are difficult to use) relies on probabilistic techniques. These can always be applied when each portion of the problem assigned to a worker can be structured as a collection of jobs, each producing a result. It is then possible to inject jobs against which the behavior of workers is controlled. The most common probabilistic techniques either (a) insert jobs whose result is known a priori, alerting for violations whenever results differ from the known one, or (b) replicate jobs to multiple workers, alerting for violations whenever results from different workers in response to the same replicated job differ. While the two techniques are known and well recognized, the problem of their targeted generation and combination, so as to provide the best effectiveness for integrity guarantees, is still an open issue.
We address this problem and propose a model to reason on the combined use of pre-computed and replicated jobs, which we call sentinels and twins, respectively, so as to provide the best effectiveness. We frame our work in the context of a data classification problem, which allows us to capture a variety of scenarios, and investigate different issues that naturally arise in the application of such controls. Our investigation produces an improved characterization of each technique and provides a response to many questions: What are the aspects that have an impact on the effectiveness of sentinels and twins? How should sentinels be generated, and twins be distributed among workers, so as to provide the best integrity guarantees? How many replicas should be used to get the best effectiveness? When are twins more effective than sentinels (or vice versa)? Given an application domain, what combination of sentinels and twins provides the best protection? The results of the investigation show the effectiveness of the two techniques when carefully combined based on our findings. Our work then represents a reference for the application of probabilistic techniques, supporting the realization of efficient integrity assessment solutions for many domains.

SCENARIO AND BASIC CONCEPTS
Our problem is to enable a client to outsource data processing to possibly untrusted workers in a distributed system, while enjoying integrity guarantees on the returned results. There can be multiple reasons for a computation result to be incorrect: a defect, a temporary misconfiguration, or a malicious action by the worker (which may want to either sabotage the computation or get the reward for computing jobs without actually performing them). From an integrity-assessment point of view, there is an integrity issue regardless of whether the incorrect result has been caused by failure, malfunctioning, sloppiness, or intentional opportunistic behavior, since they all have the effect of the worker not correctly computing the jobs assigned to it. In our analysis, we consider the problem of detecting misbehavior of intelligent workers, which intentionally omit computation and behave opportunistically in their responses to avoid being detected in their omissions. The reason for considering intentional misbehavior is to set our work in the worst (most difficult to discover) scenario. Indeed, techniques that are able to withstand the action of an intelligent worker would also offer integrity guarantees when the violation is produced by accidental anomalies. In this way, we also cover the case of anomalies that exhibit a behavior that may otherwise be hard to detect (e.g., when a defect produces as a result the most common answer expected from the computation).
In our scenario, we assume workers to be computers executing a deterministic program. For concreteness, we consider a classification problem computing, for each data item in a collection, the class associated with it. The consideration of a generic classification problem allows us to capture different application scenarios characterized by processing tasks producing results in a finite domain of values. Common computational tasks can be considered as classification jobs with an extremely large number of classes. For instance, the prediction phase in machine learning can be seen as a classification problem, whose goal is to classify each instance according to the model defined in the training phase. Formally, we define a classification problem as follows.
Definition 2.1 (Classification). Given a set D = {d_1, ..., d_n} of data items and a set C = {c_1, ..., c_c} of classes, a classification is a function g : D → C that assigns each data item d_i ∈ D to a class c_k ∈ C.
A classification function g can be characterized by a probability mass function P_D that describes the probability of a data item belonging to each class in C. For concreteness, in the paper, we refer our examples to a classification over a set C = {c_1, c_2, c_3, c_4} of four classes and three representative instances of probability mass function (Fig. 1). The considered distributions follow a Zipf's law with: α = 0, modeling a uniform distribution; α = 1, representing the distribution of classes known to be common in many domains; and α = 7, representative of a very skewed distribution. We focus the initial analysis on classification jobs with a limited output domain, because in these cases it is easier to characterize the distribution of output values. This is not a limitation since, as our analysis will show, the number of classes has negligible impact on the effectiveness of the techniques, which depends instead on the probability of the most frequent class (regardless of the number of classes). The results of our analysis then apply to generic classification problems and data distributions, including those with only two or an extremely large number of classes (with this latter capturing generic computations with a wide variety of possible results).
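The three representative distributions can be generated directly from the Zipf's law definition. A minimal sketch (the helper name `zipf_pmf` is ours): class rank k receives unnormalized weight 1/k^α, and α = 0 degenerates to the uniform distribution.

```python
def zipf_pmf(num_classes: int, a: float) -> list[float]:
    # Unnormalized Zipf weights 1/k^a for ranks k = 1..num_classes,
    # normalized so that the probabilities sum to 1.
    weights = [1.0 / (k ** a) for k in range(1, num_classes + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# The three representative distributions over C = {c1, c2, c3, c4}
for a in (0, 1, 7):
    print(a, [round(p, 4) for p in zipf_pmf(4, a)])
```

For α = 1 this yields [0.48, 0.24, 0.16, 0.12], the distribution used in the running examples below; for α = 7 the most frequent class absorbs more than 99% of the mass.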
We assume the classification problem to be deterministic and complete. Deterministic means that we assume a diagonal confusion matrix and an accuracy of 100% in the classification process (i.e., we assume workers to be machines running the same algorithm, in contrast to human beings performing a task). Then, an incorrect result is due to an incorrect (defective, sloppy, or malicious) computation. For instance, with reference to machine learning, the execution of a prediction job deterministically returns a class, as it returns the deterministic result produced by the application of the model obtained in the training phase. This holds even when the returned result does not enjoy perfect quality. As an example, for the problem of identification of snow leopards in camera images, the model may sometimes miss the presence of the leopard, but this does not affect the determinism of the model, and hence of the result of the job execution. Completeness implies that every data item is associated with a class in C. Note that a partial classification problem can be made complete by simply introducing a 'dummy' class to which data items not of interest for the classification will be mapped.
In the following, we refer to the computation of the classification of an individual data item as a job (short for classification job). Outsourcing of jobs to external, possibly untrusted, workers is then formalized as an allocation function defined as follows.
Definition 2.2 (Job Allocation). Given a set D = {d_1, ..., d_n} of data items on which a classification is to be computed, and a set W = {w_1, ..., w_w} of workers, an allocation is a function ν : D → W that assigns each data item d_i ∈ D to a worker w_k ∈ W.
We use δ(w) to denote the set of data items assigned to worker w. For simplicity, we assume an even distribution of jobs across workers, that is, the number of jobs assigned to any two workers differs by at most one. Formally, ∀w ∈ W: ⌊γ⌋ ≤ |δ(w)| ≤ ⌈γ⌉, with γ = |D|/|W|. Also, for simplicity in the formulation, we assume the number of data items in D to be a multiple of the number of workers, and therefore that all workers have exactly the same workload. This is not a limitation: a model with a heterogeneous assignment of jobs would make the analysis more complex, but it would produce identical results with respect to the efficiency and effectiveness of the integrity assessment techniques.
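One simple way to realize an allocation satisfying the even-workload constraint above is round-robin assignment; the following is a minimal sketch (function and variable names are ours).

```python
def allocate(items: list[str], workers: list[str]) -> dict[str, list[str]]:
    # Round-robin allocation: item d_i goes to worker w_(i mod w), so the
    # workloads of any two workers differ by at most one job.
    assignment = {w: [] for w in workers}
    for i, d in enumerate(items):
        assignment[workers[i % len(workers)]].append(d)
    return assignment

delta = allocate([f"d{i}" for i in range(1, 11)], ["w1", "w2", "w3"])
sizes = [len(jobs) for jobs in delta.values()]
assert max(sizes) - min(sizes) <= 1   # floor/ceiling constraint on |delta(w)|
```

With 10 items and 3 workers, the workloads are 4, 3, and 3, matching the ⌊γ⌋ and ⌈γ⌉ bounds for γ = 10/3.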
We use g i to denote the classification performed by a worker w i , and g W to denote the classification performed by all workers (i.e., the union of the classifications computed by all the considered workers).
Our problem is then allowing the client to assess the integrity of classification g_W. We distinguish between honest and lazy workers. An honest worker correctly performs the jobs assigned to it. A lazy worker omits some of the jobs assigned to it, returning for them a result that is freely chosen by the worker. Lazy workers can however possibly behave opportunistically on the omitted jobs, with the aim of not being discovered (e.g., exploiting possible knowledge on the classification job). For simplicity, we assume lazy workers to be independent entities that do not communicate with each other. The impact of collusion would only be a reduction in the effectiveness of twins (see Section 7).

INTEGRITY CHECKS
Our work is based on the inclusion, in the jobs to be outsourced, of additional jobs that serve as checks for integrity assessment (Fig. 2). For such additional jobs, we consider two complementary approaches: i) insertion of jobs whose result is known a priori; ii) replication of the same job to different workers. These two approaches well reflect state-of-the-art techniques typically considered for integrity control. Controls of the first type correspond, for example, to sentinels and markers (e.g., [3], [4], [5], [6]), and watermarks (e.g., [7]). Controls of the second type correspond to classical replication (e.g., [8], [9]). In our work, we refer to jobs of the first kind as sentinels and to jobs of the second kind as twins, characterized as follows.

Sentinels. A sentinel is a data item generated ad hoc by the client and whose classification (i.e., the job's result) is known a priori. In the following, when clear from the context, we will use the term sentinel to refer interchangeably to the sentinel data item or its classification job. Receiving for a sentinel a result different from the known classification signals an integrity violation. Formally, a sentinel d signals an integrity violation whenever g_i(d) ≠ g(d), with ν(d) = w_i.

Twins. A twin is a replica of a data item whose classification job is submitted to more than one worker. Like for sentinels, in the following, when clear from the context, we will use the term twin to refer interchangeably to a replicated data item or the corresponding classification job. We will use the term twin set to refer to a data item and its replicas. We also note that, while for simplicity we refer to a replicated data item, in practice there is no need to actually create a replica of the data item: it is sufficient to allocate the corresponding job to multiple workers. Formally, the allocation function ν is extended to possibly assign each data item to a set of workers (in contrast to a single one), that is, ν : D → P(W). A classification job can be replicated as many times as wished, and the identification of the optimal number of replicas to be used is one of the contributions of this paper. The name twin reflects the fact that, according to our analysis (Section 5), the best effectiveness is achieved when a twin set has cardinality 2 (i.e., the original data item plus one copy). Receiving inconsistent results for twin jobs signals an integrity violation. Formally, a twin set signals an integrity violation whenever ∃w_i, w_k ∈ ν(d) such that g_i(d) ≠ g_k(d).
Clearly, to be effective, integrity checks should not be recognizable as such by the workers, which could otherwise go undetected in their possible misbehavior by simply performing well on the jobs on which they know to be controlled. For twins, allocating twin jobs to different workers, which we consider as a good design principle for such kind of control (as distribution of replicas to different workers provides a natural cross-check), already guarantees such a property. For sentinels, such a property has to be taken into account in the generation of sentinel jobs.
The contribution of this paper is to evaluate how sentinels and twins should be produced to be most effective, and how integrity controls should be distributed among sentinels and twins so as to provide the greatest integrity guarantees. We first analyze sentinels and twins independently, and then investigate their combination. Table 1 summarizes the notation used in this paper.

SENTINEL ANALYSIS
We start by analyzing the effectiveness of integrity assessment through sentinels. We first focus on a single worker, which we assume to be lazy, because each sentinel controls the behavior of one worker (the one in charge of its evaluation) and its effectiveness is not influenced by the behavior of other workers. The result of the analysis applies in general to any of the workers. We then generalize the results to assess the effectiveness of sentinels in the system.
The goal of our analysis on sentinels is to determine how the number of sentinels (or, more precisely, their percentage with respect to the overall number of jobs assigned to the worker) and their distribution in the different classes affect integrity guarantees. Clearly, the amount of sentinels affects integrity guarantees: the higher the percentage of jobs on which a worker is checked, the lower the probability of the worker going undetected when omitting jobs. As we will show in this section, the distribution of sentinels in classes also plays a role in the effectiveness of the control. In the following, we first evaluate the probability of a lazy worker going undetected by sentinel control when omitting some of its jobs (Section 4.1). With such probability turning out to also depend on the distribution of sentinels in classes, we then analyze possible sentinel distribution strategies and their effectiveness (Section 4.2). Finally, we extend this result to the whole collection of workers (Section 4.3).

Probability Analysis
A lazy worker passes sentinel control if it performs correctly on all the sentinels, that is, if the classification returned for all sentinels coincides with the classification known to the client. This happens when the worker actually performs the job, or when the worker does not perform the job but returns a correct result for it. The probability of the worker going undetected (i.e., returning a correct result for all sentinels) when omitting some of its jobs depends then on: 1) the probability of sentinels falling among the omitted jobs (since only sentinels are controlled, omission of genuine jobs goes undetected); and 2) the probability that the result returned for omitted sentinels is correct. Let us then analyze each of these probabilities.

Probability of omitting sentinels. Consider a set of j jobs comprising s sentinels, and assume the worker omits o of the jobs. The probability that o_s of the o omitted jobs are sentinels (i.e., the probability that the worker omits o_s of the s sentinels) follows a hypergeometric distribution:

p_omit_sent(o_s) = [C(s, o_s) · C(j−s, o−o_s)] / C(j, o)

where C(n, k) denotes the binomial coefficient. The higher the number s of sentinels, the higher the probability of omitting o_s of them. For instance, the probability that the worker omits 5 sentinels is: 2.72% using 20 sentinels, 18.27% using 40 sentinels, and 18.51% using 60 sentinels.

Probability of correctly guessing a job. Even when omitting some sentinels, the worker could go undetected if the result it returns for them is correct. The probability that the worker correctly guesses the class of a sentinel when omitting its computation depends on the strategy adopted by the worker when selecting the classes to be returned for omitted jobs, and on the distribution of sentinels into classes. The assignment, by the worker, of omitted jobs to classes can be modeled as a probability mass function P_O. Hence, P_O(c_i) is the probability that the worker assigns class c_i to an omitted job.
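The hypergeometric omission probability can be checked numerically; the sketch below (function name ours) assumes j = 200 jobs and o = 20 omissions, our reading of the setting behind the quoted percentages (it matches the simulation parameters used later for Fig. 3).

```python
from math import comb

def p_omit_sent(j: int, s: int, o: int, o_s: int) -> float:
    # Hypergeometric probability that exactly o_s of the o omitted jobs
    # are sentinels, when s of the j jobs assigned to the worker are sentinels.
    return comb(s, o_s) * comb(j - s, o - o_s) / comb(j, o)

# Probability of omitting exactly 5 sentinels, for s = 20, 40, 60 sentinels
for s in (20, 40, 60):
    print(s, round(p_omit_sent(200, s, 20, 5), 4))
```

Under these assumptions the three values come out at about 2.7%, 18.3%, and 18.5%, in line with the percentages quoted above.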
If the worker randomly extracts the class of the omitted job from P_O, the probability that it guesses the correct class of an omitted sentinel is:

p_guess_sent = Σ_{k=1..c} P_O(c_k) · P_S(c_k)

where P_S describes the probability mass function of sentinels (i.e., their distribution) in classes. Indeed, the probability of correctly guessing a class c_i to which a sentinel belongs is the product of the probability that the worker chooses c_i for the omitted sentinel job (i.e., P_O(c_i)) and the probability that c_i is the correct class for it (i.e., P_S(c_i)). Table 2 illustrates the probability for a lazy worker to correctly guess the class of an omitted sentinel (p_guess_sent), considering different distributions for P_S (following Zipf's law with α = 0, α = 1, and α = 7) and different strategies P_O for the worker to classify omitted jobs. In particular, we consider three possible strategies that the worker can choose to classify omitted jobs: always in the most frequent class in P_S (i.e., P_O = [1,0,0,0]); according to a uniform distribution (i.e., P_O = [0.25,0.25,0.25,0.25]); or according to P_S (i.e., P_O = P_S). For instance, when P_S is a Zipf's law with α = 1 and the worker classifies all the omitted jobs in the most frequent class, p_guess_sent is 48%. The table shows that, independently of P_S, the most convenient strategy for the worker to maximize p_guess_sent consists in classifying all the omitted jobs in one class (ideally, the most frequent in P_S, which the worker however does not know).
In fact, by distributing its guesses among different classes, the worker has a greater probability of assigning a wrong class to an omitted sentinel. This is visible in Table 2, where the first column always has values greater than, or equal to, the ones in the other columns.

TABLE 1: Notation
g: number of genuine jobs assigned to a worker
j: number of genuine and control jobs assigned to a worker
s: number of sentinels assigned to a worker
s_tot: overall number of sentinels
t: number of twins assigned to a worker
T: overall number of twin sets
r: number of replicas in a twin set
l_T: number of lazy workers in a twin set
l: overall number of lazy workers
o: number of jobs omitted by a lazy worker
o_s: number of sentinels omitted by a lazy worker
P_D: probability mass function of genuine data in classes
P_S: probability mass function of sentinels in classes
P_O: probability mass function adopted by a worker for omitted jobs
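The guessing probabilities of Table 2 can be reproduced in a few lines; a minimal sketch (function name ours), using the Zipf α = 1 sentinel distribution P_S = [0.48, 0.24, 0.16, 0.12]:

```python
def p_guess_sent(P_O: list[float], P_S: list[float]) -> float:
    # Probability that a class drawn from P_O matches the class of a sentinel
    # drawn from P_S: sum over classes of P_O(c_k) * P_S(c_k).
    return sum(po * ps for po, ps in zip(P_O, P_S))

P_S = [0.48, 0.24, 0.16, 0.12]                # sentinels follow Zipf, a = 1
print(p_guess_sent([1, 0, 0, 0], P_S))        # all omissions in one class: 0.48
print(p_guess_sent([0.25] * 4, P_S))          # uniform guessing
print(p_guess_sent(P_S, P_S))                 # worker guesses following P_S
```

Concentrating all guesses on the most frequent class dominates the alternatives, consistent with the observation that spreading guesses only lowers the worker's success probability.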
Probability of passing sentinel control. Since the classifications of different jobs are independent events, the probability that a worker passes integrity control when omitting o_s sentinels is the product of the probability of omitting o_s sentinels and the probability of correctly guessing their classification. Formally, such probability is p_omit_sent(o_s) · (p_guess_sent)^{o_s}, with the second term reducing to 1 when o_s = 0 (i.e., no sentinel is omitted). Therefore, the probability of the worker passing sentinel control (i.e., the probability of a worker omitting o jobs going undetected in its omissions) is the sum, over all possible values of o_s, of the probabilities of passing sentinel control when omitting o_s sentinels. Formally:

p_pass_sent = Σ_{o_s=0..min(s,o)} p_omit_sent(o_s) · (p_guess_sent)^{o_s}

Note that the sum terminates at min(s, o) since the worker cannot omit more sentinels than either the number of sentinels it has received or the number of jobs it omits. Fig. 3 compares the values of p_pass_sent, varying the number s of sentinels and the number o of omitted jobs, obtained analytically and through 1000 Monte Carlo simulations. In the simulations, we considered a worker in charge of the classification of j = 200 jobs that can be classified into 4 classes C = {c_1, c_2, c_3, c_4}. The probability mass function P_D is a Zipf's law with α = 1, sentinels are distributed according to a uniform distribution (i.e., P_S is uniform), and the worker classifies all the omitted jobs in the most frequent class in P_D (i.e., P_O is [1,0,0,0]). As visible from the figure, the analytical and numerical values nicely match.
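The analytical formula and the Monte Carlo comparison can be sketched as follows (function names ours; with P_S uniform over 4 classes and P_O = [1,0,0,0], p_guess_sent is 0.25, as in the simulation setting described above).

```python
import random
from math import comb

def p_omit_sent(j, s, o, o_s):
    # Hypergeometric probability of omitting exactly o_s of the s sentinels.
    return comb(s, o_s) * comb(j - s, o - o_s) / comb(j, o)

def p_pass_sent(j, s, o, p_guess):
    # Sum over o_s of the probability of omitting exactly o_s sentinels
    # and then guessing all of them correctly.
    return sum(p_omit_sent(j, s, o, o_s) * p_guess ** o_s
               for o_s in range(min(s, o) + 1))

def simulate(j, s, o, p_guess, runs=20000, seed=0):
    # Monte Carlo check: jobs 0..s-1 are the sentinels; the worker omits o
    # jobs at random and passes if every omitted sentinel is guessed right.
    rng = random.Random(seed)
    passed = 0
    for _ in range(runs):
        o_s = sum(1 for x in rng.sample(range(j), o) if x < s)
        passed += all(rng.random() < p_guess for _ in range(o_s))
    return passed / runs

analytic = p_pass_sent(200, 20, 20, 0.25)
print(round(analytic, 3), round(simulate(200, 20, 20, 0.25), 3))
```

With 20,000 runs the simulated frequency falls within Monte Carlo noise of the analytical value, mirroring the agreement reported for Fig. 3.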

Sentinels Distribution
The client can operate on two factors in injecting sentinels: the number s of sentinels, and their distribution P_S into classes. Indeed, the other parameters in the formula of p_pass_sent are not under the control of the client. Clearly, the higher the number of sentinels, the higher the probability of the worker hitting them (i.e., the higher o_s) when omitting jobs. The average value of o_s is o·s/j, which is the mean of the hypergeometric distribution regulating p_omit_sent. However, an omitted sentinel can be detected only if the response returned for it is different from the correct one, and here is where the distribution of sentinels into classes comes into play.
We identify three possible strategies that the client can use for distributing sentinels into classes.
Genuine data distribution (P_S = P_D): sentinels are distributed into classes following the same distribution as the data. The rationale behind such a strategy is to follow the cardinality of the genuine jobs in assigning sentinels, so that each class receives a number of sentinels according to its expected cardinality.

Normalized inverse distribution of genuine data (P_S = P_D^(−1)): sentinels are distributed into classes following the normalized inverse of P_D. Therefore, the most frequent class in P_D is the least frequent in P_S (and vice versa). The rationale behind such a strategy, working opposite to the one above, is to inject more control with a result that is less expected.

Uniform distribution (P_S uniform): sentinels are distributed equally among classes, regardless of the data distribution. The rationale behind such a strategy is to make any guess equally likely to be correct.

Note that a limited number of sentinels is sufficient to provide high integrity guarantees (see Section 6.3). Hence, the addition of sentinels is not expected to considerably modify the distribution of genuine values.
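The three client strategies can be sketched as a small helper (names ours). Under P_D given by Zipf's law with α = 1, the normalized inverse strategy reverses the ranking of the classes:

```python
def normalize(weights: list[float]) -> list[float]:
    total = sum(weights)
    return [w / total for w in weights]

def sentinel_distribution(P_D: list[float], strategy: str) -> list[float]:
    # The three sentinel distribution strategies discussed above.
    if strategy == "data":          # P_S = P_D
        return list(P_D)
    if strategy == "inverse":       # P_S = normalized inverse of P_D
        return normalize([1.0 / p for p in P_D])
    if strategy == "uniform":       # P_S uniform, regardless of P_D
        return [1.0 / len(P_D)] * len(P_D)
    raise ValueError(strategy)

P_D = [0.48, 0.24, 0.16, 0.12]      # Zipf, a = 1
print([round(p, 3) for p in sentinel_distribution(P_D, "inverse")])
```

For this P_D the inverse strategy yields [0.1, 0.2, 0.3, 0.4]: the most frequent data class receives the fewest sentinels, and vice versa.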
The effectiveness of sentinel control depends also on the strategy that the lazy worker can adopt at its side. As noted in the previous subsection, the lazy worker has no convenience in distributing omitted jobs among different classes (as its probability of correctly guessing the class would naturally decrease), and its best bet is instead to consistently assign omitted jobs to a single class. Intuitively, such a class would be the most probable one according to the data distribution; but, since only sentinel jobs are actually checked, the worker should reason on what would be the most probable correct guess for sentinels themselves (i.e., P_S in contrast to P_D). This observation implies two possible choices for the default class that a lazy worker could assign to omitted jobs: the most frequent one in the data distribution, assuming sentinels follow the same distribution as the data; or the least frequent one in the data distribution, assuming the hypothesis above to be too obvious, and therefore expecting cleverness at the client side in injecting control with results that are less expected. Fig. 4 illustrates the probability of the worker passing integrity controls (i.e., the value of p_pass_sent) for the different possible combinations of the client (P_S) and worker (P_O) strategies. It is immediate to see that a possible strategy for the client to follow a non-uniform distribution for sentinels, opting for either the same distribution as the data or the normalized inverse one, is exposed to the risk of the lazy worker actually succeeding in its opportunistic assignment of omitted jobs to the most frequent, or least frequent, class of the data distribution. This is well visible in Fig. 4, which shows the high probability of the worker going undetected in omitting jobs when choosing the most frequent class (when the client distributed sentinels with the same distribution as the data, Fig. 4a) or the least frequent class (when the client distributed sentinels with a distribution inverse to the one of the data, Fig. 4d). Uniform distribution for sentinels instead proves resilient to possible opportunistic guesses (Figs. 4e and 4f). Clearly, the weakness of sentinel distributions different from the uniform one in Fig. 4 is more visible when the data distribution (and therefore the sentinel distribution) is skewed (i.e., α = 7). When the data distribution is not skewed (i.e., α = 0 and α = 1), data will tend to be more uniformly distributed, and therefore strategies P_S = P_D (Figs. 4a and 4b) and P_S = P_D^(−1) (Figs. 4c and 4d) for sentinel distribution will resemble the uniform one. Intuitively, the strength of the uniform distribution for sentinels is that it does not make any class (and hence any correct guess for omitted jobs) more likely than another, and therefore defeats possible opportunistic behavior: the worker does not know on which class it is best to bet for its omitted jobs, and any possible strategy it could adopt would suffer from a low probability of hitting a correct guess (as visible in the first row of Table 2, uniform distribution for sentinels being a Zipf's law with α = 0).

Multiple Workers
The analysis above has considered an individual worker, in line with the fact that each sentinel controls the behavior of a single worker. The effectiveness of sentinel control on the overall system is simply the combination of the controls over the different workers. The probability of possible lazy behavior in the overall system going undetected (i.e., the probability that all workers pass sentinel control) is then:

p_pass_sentinels = Π_{i=1..w} p_pass_sent_i

where p_pass_sent_i is the value of p_pass_sent computed with the specific values of j jobs, s sentinels, and o omissions considered for worker w_i. When the parameters are the same for all workers, p_pass_sentinels reduces to (p_pass_sent)^l, with l the total number of lazy workers. In fact, p_pass_sent_i = 1 for workers that are either honest (i.e., o = 0) or that receive no sentinels (i.e., s = 0).
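The system-wide probability is a plain product over the per-worker values; a minimal sketch (function name ours), where honest workers and workers without sentinels contribute a factor of 1:

```python
def p_pass_sentinels(per_worker: list[float]) -> float:
    # System-wide probability that every worker passes sentinel control:
    # the product of the per-worker probabilities p_pass_sent_i.
    result = 1.0
    for p in per_worker:
        result *= p
    return result

# Two lazy workers with p_pass_sent = 0.21 each; the others contribute 1.0,
# so the system-wide probability reduces to (p_pass_sent) ** l with l = 2.
print(p_pass_sentinels([0.21, 1.0, 0.21, 1.0]))
```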
Note that, since the client does not know or suspect which workers are honest and which are lazy, sentinels are uniformly distributed among all the workers. The reason for some workers receiving no sentinels might be that the total number of sentinels injected overall is smaller than the number of workers. In fact, as we will see in Section 6, even a few sentinels covering just a few workers provide adequate effectiveness when combined with twins, with no need to distribute sentinels to every single worker. We also note that sentinels, operating independently on different workers, are by definition not exposed to collusion. As a final remark, we note that the total number of sentinels injected might not be sufficient to cover all the classes in the data distribution. In this case, maintaining the idea of distributing sentinels as uniformly as possible among the different classes, the best client strategy is to randomly select the classes to which sentinels are assigned.

TWIN ANALYSIS
We now analyze the effectiveness of integrity assessment through twins. We recall that: the term twin set refers to the set of r (virtual) replicas of a data item, T is the overall number of twin sets generated by the client, and t is the number of twins assigned to each worker. With uniform distribution of twins to workers, we have t = (T · r)/w. For simplicity of the formulas, without loss of generality, we assume T · r to be a multiple of the number w of workers and hence each worker to receive t twins. (We relaxed such an assumption in computing the results for plotting the curves of our analysis and in our simulations.) We first evaluate the probability of workers going undetected by twin control when omitting jobs (Section 5.1). With such probability turning out to depend on the replication factor and the number of twin sets, we then analyze possible strategies for twin generation and allocation (Section 5.2). We first focus on a single twin set because each twin set works independently in providing integrity guarantees. We then generalize the results to the consideration of multiple twin sets.

Probability Analysis
Workers to which data items in a twin set are assigned pass integrity control by the twin set if they all return the same result for the items in the set (twin jobs). This happens in the following cases: 1) all workers actually perform the twin job (and hence return the correct result for it); 2) some of the workers omit the twin job but correctly guess its result; or 3) all workers omit the twin job but they all return the same result for it. Let us then analyze the probability of lazy workers to be involved in the evaluation of a twin set and to omit the twin. We then evaluate the probability of workers involved in a twin set to return the same result, when some, or all, of them are lazy.

Probability of lazy workers in the twin set. The first parameter influencing twins effectiveness is the probability of lazy (versus honest) workers to be involved in the evaluation of a twin set. In fact, honest workers, by always returning a correct response, increase the exposure of possible misbehaviors by others. The probability of having lazy workers in the twin set depends on the number of lazy workers in the system and on the replication factor (i.e., the cardinality of the twin set). With a reasoning similar to the one followed for sentinel omissions, it is easy to see that the probability of having l_T lazy workers involved in a twin set, assuming w workers in the system, l of which are lazy, and a replication factor of r, follows a hypergeometric distribution. Formally:

p_lazy(l_T) = \binom{l}{l_T} · \binom{w − l}{r − l_T} / \binom{w}{r}

Probability of omitting twins. Suppose that l_T out of the r workers in charge of evaluating a twin set are lazy. The probability that a lazy worker omits a twin job is o/j (i.e., the number of omissions divided by the number of jobs assigned to the worker). Since omissions by different workers are independent events, probability (o/j)^i · (1 − o/j)^{l_T − i} corresponds to the probability that i, with 0 ≤ i ≤ l_T, given workers omit the twin job and l_T − i workers do not omit it.
Given l_T lazy workers, there are \binom{l_T}{i} possible combinations of i out of l_T lazy workers and therefore the probability that any subset of i lazy workers omits the twin job in the same twin set is:

p_omit_twin(i, l_T) = \binom{l_T}{i} · (o/j)^i · (1 − o/j)^{l_T − i}

Probability of Same Classification. In the evaluation of the probability of all workers to return the same classification, we distinguish the case where such classification is the correct one for the twin job from the case where it is a wrong one (but consistently returned by all workers). The probability of a worker to correctly guess the class of an omitted job is \sum_{k=1}^{c} P_O(c_k) · P_D(c_k). The probability that i, with 0 ≤ i ≤ l_T, workers guess the correct class of the omitted twin job is then:

p_guess_twin(i) = \sum_{k=1}^{c} P_D(c_k) · P_O(c_k)^i

With a similar reasoning, the probability that i, with 0 ≤ i ≤ l_T, workers omitting a job return the same (but wrong) result is:

p_same_wrong(i) = \sum_{k=1}^{c} (1 − P_D(c_k)) · P_O(c_k)^i

Table 3 illustrates the probability of workers omitting the job to return the same class (correct for p_guess_twin and wrong for p_same_wrong) assuming different probability mass functions for genuine data P_D and for omitted jobs P_O. Note that, in the extreme case where a twin set is assigned to workers that are all lazy and all of them omit the job, if workers are deterministic in their classification of omissions (e.g., classifying all items in the most frequent class in P_D), the misbehavior will not be discovered, since all workers will agree on their result (regardless of whether it is correct or wrong). This is visible in the table where, for P_O = [1,0,0,0], p_guess_twin(10) + p_same_wrong(10) = 1, regardless of the distribution of P_D. Table 3 is analogous to Table 2 for sentinels. Note, however, that rows in Table 2 report the distribution of sentinels, while rows in Table 3 report the distribution of the data themselves. In fact, since twins are randomly selected in the data population, unlike for sentinels, their distribution cannot be regulated.
For the randomness of the process, the distribution of twins can be considered to follow the distribution of the genuine data.

Probability of passing twins control. The probability of a twin set to show the same result from all workers, and therefore not to signal any violation (even in the presence of lazy workers and omissions), can be obtained by applying the total probability theorem to the probabilities discussed above. More precisely, it is the sum of: 1) the probability of having all workers involved in the twin set return the correct result, even when some workers are lazy and omit the twin job, and 2) the probability of all workers involved in the twin set being lazy, omitting the twin job, and returning the same (wrong) result. The first probability is the sum, over all possible values of l_T, of the probability of having l_T lazy workers involved in the twin set multiplied by the probability of i of them omitting the job and guessing the correct result (for all possible values of i). The second probability is the product of the probabilities of the different events concurring to it. Therefore, the probability of observing the same result for all jobs in a twin set is:

p_consistent = \sum_{l_T=0}^{\min(l,r)} p_lazy(l_T) · \sum_{i=0}^{l_T} ( p_omit_twin(i, l_T) · p_guess_twin(i) ) + p_lazy(r) · p_omit_twin(r, r) · p_same_wrong(r)

Note that the sum terminates at min(l, r) since a twin set cannot include more lazy workers than either the number of lazy workers assumed to be part of the system or the number of replicas in the twin set. When T twin sets are used, the probability, denoted p_pass_twins, of twins not signaling any violation is:

p_pass_twins = p_consistent^T
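As an illustration, the probabilities of this section can be computed with a short Python sketch. Function and parameter names mirror our notation and are otherwise our own choice; the expressions for p_guess_twin and p_same_wrong follow the formulas given above.

```python
from math import comb

def p_lazy(lT, w, l, r):
    # Hypergeometric: lT of the r workers of a twin set are lazy,
    # drawing from w workers of which l are lazy
    return comb(l, lT) * comb(w - l, r - lT) / comb(w, r)

def p_omit_twin(i, lT, o, j):
    # Binomial: i of the lT lazy workers omit the twin job,
    # each omitting with probability o/j
    return comb(lT, i) * (o / j) ** i * (1 - o / j) ** (lT - i)

def p_guess_twin(i, PD, PO):
    # All i omitting workers return the correct class of the twin job
    return sum(pd * po ** i for pd, po in zip(PD, PO))

def p_same_wrong(i, PD, PO):
    # All i omitting workers return the same, but wrong, class
    return sum((1 - pd) * po ** i for pd, po in zip(PD, PO))

def p_consistent(w, l, r, o, j, PD, PO):
    # Probability that a single twin set shows no inconsistency
    total = sum(
        p_lazy(lT, w, l, r)
        * sum(p_omit_twin(i, lT, o, j) * p_guess_twin(i, PD, PO)
              for i in range(lT + 1))
        for lT in range(min(l, r) + 1)
    )
    return total + p_lazy(r, w, l, r) * p_omit_twin(r, r, o, j) * p_same_wrong(r, PD, PO)

def p_pass_twins(T, w, l, r, o, j, PD, PO):
    # T twin sets operating independently
    return p_consistent(w, l, r, o, j, PD, PO) ** T
```

As a sanity check, with no omissions (o=0) the formula yields 1, and with all workers lazy, all jobs omitted, and a deterministic P_O it also yields 1, capturing the undetectable extreme case discussed above.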

Twin Generation Strategy
The client can operate on two factors when injecting twins control: the number of twin sets (T) and the replication factor (r), with the two being also related to the number of twin jobs assigned to each worker (t · w = T · r). Indeed, the other parameters in the formula of p_pass_twins are not under the control of the client. Clearly, the more twins a worker receives, the stronger the control to which the worker is subject. But, assuming willingness of the client to pay a load of T · (r − 1) additional jobs (twin jobs in addition to the original ones), is it better to have a larger T or a larger r? For instance, would it be better to have three twins for the same data item (T=1, r=3) or two twins of two data items (T=2, r=2)? Each of the two strategies would bring an equal additional load to the system (two additional jobs to be allocated), but which one is more effective?
Looking at the formula of p_pass_twins, and the dependency between the variables, it is easy to see that p_pass_twins decreases exponentially with the increase of the number T of twin sets (which is larger for smaller values of r). It decreases instead very slowly with the increase of the number r of replicas, as visible in Fig. 6a. The figure shows how the probability that an omission of o=50 jobs goes undetected, assuming P_D with a=1, w=100, and l=49 (i.e., just a slight majority of the workers is honest), varying the additional jobs inserted as twins (i.e., T · (r − 1)) in the two extreme cases: 1) only one twin set is considered (i.e., T=1) while the number of replicas varies; 2) only one additional replica per item is considered (i.e., r=2) and the number of twin sets varies. The figure confirms the observation above. In particular, the curve for T=1 shows an asymptotic behavior, tending to P_max, which is the probability of the most frequent class in P_D (P_max=0.48 in the figure). In fact, with probability P_max, the jobs in the twin set belong to the most frequent class, in which case the workers correctly guess the class of omitted twin jobs. Such a correct guess then happens with probability P_max.
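The T-versus-r comparison can also be checked empirically. The following Monte Carlo sketch (a simplification, assuming each lazy worker omits any given job with probability o/j and deterministically returns the most frequent class for omissions; the job count j=100 here is our own illustrative choice) estimates the probability of passing twins control for two allocations with equal overhead:

```python
import random

def simulate_pass(T, r, w, l, o, j, PD, trials=20000, seed=0):
    # Monte Carlo estimate of the probability that none of T twin sets
    # signals a violation; workers 0..l-1 are lazy and, when omitting,
    # return the most frequent class of PD
    rng = random.Random(seed)
    classes = range(len(PD))
    guess = max(classes, key=lambda k: PD[k])
    passed = 0
    for _ in range(trials):
        ok = True
        for _ in range(T):
            true_class = rng.choices(classes, weights=PD)[0]
            for worker in rng.sample(range(w), r):   # workers of the twin set
                lazy = worker < l
                if lazy and rng.random() < o / j:
                    result = guess        # omitted: opportunistic guess
                else:
                    result = true_class   # processed (or lucky guess skipped)
                if result != true_class and guess != true_class:
                    pass
            results = set()
            for worker in rng.sample(range(w), r):
                if worker < l and rng.random() < o / j:
                    results.add(guess)
                else:
                    results.add(true_class)
            if len(results) > 1:
                ok = False
        if ok:
            passed += 1
    return passed / trials

PD = [0.48, 0.24, 0.16, 0.12]   # Zipf's law, a=1, c=4
two_sets = simulate_pass(T=2, r=2, w=100, l=49, o=50, j=100, PD=PD, seed=1)
one_set = simulate_pass(T=1, r=3, w=100, l=49, o=50, j=100, PD=PD, seed=2)
# With the same two extra jobs, T=2, r=2 leaves a lower escape probability
```

Under these parameters, the T=2, r=2 configuration passes less often than T=1, r=3, consistent with the analytical observation.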
In summary, best effectiveness is achieved by keeping the replication factor at its minimum (r=2) while increasing the number of twin sets (i.e., twinning different data items). Of course, care must be taken in the allocation of twins to workers, to guarantee that the cross-checks provided by twins do not partition workers, dividing lazy from honest workers. Such a condition can be easily ensured by assigning at least one twin to every worker, uniformly distributing twins between workers, and allocating twin sets to cover different combinations of workers. Good coverage of the different combinations also provides resilience against collusion. In this respect, we note that even if a fraction 1/x of the workers collude, only the twin sets fully covered by the colluding group become ineffective, roughly less than 1/x^2 of the twin sets, hence leaving the lazy workers still exposed to the control of the other twin sets touching workers outside the clique (see Section 7).
The effectiveness of twins depends also on the probability mass function P_D and on the strategy adopted by the workers for classifying the omitted jobs (i.e., P_O). As already noted, since twins are selected randomly among the data items, the client cannot dictate the distribution of twins into classes as done for sentinels and, given the randomness of the process, twins can be expected to follow the same distribution P_D as the original data. Hence, the best strategy for lazy workers behaving opportunistically, with the aim of going undetected in twin omissions, is to classify omitted jobs in the most frequent class of the data distribution. This observation implies, unfortunately, determinism in the opportunistic behavior of lazy workers, and hence their equal classification of omitted twin jobs, which can make twins control ineffective when the data distribution is highly skewed (as the most frequent class, which lazy workers assign to omitted twin jobs, has a high probability of being the correct one, hence matching the result returned by honest workers) or when all the workers are lazy and their omissions extreme (as they will all omit twins, consistently classifying them in the same class). Figs. 6b and 6c illustrate the effectiveness of twins as well as their limitations in such situations. Fig. 6b shows p_pass_twins varying the number t of twins between 0 and 60 and assuming r=2, w=10 workers (4 of which are lazy), o=20 omitted jobs by each lazy worker, and j=200 jobs assigned to each worker. While quickly effective for Zipf's data distributions with a=0 and a=1, twins fail to be effective for a=7. Fig. 6c considers a scenario with w=10 workers each in charge of 200 jobs, 10 of which are twins (with r=2 and T=50), and a distribution of data expressed by a Zipf's law with a=1. It shows how probability p_pass_twins changes varying the number o of omitted jobs, assuming the number l of lazy workers to be 4, 7, or 10 (i.e., 40%, 70%, or 100%).
Again, while twins are quickly effective even when the majority of workers (70%) is lazy and their omissions extreme (i.e., approaching all 200 jobs), they cannot help in detecting such extreme omissions when all workers are lazy. In this case, none of the workers performs the work and all of them consistently return the most probable class (i.e., the most frequent in the data distribution).

COMBINING SENTINELS AND TWINS
In the previous sections, we have discussed sentinels and twins independently. When both techniques are used, the overall integrity guarantee is given by their combination. Hence, the probability of workers to pass integrity control becomes:

p_undetected = p_pass_sentinels · p_pass_twins

We now investigate how the two techniques should be used to provide best effectiveness. Our starting point is represented by the lessons learned from the analysis in the previous sections, which can be summarized in the following three observations. First, the best strategy for the client to maximize the effectiveness of control is to distribute sentinels uniformly among the different classes, and to use a replication factor r of 2 for twins. Second, the best strategy for opportunistic lazy workers to maximize the probability of going undetected when omitting jobs is to classify omissions in the most frequent class of the data distribution (i.e., the class with probability P_max). Third, twins are, in general, more effective than sentinels (roughly twice as effective, since with one additional job the behavior of two workers is controlled). However, twins fall short when the data distribution is highly skewed (i.e., P_max is high), given the high probability of omitted jobs to be classified correctly. Also, twins lose effectiveness when omissions are extreme for all workers (since all lazy workers would return the same result for the twin jobs assigned to them which they omit to compute).
With the analysis in the previous sections enabling us to capture integrity guarantees (in terms of p undetected), our challenge is first to determine the value of P max where sentinels take over in terms of effectiveness with respect to twins and second to determine how to counteract extreme omissions injecting some sentinels in scenarios where twins are more effective.

Threshold Value of P max
The value of P_max for which either twins or sentinels are expected to be more effective depends on two factors: the presence of lazy workers and the percentage of omissions. Identifying whether the P_max of a given scenario falls below or above such a threshold would, in principle, require plotting the curves of p_pass_sentinels and p_pass_twins for the given scenario and observing the value of P_max at which one becomes higher than the other. Leveraging the three observations above and some simplifying assumptions that enable analytical treatment, we instead obtain an analytical formula for the derivation of the threshold value of P_max indicating whether twins or sentinels are more effective.
As for sentinels, where p_pass_sentinels = \prod_{i=1}^{w} p_pass_sent_i over the different workers, we simplify p_pass_sent by treating the omission of sentinels as sampling with replacement. Sampling with replacement means that the sample values are independent. In our scenario, this means that the omission of a job does not affect subsequent omissions: when a job is omitted, it is put back in the set of jobs, and subsequent omissions operate on the complete set of jobs. Hence, we consider, for each worker, the formula for one sentinel and raise it to the number of sentinels assigned to the worker. The imprecision introduced by this simplification is negligible and has the advantage that the formula exhibits a clear analytical structure. The overall product over the different workers then equates to raising the p_pass_sent for a single sentinel to the total number of sentinels in the system, that is, p_pass_sentinels = p_pass_sent^{s_tot}. For a single sentinel (s = 1), the formula for p_pass_sent of Section 4.1 gives us:

p_pass_sent = \sum_{o_s=0}^{1} p_omit_sent(o_s) · (p_guess_sent)^{o_s} = p_omit_sent(0) + p_omit_sent(1) · p_guess_sent

where p_guess_sent = 1/c derives from injecting in the formula of Section 4.1 the observations above about the client distributing sentinels uniformly and workers opportunistically returning the most frequent class for omitted jobs.
As for twins, where p_pass_twins = p_consistent^T, we can rewrite the formula for p_consistent applying the observations above on the replication factor (i.e., r = 2) and considering opportunistic behavior by lazy workers for omitted jobs (i.e., they consistently classify the omitted jobs in the most frequent class of the data distribution). As a simplification, we consider the worst case scenario of both workers involved in the twin set being lazy (i.e., l_T=2) and, for the single twin set, assume them to be the only workers in the system (i.e., w=2). The exponentiation to T will take care of the inclusion of the other workers. Basically, our simplification assumes twin sets to operate independently (covering disjoint pairs of lazy workers). The formula for p_consistent of Section 5.1 then becomes:

p_consistent = \sum_{i=0}^{2} p_omit_twin(i, 2) · p_guess_twin(i) + p_lazy(2) · p_omit_twin(2, 2) · p_same_wrong(2) = (1 − o/j)^2 + 2 (o/j)(1 − o/j) P_max + (o/j)^2 ≈ 1 − 2 (o/j)(1 − P_max)

where the last approximation removes the least significant terms since, assuming o ≪ j, the term (o/j)^2 becomes negligible. Let us then compare the formulas obtained above to evaluate when sentinels are more effective than twins, that is, when the probability of passing sentinel controls is lower than the one of passing twin controls (p_pass_sentinels < p_pass_twins). Of course, the comparison is to be made assuming the same overhead for the client in terms of additional jobs to be inserted, hence considering an equal number of sentinels and twin sets, that is, s_tot = T (as each of them requires an additional job to be injected in the system). With s_tot = T, the exponents at both sides can be discarded and the comparison reduces to checking when p_pass_sent < p_consistent, that is, when:

1 − (o/j)(1 − 1/c) < 1 − 2 (o/j)(1 − P_max)

which gives:

P_max > (1/2) · (1 + 1/c)

In summary, our analysis tells us that, for each scenario, either sentinels or twins should be used as control jobs, depending on the value of P_max characterizing the scenario. When c = 2, which is the lowest value c can assume, twins (sentinels, resp.) should be used if P_max is lower (higher, resp.) than or equal to 0.75. As the number of classes grows, the value of P_max at which sentinels are more effective than twins decreases, reaching 0.50 as the number of classes becomes very large (and hence 1/c negligible). The formula above also tells us that when P_max is not higher than 0.50, twins are always more effective than sentinels.
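A minimal sketch of this comparison, using the simplified expressions derived above (function names are ours):

```python
def p_pass_sent_single(o, j, c):
    # Simplified single-sentinel pass probability: the sentinel is not
    # omitted, or it is omitted and its class guessed with probability 1/c
    return (1 - o / j) + (o / j) / c

def p_consistent_approx(o, j, P_max):
    # Simplified twin-set pass probability (r = 2, both workers lazy,
    # (o/j)^2 terms neglected)
    return 1 - 2 * (o / j) * (1 - P_max)

def pmax_threshold(c):
    # Sentinels overtake twins when P_max exceeds (1 + 1/c) / 2
    return 0.5 * (1 + 1 / c)
```

For instance, with c=4 classes the threshold is 0.625: above it a sentinel passes less often than a twin set (sentinels preferable), below it the opposite holds.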
The number of control jobs to be injected depends on the aimed integrity guarantees, expressed in terms of p_undetected, and can be simply obtained from our formulas. Basically, given a client's established threshold ε, p_undetected < ε can be guaranteed by employing either T > log(ε)/log(p_consistent) twin sets or s_tot > log(ε)/log(p_pass_sent) sentinels (Section 6.3 will elaborate more on this). A note aside is to be made for scenarios where twins are more effective than sentinels (i.e., P_max < (1/2) · (1 + 1/c)) with respect to extreme omissions, which make twins lose effectiveness. Luckily, the different nature of the two controls makes sentinels extremely effective in such scenarios, and adding a handful of sentinels when using twins suffices to detect extreme omissions, as we see next.
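The smallest number of control jobs meeting a target ε follows directly (a sketch; the function name is ours):

```python
from math import log

def controls_needed(eps, p_pass_one):
    # Smallest integer n such that p_pass_one**n < eps, where n is the
    # number T of twin sets or s_tot of sentinels (one extra job each)
    return int(log(eps) / log(p_pass_one)) + 1
```

For example, with a per-control pass probability of 0.9, driving p_undetected below 10^{-10} takes 219 control jobs.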

Extreme Conditions
Extreme conditions refer to (quite unlikely) scenarios where the omissions by all the workers in the system are considerable, approaching the total number of jobs.
The most extreme case, where all jobs are omitted (i.e., all workers omit all the jobs), is easy to analyze. In such a case, p_pass_twins = 1 (since all omissions will be classified in the most frequent class of the data distribution). Also, since all jobs are omitted (i.e., o=j and o_s=s), p_pass_sentinels reduces to (1/c)^{s_tot} (i.e., p_pass_sent = 1/c). In other words, a number of sentinels s_tot > −log(ε)/log(c) is sufficient to maintain p_undetected < ε, guaranteed by twins, also when all workers omit all their jobs. The analysis of generic extreme conditions where workers omit most, but not all, of their jobs is much more complex. To tackle it, we consider the worst case scenario, modeling the whole set of workers as a single worker. We also assume workers to know the total number s_tot of sentinels and the number T of twin sets, and therefore to be able to estimate the number E of jobs to elaborate to maximize the probability that their extreme omissions go undetected. Providing protection against the worst case scenario clearly gives the strongest protection, also maintained in situations where such information is not known and workers operate independently in their opportunistic behaviors. When workers omit most of the jobs, the omissions go undetected if the workers process all the sentinels and do not process any of the effective twins. The probability, called p_extreme, that, when processing only E out of the total number J of jobs in the system while omitting the other O = J − E jobs, all the sentinels are processed and all the effective twins are omitted is:

p_extreme(E) = (E/J)^{s_tot} · (1 − (s_tot + 2T(1 − P_max))/J)^{E − s_tot}

where (E/J)^{s_tot} is the probability of processing all the sentinels and (1 − (s_tot + 2T(1 − P_max))/J)^{E − s_tot} is the probability that, among the E − s_tot remaining processed jobs, twin jobs that do not fall in the most frequent class are all omitted.
The rationale is that, among the 2T twin jobs, only the 2T(1 − P_max) twin jobs that do not fall in the most frequent class are effective in signaling omissions (for twins that fall in the most frequent class, omissions cannot be distinguished from processing, as they produce the same result).
To determine the maximum value of p_extreme, denoting with E* the value of E reaching it, we observe that, to process as few jobs E as possible while minimizing the risk of being detected, opportunistic lazy workers need to aim for the lowest E such that p_extreme(E) > p_extreme(E+1). Considering the formula of p_extreme, this translates to:

((E+1)/E)^{s_tot} · (1 − (s_tot + 2T(1 − P_max))/J) < 1

which, approximating ((E+1)/E)^{s_tot} with 1 + s_tot/E, corresponds to:

s_tot/E < (s_tot + 2T(1 − P_max))/J

giving:

E > s_tot · J / (s_tot + 2T(1 − P_max))

Since the number s_tot of sentinels to be added is considerably smaller than the number T of twin sets, and the number J of jobs is large (especially when compared with the total number of sentinels and twin sets), the value E* that maximizes p_extreme(E) can be approximated to:

E* ≈ s_tot · J / (2T(1 − P_max))

The probability of workers to be undetected when processing only E* jobs is then:

p_extreme(E*) = (s_tot / (2T(1 − P_max)))^{s_tot} · (1 − (s_tot + 2T(1 − P_max))/J)^{E* − s_tot}

where the second factor, applying (1 + y/x)^x ≈ e^y, can be approximated as e^{−s_tot}. The maximum value of p_extreme can then be approximated as:

p_extreme(E*) ≈ (s_tot / (2T(1 − P_max)))^{s_tot} · e^{−s_tot}

The validity of this formula has been verified using simulations in a variety of configurations.
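The derivation can be checked numerically; in the sketch below (function names are ours), E_star and p_extreme_max implement the two approximations above, and sentinels_for_extreme searches for the smallest s_tot meeting a target ε':

```python
from math import exp

def p_extreme(E, J, s_tot, T, P_max):
    # Probability of processing all sentinels while hitting none of the
    # effective twin jobs, when processing only E of the J jobs
    k = s_tot + 2 * T * (1 - P_max)
    return (E / J) ** s_tot * (1 - k / J) ** (E - s_tot)

def E_star(J, s_tot, T, P_max):
    # Approximate number of processed jobs maximizing p_extreme
    return s_tot * J / (2 * T * (1 - P_max))

def p_extreme_max(s_tot, T, P_max):
    # Closed-form approximation of the maximum of p_extreme
    return (s_tot / (2 * T * (1 - P_max))) ** s_tot * exp(-s_tot)

def sentinels_for_extreme(eps, T, P_max):
    # Smallest s_tot keeping the extreme-omission escape probability below eps
    s = 1
    while p_extreme_max(s, T, P_max) >= eps:
        s += 1
    return s
```

With J = 10^6, T = 50000 (5% twin jobs), and P_max = 0.25, the approximation places E* at roughly 27 processed jobs, and two sentinels suffice to push the escape probability below 10^{-10}, in line with the figures reported in Section 6.3.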
For guaranteeing a probability of extreme omissions going undetected lower than ε' (where ε' could be equal to, or different from, the ε required for p_undetected), the number s_tot of sentinels to inject to maintain the protection guarantees provided by twins can then be derived by instantiating the formula above with the P_max of the considered scenario and T computed as illustrated at the end of Section 6.1. Again, as we see next, a handful of sentinels overall suffices.

Integrity Guarantees
The analysis above enables us to evaluate the effectiveness of integrity controls for realistic scenarios characterized by large workloads. Indeed, Sections 4 and 5 considered a relatively small workload (J = 2000 and w = 10), with the goal of understanding the behavior of the techniques rather than demonstrating their effectiveness. We have then evaluated p_undetected in configurations characterized by different numbers of jobs and varying: i) the percentage of control jobs (from 5% to 25%); ii) the percentage of omitted jobs (1%, 0.1%, and 0.01%); and iii) the probability P_max of the most frequent class (0, 0.25, and ≥0.50). The value of P_max determined also the kind (twin sets versus sentinels) of control jobs to be injected. As we assumed a large number of classes, P_max = 0.50 is the threshold where twins hand over to sentinels for effectiveness, and therefore the values exhibited for the case P_max ≥ 0.50 assume the use of sentinels, noting that the case P_max = 0.50 applies to twins as well. Also, since the effectiveness of sentinels does not depend on P_max (given that sentinels are uniformly distributed), the exhibited values for the case P_max ≥ 0.50 hold for all P_max > 0.50. Tables 4 and 5 show the probability of omissions to go undetected for two scenarios characterized by 10^6 and 10^8 jobs, respectively, varying the different parameters. For instance, just injecting 5% of additional control jobs provides a p_undetected (i.e., the probability of omissions to go undetected) varying from 4.59e-435 to 6.78e-218 (depending on P_max) when omissions are up to 0.01% of the 10^8 jobs. Clearly, increasing the amount of controls, or facing larger numbers of omissions, makes such a probability even lower (with a value of 7.96e-219349 when 25% control jobs are added and omissions hit 1% of the jobs).
Similarly, increasing the overall number of jobs leads, for the same percentage increase in integrity cost, to an improvement in the effectiveness of integrity verification (with 10^6 jobs and 5% additional control jobs, p_undetected varies from 4.54e-05 to 6.74e-03 when omissions are up to 0.01%). Table 6 shows the (impressively low) absolute maximum number of jobs that workers can afford to omit while maintaining a probability of being undetected higher than 10^{−10}. For instance, even when just 5% control jobs are injected, workers will be detected with probability greater than 1 − 10^{−10} if omissions are more than 231, 308, or 461 (depending on P_max).
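These orders of magnitude can be reproduced with the simplified expressions of Section 6.1, working in log-space to avoid floating-point underflow (a sketch; function names are ours):

```python
from math import log10

def log10_p_pass_twins(T, o_over_j, P_max):
    # log10 of p_consistent**T, with p_consistent ~ 1 - 2(o/j)(1 - P_max)
    return T * log10(1 - 2 * o_over_j * (1 - P_max))

def log10_p_pass_sentinels(s_tot, o_over_j, c):
    # log10 of p_pass_sent**s_tot, with p_pass_sent ~ 1 - (o/j)(1 - 1/c)
    return s_tot * log10(1 - o_over_j * (1 - 1 / c))
```

With 10^8 jobs, 5% control jobs (T = s_tot = 5·10^6), and omissions at 0.01% (o/j = 10^{-4}), twins at P_max = 0 give log10(p_pass_twins) ≈ −434.3 and sentinels with a large number of classes give ≈ −217.2, approximately matching the 4.59e-435 and 6.78e-218 of the tables.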
For evaluating the number of sentinels to be injected to provide protection against extreme omissions when using twins (i.e., when P_max ≤ 0.50), we imposed p_extreme < 10^{−10}. The number of sentinels varied from a maximum of 5 (for a scenario with 10^5 jobs) to just 1 (for scenarios with 10^12 jobs or above). The number of sentinels was either 2 or 3 for the scenario with 10^6 jobs and consistently 2 for the scenario with 10^8 jobs, for all the configurations in Tables 4 and 5.

WORKERS COLLUSION
In the previous sections we have assumed workers not to collude. Collusion may occur if the workers appear to be independent but are all executing under the control of the same party. As we discuss in this section, the techniques are resilient to collusion. Indeed, it is sufficient to take into consideration the fact that, in the presence of collusion, some of the control jobs (twins) can become ineffective. Sentinels, being unique, are clearly not affected by collusion, since no worker will have a sentinel in common with another worker. By contrast, twins are exposed to being recognized as such by colluding workers, which can agree on the same (not computed) response to be returned for the jobs so as to go undetected in their omission. While noting that, if the characteristics of the computation allow it, the ability of colluding workers to recognize twins can be counteracted by making twins different from one another (e.g., for image recognition, altering the twinned copies in some way without altering the result of the computation), we assume common twins to be completely recognizable by colluding workers. In the following, we first investigate how the effectiveness of twins changes and then discuss how colluding workers can be discovered.
Analysis. Assume that l_C of the l lazy workers (of the total of w workers) collude, forming a coalition. Considering a replication factor r = 2, a twin set can be recognized by the colluding workers only if both of its jobs are allocated to them (while twin sets for which at least one of the jobs is outside the coalition cannot be recognized as such). The probability of a twin set to be recognized as such by a coalition is then:

p_twin_coalition = \binom{l_C}{2} / \binom{w}{2}

corresponding to a number of twin sets T_C under the control of the coalition equal to T_C = T · p_twin_coalition. For instance, if a coalition controls half of the workers, around 1/4 of the twin sets will be controlled; this means that 3/4 of the twin sets will still be effective.
Colluding workers can opportunistically behave on the discovered twins by omitting them and agreeing on the response to be returned for each twin set (while omitting the work). The formulas in Sections 4 and 5 then continue to hold by removing, for lazy workers, the number of twinned jobs under the control of the coalition from both the number o of omitted jobs and the total number j of jobs. Assuming twins to be uniformly distributed among workers, on average each colluding worker is assigned avgjobs = 2T_C / l_C of the 2T_C twinned data items allocated to workers in the coalition. Hence, in the formalization presented in Sections 4 and 5, o becomes, for lazy workers, max(o − avgjobs, 0) and j becomes j − avgjobs. Fig. 7 illustrates the probability p_pass_twins obtained through 10000 Monte Carlo simulations varying the percentage of colluding workers with respect to lazy workers, assuming w=100 workers of which l=49 are lazy, j=200 jobs, T=20 twin sets, r=2, and a=1. As visible from the figure, p_pass_twins grows with the number of lazy workers in the coalition. However, the growth is smooth and less than linear.
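The adjustment just described amounts to a few lines (a sketch; the function name is ours):

```python
from math import comb

def coalition_adjustment(w, l_C, T, o, j):
    # Expected number of twin sets fully inside a coalition of l_C
    # of the w workers, for replication factor r = 2
    p_twin_coalition = comb(l_C, 2) / comb(w, 2)
    T_C = T * p_twin_coalition
    avgjobs = 2 * T_C / l_C          # recognized twin jobs per colluder
    # adjusted (o, j) to plug into the formulas of Sections 4 and 5
    return max(o - avgjobs, 0), j - avgjobs
```

For instance, a coalition of 50 out of 100 workers controls about a quarter of the twin sets, slightly reducing the o and j seen by the formulas.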
Colluding Workers Identification. To identify colluding workers in the presence of a single coalition that includes all the l lazy workers, we exploit the possible inconsistent results returned for twin sets that are not fully covered by colluding workers. Indeed, inconsistent results may only happen when a twin set is processed by a worker in the coalition (lazy worker) and by a worker outside the coalition (honest worker). The basic idea of the approach for discovering lazy workers consists in modeling the inconsistent results returned for twin sets as a graph G where there is a node for each worker w_i in the system, and an edge (w_i, w_k) iff ∃d such that w_i, w_k ∈ v(d) and g_i(d) ≠ g_k(d). Fig. 8a illustrates an example of a graph modeling a system with eight workers where, for instance, w_1 and w_2 return a different class for the same data item. For each edge (w_i, w_k) in G, either w_i or w_k is lazy. Indeed, honest workers compute the same (correct) class, while colluding workers return the same (non-computed) class. For instance, considering the graph in Fig. 8a, either w_1 or w_2 is lazy and the other one is honest. Graph G is then bipartite and possibly composed of different connected components.
To identify lazy workers, given graph G, the client first computes the connected components of G. For each connected component with more than one node, the nodes in the component are partitioned into two subsets, V_1 and V_2, in such a way that, for each edge (w_i, w_k) in the component, w_i ∈ V_1 and w_k ∈ V_2. Finally, for each connected component, the client verifies one of the twins corresponding to one edge in the component: the worker whose classification for the verified twin corresponds to the classification computed by the client is considered honest; the other one is lazy. As an example, consider the graph in Fig. 8a. This graph includes three connected components, G_x, G_y, and G_z (Fig. 8b). The single node in G_z is considered honest since it passes all integrity controls. Consider now connected component G_x.
Here, the nodes are partitioned into two subsets: {w_1, w_3} and {w_2, w_4}. The client could then verify the twin between w_1 and w_2: if w_1 is honest, also w_3 is honest (with w_2 and w_4 lazy), and vice versa. Similarly, the nodes in the connected component G_y are partitioned into two subsets, namely {w_5, w_7} and {w_6}, and the client only needs to verify one of the two twins in G_y. Fig. 8c illustrates a possible classification of the workers in the graph in Fig. 8a, where honest workers are green (gray in a b/w print-out) and lazy workers are red (bold in a b/w print-out). After having identified the lazy workers, the client can discard the results of the jobs assigned to them and avoid paying them for their services. The jobs in the workload of lazy workers can be assigned to honest workers for their evaluation.
The cost, for the client, of identifying lazy workers is linear in the number of twin sets that do not pass the integrity check and requires recomputing a limited number of jobs (one for each connected component in the graph). Considering our example above, the client needs to verify only two out of the five mismatching twins.
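The identification procedure can be sketched as follows (a simplified implementation under the single-coalition assumption; the hypothetical `verify` callback stands for the client recomputing one shared twin and returning True when the first worker's answer is correct):

```python
from collections import defaultdict

def identify_lazy(workers, mismatches, verify):
    # mismatches: edges (wi, wk) of graph G, i.e., worker pairs that
    # returned different results for a shared twin
    adj = defaultdict(set)
    for a, b in mismatches:
        adj[a].add(b)
        adj[b].add(a)
    lazy, seen = set(), set()
    for start in workers:
        if start in seen or start not in adj:
            seen.add(start)              # isolated nodes pass all controls
            continue
        # 2-coloring of the connected component (G is bipartite)
        color = {start: 0}
        queue, comp = [start], [start]
        while queue:
            u = queue.pop()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    comp.append(v)
                    queue.append(v)
        seen.update(comp)
        # one client-side verification per component settles which side is lazy
        a = start
        b = next(iter(adj[a]))
        honest_color = color[a] if verify(a, b) else color[b]
        lazy.update(v for v in comp if color[v] != honest_color)
    return lazy
```

On the eight-worker example of Fig. 8 (any bipartite edge set consistent with the partitions works), two verifications, one per non-trivial component, expose all lazy workers.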

EXPERIMENTS
To complete our analysis, we have performed experiments simulating a distributed data computation environment comprising different (potentially lazy) workers. The distributed data computation was controlled by a client program invoking generic functions on an arbitrary number of workers randomly behaving as lazy, and introducing a parameterizable number of twins and sentinels. For the experiments, as an instance of a distributed data computation, we wrote a C program that identifies the lowest result of the application of a cryptographic hash function among a set of input values. (With respect to practical use, we note that this computation is analogous to the one executed in the Bitcoin network to produce the blocks in the blockchain.) More precisely, each job has as input 1000 values, and the worker has to apply a cryptographic hash function over them and return the value that yielded the smallest result. Workers behaving lazily were assumed to opportunistically return, for an omitted job, the lowest value in the job's input.
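A Python analogue of such a job and of the opportunistic shortcut (SHA-256 is our choice here; the paper's C implementation may use a different hash function):

```python
import hashlib

def process_job(values):
    # Honest worker: hash every input value and return the value
    # whose digest is the smallest
    return min(values, key=lambda v: hashlib.sha256(str(v).encode()).digest())

def lazy_answer(values):
    # Lazy worker: skip the hashing entirely and guess with the
    # lowest input value
    return min(values)
```

A twin then catches an omission whenever the two workers assigned the same job disagree, and a sentinel whenever the returned value differs from the one pre-computed by the client.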
We ran experiments on a variety of machines (a server with an Intel Xeon W-2135 3.7 GHz CPU with 6 physical cores; a server with 2 Intel Xeon E5-2620 2.1 GHz CPUs and 16 physical cores; a server with 2 Intel Xeon Gold 5118 2.3 GHz CPUs and 20 physical cores). We considered 10^6 jobs, each requiring 1000 computations. Jobs were distributed to w=10 workers, with each worker receiving j=10^5 jobs. Note that, as our analysis has shown, the effectiveness of the control depends on the overall number of jobs (i.e., one million in our case) rather than on the number of workers and their individual workloads. We considered the use of one sentinel and four different configurations for twins, assuming 500, 1000, 2500, and 10000 twin sets, respectively. In terms of percentage, the different configurations injected from 0.05% to 1% twin jobs, evenly distributed across workers. We then assessed the probability for workers to remain undetected when omitting jobs, with an omission ratio (o/j) ranging from 0.001% to 0.18% (corresponding to theoretical values of p_undetected distant from both 0 and 1). Fig. 9 illustrates the results of our experiments. In the figure, the x axis reports the percentage of jobs omitted by workers and the y axis the value of p_undetected. Observations for each twin set configuration are color-coded, with the colors giving the correspondence with the legend (for b/w printouts, we note that the legend and the curves appear in the same order from top to bottom). For each twin set configuration, the continuous line corresponds to the prediction of the model, with the color band describing, for each setup, the 95% confidence interval provided by the model. The starred points are the observations (ratios of undetected omissions) from the experiments. Each of the 20 points in the figure has been obtained through 1000 runs of the different configurations above, with an overall execution time of more than 20 CPU-days per worker.
As visible from the figure, the experiments fully confirm the accuracy of the model (all stars are aligned with the lines resulting from the theoretical analysis) and its ability to offer high integrity guarantees at a small additional cost.
Probabilistic techniques are more flexible than deterministic techniques and are based on the use of integrity checks added to the original data. Several existing probabilistic techniques are based on replication (e.g., [8], [9], [19]), on the injection of fake data into the original data collection (e.g., [4], [5], [6], [20]), or on watermarks (e.g., [7]). Some works have analyzed the trade-off between the integrity guarantees offered by probabilistic techniques and their overhead, by proposing a game-theoretic framework and assuming that the same task can be assigned to multiple workers (e.g., [21], [22]). The goal of these proposals is to design a solution that provides an incentive for the workers to compute the correct result. Other proposals (e.g., [18], [19]) focus on MapReduce computations and propose the adoption of a voting mechanism on the results of replicated tasks. These works analyze how much replication is necessary to provide a given probability of detecting integrity violations. In the database context, some works (e.g., [8]) consider the combination of different probabilistic techniques to verify the integrity of join results. However, they focus on the efficient design of such integrity techniques and do not provide an analysis of how to effectively tune their use to achieve high guarantees at limited cost. Integrity verification techniques have also been developed in the data mining context (e.g., [20]) to support different types of computations. Our work is complementary to the efforts above and provides a novel perspective of investigation.
Our work presents similarities with, and nicely complements, solutions that permit verifying whether a worker correctly computed a function over an input (i.e., the correctness of a single job) at limited cost (e.g., [23]).
Another line of work aims at minimizing the estimation error when the same job is assigned (replicated) to multiple users (e.g., [24], [25]), or at determining whether a user correctly evaluates jobs, so as to optimize job assignment (e.g., [26], [27]). These proposals are complementary to ours, since they aim to determine a strategy for maximizing the probability that the results computed by users are correct. Also, they do not consider the possibility that workers behave opportunistically in returning their results, with the aim of keeping their omissions undetected.

CONCLUSIONS
Probabilistic integrity techniques allow for assessing the integrity of distributed computations performed by possibly untrustworthy workers. If not carefully used, however, such techniques can suffer from limited effectiveness. In this paper, we have focused on two probabilistic techniques, namely sentinels and twins, and provided a model capturing their characteristics and enabling their controlled generation and injection so as to be most effective in achieving integrity guarantees. Our model can then serve as a reference for effective integrity assessment in different application scenarios. Our findings can also enable clients to best frame their distributed processing problem so as to maximize the effectiveness of the control. For instance, since twins are twice as effective as sentinels, consider an image recognition problem with a yes/no answer where no is the most common answer, and hence sentinels should be applied. Such a problem can be transformed into a finer-grained one requesting workers to return, instead of the simple no, the result of a numerical computation on the input (hence spreading the probability mass of the no answer), thus enjoying a low P_max and enabling the use of twins.
Sushil Jajodia (Fellow, IEEE) is University Professor, BDM International Professor, and the founding director of the Center for Secure Information Systems, George Mason University, Fairfax, VA. He has authored or coauthored seven books and more than 500 technical papers in refereed journals and conference proceedings, and edited 53 books and conference proceedings. He is also a holder of 28 patents. He has been a visiting professor with several universities in France, Italy, and the UK. He is a fellow of ACM and IFIP.
Stefano Paraboschi (Member, IEEE) is a professor with the Università degli Studi di Bergamo, Italy. His research interests focus on information security and privacy, Web technology for data-intensive applications, XML, information systems, and database technology. He has been a visiting researcher with Stanford University and IBM Almaden, CA, and George Mason University, VA.
Pierangela Samarati (Fellow, IEEE) is a professor with the Università degli Studi di Milano, Italy. Her main research interests include data protection, security, and privacy. She has published more than 290 papers in journals, conference proceedings, and books. She has been a visiting researcher at Stanford University, CA, SRI International, CA, and George Mason University, VA. She is a fellow of ACM and IFIP.
Roberto Sassi (Senior Member, IEEE) is a professor with the Università degli Studi di Milano, Italy. His research interests mainly focus on the analysis of digital signals, with particular regard to biomedical ones, on interdisciplinary applications, and on biometrics and privacy protection techniques in biometric applications. His results have been published in more than 150 publications.