Detecting Fraudulent Student Communication in a Multiple Choice Online Test Environment

Online evaluation systems, pervasive nowadays, are known to be susceptible to higher fraud risks. This work proposes a novel and robust method to detect potential fraud acts in online multiple-choice question (MCQ) exams. For the first time, the communication probability between the examinees is statistically assessed based on the concordance of responses and answer time against null expectations and is subsequently used to identify potential fraud behavior. The model is sensitive to the direction of communication acts, distinguishing content consumption from production, as well as multiwise communication channels. Online remote tests from engineering courses at Técnico Lisboa are used as a case study. We show that the cumulative contribution of concordant responses between students, when recurrent, offers a way of signaling fraud behavior. Separating content production from consumption reveals the underlying student role played in potential fraud acts. Collusion behavior is assessed against null models of fraud and conformity; it is therefore statistically framed and offers a solid criterion to guide tutors in ascertaining fraud and discouraging communication.


Mariana Carrasco, António Rito Silva, and Rui Henriques
Index Terms: Communication network, fraud detection, multiple choice quiz, online remote evaluation, statistical significance.

I. INTRODUCTION
Over the last decade, alongside the developments in technology [1], [2], came, for students, the possibility to enroll in a wide variety of online courses and, in some colleges, to choose between the traditional face-to-face classes and the computer-based classes. Online courses attained their popularity by providing students the flexibility to work in a self-paced manner and reduce attendance costs [3], [4], [5]. University administrators are motivated to present online content and assessments to ensure a broader student reach [6]. The COVID-19 pandemic converted this possibility into a necessity [7], [8], [9]. However, with the remote way of teaching comes the challenge of unsupervised online testing, shown to yield a higher possibility of fraud [9], [10], [11], challenging the principle of fair evaluation [12].
We define fraud as all forms of illegitimate activities that are aimed at increasing one's assessment performance. These activities include using unauthorized materials, copying, collusion among examinees, acquisition of test contents (also termed preknowledge), impersonation [13], or external assistance from someone who is not taking the test [9]. In this study, the focus is on collusion among examinees, arguably the most common form of online cheating in multiple choice question (MCQ) exams taken at home [14].
Several statistics have been proposed to assess collusion [15], [16], [17], [18], [19], [20], ranging from item response theory to the analysis of response times. Nevertheless, the existing methods generally suffer from three major drawbacks: 1) they assume fixed question orders and reversible answering; 2) they neglect the distinct roles and multiwise cumulative effects of inadvertent content sharing exerted in unauthorized communication platforms; and 3) they do not reliably test the deviation of the acquired behavioral statistics against plausible expectations. As the first comprehensive effort to address these limitations, this work establishes a novel method to detect potential fraud acts based on timestamped answer records from online quizzes with shuffled questions, capturing potential multiwise acts of information exchange by examinees taking the test at the same time. In the context of multiple choice online test environments, this is an increasing need, as fraud can be committed either by direct in-room communication or via instant messaging applications, for instance, WhatsApp and Messenger, as electronic communication is becoming more pervasive worldwide with the spread of the Internet [21]. As such, handling collusion, irrespective of the communication method, is the pivotal requirement tackled in this study.
In this context, the following major questions arise: Is it possible to identify collusion fraud taking into consideration both the selected options and their timestamps? How can collusion candidates be statistically tested to minimize false discoveries? Can we further inquire into the nature of inadvertent communication between students, including its directionality (inadvertent content sharing and/or consumption) and cardinality (number of involved students)? This work offers a comprehensive discussion of these research questions.
To this end, we propose a disruptive methodology to assess fraud communication which starts from a preanalysis of the data to accommodate distinct patterns of fraudulent behavior. The methodology rests on the following four major principles: 1) a statistical frame to assess the probability of pairwise student communication considering: a) matched answers; b) choice probability; c) response times (directionality); and d) recurrence of suspicious behavior; 2) a network representation of potential communication acts grounded on the previous probabilistic stance, allowing the assessment of directional multiwise communication acts in a multiple choice online test; 3) null models of compliance and fraud from the principled understanding of inadvertent communication to test collusion dynamics and identify them with strict guarantees of statistical significance; and 4) scoring, clustering, and visualization principles to facilitate the understanding of inadvertent communication pathways and promote the actionability of recommendations, supporting the course's tutor with subsequent inquiry acts and advertence initiatives.
As a case study, we consider online remote tests performed on the Quizzes Tutor platform, developed at Técnico Lisboa. Students receive the questions in different orders (shuffling) and cannot return to a question they have already answered or skipped. There is limited capacity to monitor the students' behavior. Periodic quizzes from a software architecture (SA) course are used to validate the proposal.
The remainder of this work is organized as follows. Section II introduces essential background. Section III discusses relevant work. Section IV introduces the target fraud detection model. In Section V, null models of student behavior are specified to assess differences between compliant and fraudulent behavior. Section VI proposes a methodology to detect potential fraud acts against behavioral expectations and find multiwise communication channels. Section VII discusses the acquired results, comparing the target model against existing scores. Deployment notes are discussed in Section VIII. Concluding remarks are finally provided.

II. PROBLEM FORMULATION
The target notation is now presented. Consider a course edition to be described by the attendees, a set of $n$ students, $S = \{s_1, \ldots, s_n\}$, and a set of questions $Q = \{q_1, \ldots, q_g\}$. A question set $Q$ is loosely used to represent either a single quiz (default), the questions from a set of quizzes, or, when accommodating confounding aspects, such as optional questions and examination shifts, the set of shared questions for a given group of students.
Let $\Omega$ be the set of all the available options for the questions contained in $Q$ and, in particular, $\Omega_k$ be the set of available options for question $q_k \in Q$. In the context of our work, one response $r_{ik}$ is a pair that contains an ordered set of selections in $\Omega_k$, $y_{ik}$, performed by student $s_i$ to question $q_k$, and an ordered set of timestamps, $t_{ik}$, which are monotonically increasing given that we are targeting evaluation settings where it is prohibited to return to a previous question. The order of the first set conforms to the order of the latter.
Let, in addition, $r_i$ be the set of responses $r_{ik}$ from student $s_i$ to the question set $Q$. Finally, let $x_{ik}$ be the last answer student $s_i \in S$ gave to question $q_k$, $f_{ik}$ be its timestamp, $x_i$ be the sequence of all final answers to $Q$, and $f_i$ be the timestamps corresponding to the final answers.
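The notation above can be mirrored in a small data structure. The following is a minimal Python sketch, not the authors' implementation: the class name `Response`, its field names, and the nested dictionary `R` are hypothetical choices made for illustration, and skipped questions are not modeled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Response:
    """One response r_ik: ordered selections y_ik and their timestamps t_ik."""
    selections: Tuple[str, ...]    # options picked for the question, in order
    timestamps: Tuple[float, ...]  # seconds since quiz start, one per selection

    def __post_init__(self):
        if not self.selections or len(self.selections) != len(self.timestamps):
            raise ValueError("need at least one selection and one timestamp per selection")
        if any(a > b for a, b in zip(self.timestamps, self.timestamps[1:])):
            raise ValueError("timestamps must not decrease")

    @property
    def final_answer(self) -> str:   # x_ik, the answer that is graded
        return self.selections[-1]

    @property
    def final_time(self) -> float:   # f_ik, the timestamp of the final answer
        return self.timestamps[-1]


# R maps each student to their responses per question (hypothetical layout).
R = {"s1": {"q1": Response(("A", "C"), (12.0, 20.5))}}
print(R["s1"]["q1"].final_answer, R["s1"]["q1"].final_time)
```

Only the final answer and its timestamp are needed by the simplified analyses that consider final responses.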

In the context of a given question set, $Q$, we can obtain the grade of a student $s_i$ using, for instance, the sum of scores, and the probability of correctly answering a question $q_k$ in $Q$ as an average of scores over the set of students $S$.
Consider the input course data to be the set of all student responses, $R = \{r_i \mid i = 1, \ldots, n\}$, to the undertaken online quizzes. Given $R$, the targeted problem is to identify and describe fraud behavior within and across quizzes. To this end, particular care is necessary to guarantee the statistical significance of the found associations, the traceability of the undertaken fraud behavior (together with the potentially involved students), and the actionability of recommendations.

III. RELATED WORK
As previous research shows, students cheat for numerous reasons, which are not strictly associated with online testing [22], [23]. These reasons may include low grades and ineffective study strategies; poor time management skills; personal values and views which relate to achievement, fear of punishment, class attendance, and peer pressure; extrinsic versus intrinsic motivations to learn; and age [10]. Ladyshewsky [10] observed that some student profiles will attempt to cheat regardless of the mode of instruction. Although earlier studies yield no conclusive evidence that remote online assessments increase cheating likelihood [24], [25], recent results show that there is a significant increase in dishonest behaviors in remote assessments [11], [23].
Ranger et al. [13] compared several cheating indicators and were unable to find indicators that could discriminate preknowledge from test collusion. Moreover, the authors found that indicators based on response times were capable of detecting preknowledge but not test collusion, and indicators based on response revisions were capable of detecting test collusion but lacked the power to detect preknowledge.
Of the various cheating indicators proposed in the recent Ranger et al. study [13], we highlight three statistics based on the selected responses (U1, U3, and CS) and four based on the editions/revisions to a given response (N1, NC1, N2, and NC2). The indicators based on the selected responses constitute the most basic way to analyze an examinee's performance since they solely require that, for each question, the option chosen by each student is saved, while the indicators based on the response's revisions are of particular interest since they are capable of detecting test collusion.
In our work, each student receives the questions that compose a quiz in a distinct order, with shuffled possible options, and is further prevented from going back and editing responses to previous questions. As such, collusion statistics based on response revisions are insufficient. In addition, most of the previous statistics, including those in [13], do not consider the concordance of responses between students against chance agreements, which is a normal condition for communication attempts within the class. Furthermore, some of the existing indicators neglect the rich temporal frame at which responses are provided, preventing the possibility to assess the significance and directionality of potential copy acts between students.
In the context of online courses with long-duration assessments, Ruipérez-Valiente et al. [29] found that close submitters needed a statistically significant lower amount of activity in the platform to successfully complete a course. Results show that most of the student user accounts were grouped as couples of close submitters, with some large communities also observed. In a similar context, Balderas et al. [30] considered fraudulent collaboration involving an arbitrary number of students. Given students $s_1$ and $s_2$, the targeted forms of suspicious behavior include $s_2$ starting the examination after the submission of $s_1$ and showing a better grade/completion time ratio. The sequential rules produced under the aforementioned principles were used to produce clusters of students involved in potential fraud acts. In spite of the relevance of these studies to find multiwise collaboration patterns, they focus on a single specific form of dishonest behavior observed in long-duration assessments.
Blockchain principles to separate malicious attacks from truthful events in online systems can be arguably considered for fraud detection purposes by considering the multiplicity of fraud statistics as voters. Considering online test environments, Cai et al. [31] propose a decision schema that tackles the problems of majority voting in the presence of dishonest voters (i.e., false-positive scores of fraud) by assigning awards when a voter's report is trusted according to a peer prediction scheme. The proposed scoring scheme is incentive compatible, with a maximum attained with honest reporting [31].
Comprehensive policy assessments undertaken by Bilen and Matros [32] conclude that capturing each student's computer screen and room is pivotal to decrease fraud intentions and further recommend avoiding grading on a curve to decrease cheating behaviors motivated by peer competition. Tiong and Lee [33] developed an e-cheating intelligence agent for online assessments that is further able to access the Internet Protocol (IP) address of the students, issuing alerts when students change their device or initial location. The agent is capable of preventive behaviors as it is dynamically able to reassign questions in instances where abnormal behavior is detected.
One of the challenges of working with statistical models of fraud is the inherent difficulty of identifying the cut-off values that separate normal from atypical response profiles. Man et al. [26] placed a supervised stance on fraud detection to tackle this challenge. Using predictive learning, the authors were able to compare the discriminative power of the statistics using the collected fraud evidence and further conclude that the use of predictive models able to combine multiple sources of information can lead to a higher detection rate over traditional item response and response time methods. Alexandron et al. [34] proposed a semisupervised anomaly detection approach, trained on a known set of cheaters, to detect fraud. The approach is shown to be capable of generalizing well toward cheaters with distinct behaviors. A new time-based statistic (the fraction of items that were solved correctly in significantly lower time than the average time of correct responses on those items) is proposed to assess aberrant behaviors.
Despite the relevance of the placed (semi)supervised stances, their transfer and deployment across different courses and cultures are arguably limited as they require the presence of expressive forms of cheating behavior from different contexts. The presence of ground truth to develop and assess academic dishonesty is a well-recognized difficulty [15]. Man et al. [26] considered a case study where students had the opportunity to illegally obtain exam content before assessment. Complementary cheating behavior during the exam was further flagged via postinvestigation clearance. To validate fraud detection in mixed face-to-face and online settings, Balderas et al. [30] considered the differential analysis of grades between settings to validate findings. Understandably, these assumptions disregard the fact that changes in academic performance can be undertaken with integrity and are further restricted to mixed evaluations in the context of a course or academic path. Bilen and Matros [32] and Tiong and Lee [33] validated fraud models by predefining cheating as the ability to quickly answer difficult questions. Although useful, these labeling assumptions are arguably biased, limited to specific forms of fraud, and dependent on parameterizable cut-off thresholds that assume the homogeneity of student profiles.

IV. FRAUD DETECTION MODEL
A sound statistic of collusion likelihood in online quizzes, able to integrate state-of-the-art stances on answer concordance and compatible response times, is now introduced. The intuition behind the proposed model is that the underlying communication acts between students are generally associated with concordant answers, with the direction of communication (whether content production or content consumption) being generally reflected in the time of response associated with a concordant answer. Complementarily, the lower the concordance likelihood for a given question (unlikely item selection against overall selections), the higher the collusion likelihood. Finally, collusion likelihood further increases with recurring suspicious behavior with the same set of students within a quiz or along multiple quizzes. Grounded on the aforementioned assumptions, the proposed fraud detection model measures the weight of direction-sensitive communication between students.
Let $M$ define the communication mode between students, $M = \{p, c\}$, where $p$ denotes content production and $c$ content consumption. Given a set of students $S$ and questions $Q$, $\text{weight}(s_i, s_j, m, Q)$, with $s_i, s_j \in S$ and $m \in M$, is a function that assesses the amount of communication between two students in $S$ through the communication mode $m$, in the context of a set of questions $Q$, possibly spanning one or multiple quizzes.
Note that weight is a total function, which means that it takes the value 0 when there is no communication of a particular type between two students. Given two students, $s_i, s_j \in S$, $\text{weight}(s_i, s_j, c, Q)$ denotes the amount of information that $s_i$ consumed from $s_j$, and $\text{weight}(s_i, s_j, p, Q)$ denotes the amount of information that $s_i$ shared with $s_j$. Since weight is a total function, it respects the restriction
$$\text{weight}(s_i, s_j, c, Q) = \text{weight}(s_j, s_i, p, Q). \tag{3}$$
To inquire about the mode of communication between two students, access to the response times is necessary. Recalling the problem formulation (see Section II), the response of a student $s_i$ to a question $q_k$, $r_{ik}$, is a tuple containing the answer selections $y_{ik}$ and their timestamps $t_{ik}$. Similarly, the time, after the start of the quiz, of the final attempt $x_{ik}$ of $s_i$ in a particular question $q_k$, defined as $f_{ik}$, is given by the function time, so that $\text{time}(s_i, q_k) = f_{ik}$. Given the set of possible selections $\Omega$, $\text{answer}: S \times Q \to \Omega$ denotes the final element in the sequence of answers given by a student to a question (i.e., the response which is going to be taken into consideration when evaluating the test), previously defined as $x_{ik}$. The function $\text{correct\_answer}: Q \to \Omega$ gives the correct answer to a particular question.
As students can receive different sequences of questions for the same quiz, it is relevant to define their order, $\text{sequence}: S \times Q \to \mathbb{N}$, specifying the permutation of questions per student.
The timestamps of the answer selections $y_{ib}$ for question $q_b$ should be greater than the timestamps of the selections $y_{ia}$ for question $q_a$ if $q_b$ comes after $q_a$ in the sequence of questions assigned to student $s_i$ in $Q$. In this context,
$$\text{sequence}(s_i, q_a) < \text{sequence}(s_i, q_b) \Rightarrow \text{time}(s_i, q_a) < \text{time}(s_i, q_b). \tag{5}$$
The communication between two students, $s_i, s_j \in S$, for a question, $q_k \in Q$, where $s_i$ is the producer, occurs when
$$\text{answer}(s_i, q_k) = \text{answer}(s_j, q_k) \land \text{time}(s_i, q_k) < \text{time}(s_j, q_k). \tag{6}$$
The functions share, consume: $S \times S \times Q \to \{0, 1\}$ determine whether or not there is sharing or consumption between two students. The values of $\text{share}(s_i, s_j, q_k)$ and $\text{consume}(s_j, s_i, q_k)$ are characterized by condition (6), which also illustrates that the consumption of information between two students is tied to a sharing of information between the same students.
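The two indicator functions and the ordering constraint can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: final answers and final timestamps are held in nested dictionaries (`answers[student][question]`, `times[student][question]`), a layout chosen here for convenience, and the helper name `follows_sequence` is hypothetical.

```python
def share(answers, times, i, j, k):
    """1 if i and j agree on question k and i answered strictly first, else 0."""
    return int(answers[i][k] == answers[j][k] and times[i][k] < times[j][k])


def consume(answers, times, i, j, k):
    """1 if i may have consumed k from j, i.e., j shared k with i."""
    return share(answers, times, j, i, k)


def follows_sequence(order, times_i):
    """Check that one student's final timestamps increase along the question order."""
    stamps = [times_i[q] for q in order]
    return all(a < b for a, b in zip(stamps, stamps[1:]))


answers = {"s1": {"q1": "B"}, "s2": {"q1": "B"}, "s3": {"q1": "D"}}
times = {"s1": {"q1": 40.0}, "s2": {"q1": 95.0}, "s3": {"q1": 60.0}}
print(share(answers, times, "s1", "s2", "q1"), consume(answers, times, "s2", "s1", "q1"))
```

A concordant answer contributes to exactly one direction of a pair, and tied timestamps contribute to neither because the inequality is strict.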
In this context, given a set $Q$ with $g$ questions, the estimated consumption weight between two students, $s_i$ and $s_j$, is
$$\text{weight}(s_i, s_j, c, Q) = \frac{1}{g} \sum_{k=1}^{g} \alpha_{ik} \, \text{consume}(s_i, s_j, q_k) \tag{7}$$
where $\alpha_{ik}$ is a weighting factor, possibly dependent on the student profile $s_i$ and question $q_k$. Similarly, the production weight is defined by
$$\text{weight}(s_i, s_j, p, Q) = \frac{1}{g} \sum_{k=1}^{g} \alpha_{ik} \, \text{share}(s_i, s_j, q_k). \tag{8}$$
By default, the weighting factor $\alpha_{ik}$ is defined using the frequency of the response of a selected option
$$\alpha_{ik} = 1 - \frac{1}{n} \sum_{j=1}^{n} I(x_{jk} = x_{ik}) \tag{9}$$
where $n$ is the number of students taking the exam, and $I$ is the indicator function, which returns 1 when the condition in parentheses yields True and 0 otherwise. Values of $\alpha_{ik}$ closer to one indicate that the selection is uncommon, implying higher weights when both students identically select a highly infrequent item. It can be observed that the given definition fulfills the conditions stated about the weight.
Additionally, the fraud score of a student is an estimator based on the weight of the most prominent communication channels held with other students. The fraud score is represented by the function score, which is defined in terms of the communication channels held between students. Consider $\beta \in \mathbb{N}^+$ to be the number of communication channels. Given a student $s_i \in S$ and a mode $m \in M$, the score combines the values $\text{top}(k)$ for $k = 1, \ldots, \beta$, where top denotes the $k$th highest value in the multiset containing the mode $m$ weights between $s_i$ and all other students, excluding itself, and $0 < \beta < n$.
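The weights can be sketched in Python as follows. The sketch rests on assumptions that should be checked against the published equations: the default factor is taken as one minus the fraction of students (the student included) who picked the same final option, and each weight is averaged over the $g$ questions. The dictionary layout and the function names `alpha` and `weight` are illustrative.

```python
def alpha(answers, i, k):
    """Assumed default factor: 1 minus the fraction of students sharing i's option on k."""
    n = len(answers)
    same = sum(1 for s in answers if answers[s][k] == answers[i][k])
    return 1.0 - same / n


def weight(answers, times, i, j, mode, questions):
    """Communication weight of student i towards j; mode is 'c' or 'p'."""
    if mode not in ("c", "p"):
        raise ValueError("mode must be 'c' (consumption) or 'p' (production)")
    total = 0.0
    for k in questions:
        if answers[i][k] != answers[j][k]:
            continue  # only concordant final answers contribute
        if mode == "c":
            hit = times[j][k] < times[i][k]   # i answered after j
        else:
            hit = times[i][k] < times[j][k]   # i answered before j
        if hit:
            total += alpha(answers, i, k)
    return total / len(questions)             # assumed normalization by g


answers = {
    "s1": {"q1": "A", "q2": "B"}, "s2": {"q1": "A", "q2": "B"},
    "s3": {"q1": "C", "q2": "B"}, "s4": {"q1": "D", "q2": "D"},
}
times = {
    "s1": {"q1": 10, "q2": 50}, "s2": {"q1": 30, "q2": 40},
    "s3": {"q1": 20, "q2": 60}, "s4": {"q1": 5, "q2": 70},
}
qs = ["q1", "q2"]
print(weight(answers, times, "s2", "s1", "c", qs))  # 0.25
print(weight(answers, times, "s1", "s2", "p", qs))  # 0.25
```

In the toy data, what `s2` consumes from `s1` equals what `s1` produces for `s2`, because two students who agree on a question share the same factor for it.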
Variations to the fraud detection model correspond to different parameterizations of the weight function and values of $\beta$ applied to the top function. For instance, by fixing (9) and $\beta = 3$, we are considering a fraud stance that considers the contributions from the three student communication channels with the highest production and consumption of contents. Fixing $\beta$ considerably below $n$, $\beta \ll n$, is suggested to remove contributions from channels with residual weights produced by spuriously concordant responses.
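The role of $\beta$ can be illustrated with a short Python sketch. The aggregation used here, a plain sum of the $\beta$ highest channel weights, is an assumption made for illustration, as is the function name `score`; other aggregations (for instance, an average) would fit the same description.

```python
import heapq


def score(channel_weights, beta):
    """Aggregate the beta highest weights between one student and all others.

    channel_weights holds one weight per other student for a fixed mode,
    so it has n - 1 entries and beta must satisfy 0 < beta < n.
    """
    if not 1 <= beta <= len(channel_weights):
        raise ValueError("beta must satisfy 0 < beta < n")
    return sum(heapq.nlargest(beta, channel_weights))  # assumed aggregation: sum


weights_of_s1 = [0.10, 0.40, 0.00, 0.25]  # toy consumption weights towards 4 peers
print(score(weights_of_s1, 1))  # 0.4, the single most prominent channel
```

With $\beta = 1$ the score reduces to the highest weight among a student's channels, and larger values of $\beta$ let recurrent communication with several peers accumulate.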

V. UNDERSTANDING FRAUD WITH NULL MODELS
The proposed fraud detection model in Section IV offers the possibility to quantify potential communication acts, weight them according to the likelihood of response selections, and further separate modes of communication considering response times. In the absence of fraud, communication scores can differ from zero due to the presence of spurious concordant responses. In this context, assessing expectations on the communication scores in the absence of fraud is of paramount importance to identify student-specific deviations that are potentially associated with fraudulent behavior.
To this end, this section introduces null models to understand how scores vary in the absence and presence of fraud. Accordingly, we propose null models of compliance (see Section V-A2) where students' answers are placed in the absence of communication acts, and null models of fraudulent behavior (see Section V-B) assuming explicit communication between the students within a group. The reasoning behind each null model is presented. For simplicity's sake, this analysis is pursued taking into consideration the final response each student gives to each question. For all the proposed null models, condition (5) introduced in Section IV holds.
Fig. 1(a)-(f) presents the distribution of consumption scores for each of the nonfraudulent null models. All results presented were obtained using 30 simulations and considering the regularities found in the set of questions of the quiz with id 14225, performed by SA students on the Quizzes Tutor platform (details in Section VII-A). Here, the consumption score of a student is defined as the highest weight found in the set of consumption edges for that student.
The distribution of the production scores is omitted since it is similar to the distribution of consumption scores, verifying property (3).

A. Null Models of Nonfraudulent Behavior
1) Students Answer Randomly:
To gather expectations on the weight of communication channels between nonfraudulent students, it is relevant to analyze a model where all students answer in a random fashion. The amount of information that exists in this situation helps us to define a threshold that establishes when there is explicit communication between students. For this model, there are no restrictions; every output of the functions answer and time is viable.
The generation of this model is trivial: a sequence of questions is randomly (uniformly) generated for each student, with incremental timestamps, and then the students' selections to each question are also randomly (uniformly) picked.
2) Students' Grades Influence Their Answers: To analyze the weight of communication channels between nonfraudulent students whose performance on the test is dictated by their knowledge (grade), a null model under these conditions was created. To generate this model, for each question, each student's performance (correct/incorrect answer) is determined by the student's course mark. If the answer is incorrect, the selected option is randomly chosen given the probability of picking each incorrect option obtained from real data. However, if for some question in the real data, no student picked an incorrect answer, the chosen incorrect option in the null model is uniformly selected. The timestamps are randomly generated.
Fig. 1(b) shows that the distribution of the consumption scores in this null model follows a Gaussian with a mean communication weight of 0.3.
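The generation of the grade-driven null model can be sketched as follows. This is a minimal sketch under stated assumptions: the course mark is rescaled to [0, 1] and read as the probability of answering each question correctly, timestamps are drawn uniformly over a hypothetical quiz `duration` and assigned along a shuffled question order, and all parameter names are illustrative.

```python
import random


def grade_null_model(marks, options, correct, observed_wrong, duration, rng):
    """Simulate final answers and timestamps without any communication.

    marks: student -> mark in [0, 1] (assumed probability of a correct answer)
    options: question -> list of available options (at least two)
    correct: question -> correct option
    observed_wrong: question -> {incorrect option: count seen in real data}
    """
    answers, times = {}, {}
    for student, mark in marks.items():
        order = list(options)
        rng.shuffle(order)                                   # shuffled sequence
        stamps = sorted(rng.uniform(0, duration) for _ in order)
        answers[student], times[student] = {}, {}
        for q, t in zip(order, stamps):                      # times follow the order
            wrong = [o for o in options[q] if o != correct[q]]
            if rng.random() < mark:
                pick = correct[q]
            else:
                counts = [observed_wrong.get(q, {}).get(o, 0) for o in wrong]
                if sum(counts) > 0:
                    pick = rng.choices(wrong, weights=counts)[0]
                else:                                        # no observed errors
                    pick = rng.choice(wrong)
            answers[student][q] = pick
            times[student][q] = t
    return answers, times


rng = random.Random(0)
options = {"q1": ["A", "B", "C"], "q2": ["A", "B", "C", "D"]}
correct = {"q1": "A", "q2": "D"}
observed_wrong = {"q1": {"B": 3, "C": 1}}   # no wrong answers observed for q2
sim_answers, sim_times = grade_null_model(
    {"s1": 0.9, "s2": 0.4}, options, correct, observed_wrong, 600.0, rng)
```

Repeating the simulation and scoring every simulated student yields an empirical distribution of scores under compliance.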
3) Students Answer in a Predefined Order: To study the impact of different assumptions, a third null model of nonfraud is considered where students answer questions in the same order, that is,
$$\forall_{s_i, s_j \in S} \, \forall_{q_k \in Q} \;\; \text{time}(s_i, q_k) < \text{time}(s_j, q_k) \text{ for } i < j \tag{11}$$
but the answer given is determined by each student's course mark, to assess the maximum weight of spurious communication between two students answering in predefined order.
To generate this model, a permutation of students is computed, timestamps are predetermined by this order, and the remaining parameters are set according to the second null model.
In Fig. 1(c), one can see two peaks: at 0 (the student who is always the first to answer) and around 0.4 (mean spurious communication weight).
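One simple way to produce timestamps that satisfy condition (11) is sketched below. The construction is an assumption made for illustration: all students are given the same question order, and the student of rank `r` answers the question at position `p` in time slot `p * n + r`. The function name and the `slot` parameter are hypothetical.

```python
import random


def predefined_order_times(students, questions, rng, slot=1.0):
    """Return a random student ranking and timestamps that respect it per question."""
    ranked = list(students)
    rng.shuffle(ranked)                       # permutation of students
    n = len(ranked)
    times = {
        s: {q: (pos * n + rank) * slot for pos, q in enumerate(questions)}
        for rank, s in enumerate(ranked)
    }
    return ranked, times


ranked, order_times = predefined_order_times(["a", "b", "c"], ["q1", "q2"], random.Random(1))
```

Under this construction the first-ranked student can never consume content, which is consistent with a peak at a zero consumption score.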
4) Students Receive the Same Sequence of Questions: It is of particular interest to study a model where students receive the same sequence of questions, which can be formulated as
$$\forall_{s_i, s_j \in S} \, \forall_{q_k \in Q} \;\; \text{sequence}(s_i, q_k) = \text{sequence}(s_j, q_k). \tag{12}$$
The purpose lies in identifying cases where the weight of sharing from one student to another is more evident, not because one shared information with the other, but because one answered before the other and, by chance, their options coincided. To generate the model, the sequence of questions is fixed first. As in the previous model, each student's performance is determined by their course mark. The timestamps for each question are randomly selected.
In Fig. 1(d), showing the distribution of scores for this null model, one peak is evident, around 0.35, similar to what happens in Fig. 1(b), yet the tails are now heavier.
5) Students Correctly Answer All Questions: A null model where students correctly answer all questions is also relevant to study weights under conformity and can be formulated as
$$\forall_{s_i \in S} \, \forall_{q_k \in Q} \;\; \text{answer}(s_i, q_k) = \text{correct\_answer}(q_k). \tag{13}$$
The timestamps dictate the weight of communication between two students. With this approach, it is possible to highlight the maximum values of communication weight between students who answer in varying orders. To generate the model, it is only necessary to assign the order of questions and the corresponding timestamps. For this approach, the $\alpha_{ik}$ factor, responsible for adjusting a contribution in accordance with the probability of selecting a given option (9), is disregarded, since it would otherwise be zero when all students select the same option, yielding null consumption and production weights.
The distribution of scores in Fig. 1(e) identifies four peaks (0.4, 0.6, 0.8, and 1).Since every answer is correct, students agree on the chosen option in all questions.Hence, there is nonzero communication between every two students.
6) Students Answer in a Predefined Order and Correctly: This model combines the principles of the previous two conditions: conditions (11) and (13). Fig. 1(f) depicts the distribution of the acquired scores. Understandably, density peaks are associated with a 0 score (first one to answer) and 1 score (remaining, fully concordant case).
7) Final Remarks: Various null models representing nonfraudulent behavior were tested to obtain insights into our scoring methodology. Distinct patterns, supported by occasional arrangements of answers, promoted different score distributions. As expected, the null model in which the only influence on students' answers is their grades [see Fig. 1(b)] is the one that best resembles the real dynamics of quiz answering under academic integrity. This distribution follows a Gaussian (Shapiro-Wilk at $\alpha = 10^{-3}$), yielding statistical properties of interest.

B. Null Model of Fraudulent Behavior
The existence of collusion implies that there is at least one student sharing information and one student receiving it. Therefore, in a fraudulent scenario, it is expected that students organize themselves in groups and can communicate with each other. Following this logic, we can define two roles, leader and copycat, which are not necessarily disjoint.
Collusion may occur in the context of pairwise communication between two students, as well as within larger student groups (multiwise communication channels) where the shared contents are accessible by a community.
In this context, a leader is someone who answers a question independently and shares that information with the group. The elements of the group which are not leaders, the copycats, may be in one of the two following situations when answering a question: the question has already been answered by a leader, so they can use the shared option, or the question has not yet been answered by a leader and they can decide on whether to wait until they receive the answer from a leader. This strategy is described by Krueger as picking "a 'sacrificial lamb' to take the online test first and bring back the questions to the group" [35].
Under different pressure conditions, collusion can be observed among knowledgeable peers [32]. In alternative forms of collusion, students with low performance can divide efforts in accessing and sharing external information within a single communication channel. In both the aforementioned scenarios, several leaders should be considered. In this context, we assume that if one of the leaders is about to answer a question that has already been answered by another leader, that leader chooses the option disclosed by the first leader.
More formally, let us define the partitioning $P$ of the set of students $S$, such that $\bigcup_{P_i \in P} P_i = S \land P_i \cap P_j = \emptyset$ for distinct $P_i, P_j \in P$. Each $P_i$ is then further partitioned into two groups: $P_{il}$ (set of leaders in group $i$) and $P_{ic}$ (set of copycats in group $i$), yielding $P_{il} \cup P_{ic} = P_i \land P_{il} \cap P_{ic} = \emptyset$. Here, the groups of leaders and copycats are disjoint. Given a question $q_k \in Q$ and group $P_i$, let us assume that every element of the group will answer the option of the first leader, that is,
$$\text{first\_leader} = \operatorname{argmin}_{s \in P_{il}} \text{time}(s, q_k). \tag{14}$$
As a result, the following restrictions are placed to define this model:
$$\forall_{s \in P_i} \; \big(\text{answer}(s, q_k) = \text{answer}(\text{first\_leader}, q_k) \land \text{time}(s, q_k) > \text{time}(\text{first\_leader}, q_k)\big). \tag{15}$$
1) Collusion With w Leaders per Group: This model is essential to study the communication weights between the leader students and the remaining students in a group, considering that there is an explicit transmission of information between them.
The model is generated, first, by partitioning the set of students. Each resulting group is partitioned into two sets, one describing the leaders and the other representing the copycats. In each group, the sequence of the quiz questions is randomly generated for each student. The leaders' performance is defined as explained in previous models, and the timestamps for the leaders are randomly generated. The first leader to answer each question is determined, and the option chosen by that leader is selected for every element of the group. The timestamps for the copycats are randomly generated, with the restriction that each should be greater than the timestamp of the first leader to answer that question.
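The generation procedure for one group can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `simulate_fraud_group`, the uniform timestamps, the convention that option 0 is the correct one, and the leader accuracy `p_correct` are all assumptions introduced for the example, and the per-student question ordering is omitted for brevity.

```python
import random

def simulate_fraud_group(leaders, copycats, n_questions, n_options=4,
                         p_correct=0.6, duration=360.0, rng=None):
    """Simulate the answers of one collusion group for one quiz.

    Returns {student: {question: (option, timestamp)}}.
    Hypothetical simplifications: option 0 is the correct option, each
    leader is correct with probability p_correct, timestamps are uniform.
    """
    rng = rng or random.Random()
    answers = {s: {} for s in list(leaders) + list(copycats)}
    for q in range(n_questions):
        # Each leader drafts an independent answer at a random timestamp.
        drafts = {}
        for s in leaders:
            if rng.random() < p_correct:
                option = 0
            else:
                option = rng.randrange(1, n_options)
            drafts[s] = (option, rng.uniform(0.0, duration))
        # The first leader to answer fixes the option for the whole group.
        first = min(drafts, key=lambda s: drafts[s][1])
        shared_option, t_first = drafts[first]
        for s in leaders:
            answers[s][q] = (shared_option, drafts[s][1])
        # Copycats answer after the first leader: 1 - random() lies in (0, 1].
        for s in copycats:
            t = t_first + (duration - t_first) * (1.0 - rng.random())
            answers[s][q] = (shared_option, t)
    return answers
```

Calling `simulate_fraud_group(["a"], ["b", "c"], 5)` yields a group of three with one leader, the configuration used for the reference null model of fraud.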
To assess the range of communication weights associated with likely fraudulent behavior, Fig. 2 shows the intersection point of consumption scores between the distributions of the nonfraud and fraud models (where only the copycats in the group are considered). The intersection occurs at 0.39. Scores above this value have higher density in the fraud model, while scores above 0.6 are present only in the fraud model.
2) Every Element of the Group Is Leader: This model is a particular case of the previous one. Communication weights highlight the explicit broadcast nature of sharing information. Students in the group access shared content, yet they also actively broadcast content.
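One way to locate such an intersection point is sketched below. It assumes that each score distribution is approximated by a normal density fitted to the simulated scores; this is an assumption made for illustration, since the text does not state how the curves are estimated. The function name `intersection_point` is hypothetical.

```python
from statistics import NormalDist

def intersection_point(nonfraud_scores, fraud_scores, tol=1e-9):
    """Find where two fitted normal densities cross, between their means.

    Assumption: both score samples are summarized by normal densities.
    """
    nonfraud = NormalDist.from_samples(nonfraud_scores)
    fraud = NormalDist.from_samples(fraud_scores)

    def diff(x):
        return nonfraud.pdf(x) - fraud.pdf(x)

    a, b = sorted((nonfraud.mean, fraud.mean))
    fa, fb = diff(a), diff(b)
    if fa * fb > 0:
        raise ValueError("densities do not cross between the two means")
    # Bisection on the sign of the density difference.
    while b - a > tol:
        m = (a + b) / 2
        fm = diff(m)
        if fa * fm <= 0:
            b, fb = m, fm
        else:
            a, fa = m, fm
    return (a + b) / 2
```

Scores above the returned value are more likely under the fitted fraud density than under the nonfraud one, which is the role the intersection point plays as a threshold.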

VI. COLLUSION FRAUD DETECTION
The next step is to identify, from the group of students in the analysis, the ones whose score is indicative of fraud. These are signaled as possible cheaters, and the course's professor may request an oral examination or other clearance initiatives. The end-to-end pipeline describing the proposed fraud detection methodology is provided in Fig. 3.

A. Assessing Individual Fraudulent Behavior
To understand the distribution of weights and scores, computed according to (10), in theoretical fraudulent and nonfraudulent environments, the null models presented in Section V are analyzed. The identification of potentially fraudulent students is done by three methods. The first is a unilateral Wilcoxon signed-rank test of each student's score against the baseline scores produced under the fraudulent null model.
More formally, let $Y$ be the random variable describing the students' scores in a fraudulent null model, which is generated by a significant number of simulations, and $X$ be the random variable describing the students' scores in the real model, for the set of questions $Q$, where $x_i \in X$ is the score of $s_i$, that is, $x_i = \text{score}(s_i, m, \beta)$, where $m \in \{c, p\}$ and $\beta \in \mathbb{N}^+$. Given $Q$, when analyzing student $s_i$, we are interested in the variable $Z_i = Y - x_i$, which is obtained by subtracting the score of $s_i$ in $Q$, $x_i$, from the observations drawn from $Y$.
Consider the random samples $y$ and $z_i$ obtained from the described random variables. Since we wish to identify scores with upward deviation, the null hypothesis states that the median of $z_i$ is positive (the score of the student under analysis is smaller than the scores of the students in the null model) against the alternative (the score of the student under analysis is greater than the scores of the fellow students in the null model). If the score of a particular student is above the median of scores obtained in a scenario where fraud is present, we may infer, with some confidence, that they may be dishonest. Otherwise, we can hypothesize, with some confidence, that they may not have committed fraud. The output p-value is then assessed at a significance level (1%). Students with p-values below this threshold should be signaled for postanalysis.
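The one-sided test can be sketched with the standard library alone. The sketch below uses the normal approximation to the signed-rank statistic without tie or continuity corrections, which is a simplification chosen for the example; the function name `wilcoxon_less_pvalue` is hypothetical. In practice, a library routine such as `scipy.stats.wilcoxon` with `alternative="less"` could be used instead.

```python
from math import sqrt
from statistics import NormalDist

def wilcoxon_less_pvalue(null_scores, student_score):
    """One-sided Wilcoxon signed-rank p-value for z = y - x.

    A small p-value supports the alternative that the median of z is
    negative, i.e., the student's score exceeds the null-model scores.
    Simplification: normal approximation, no tie/continuity correction.
    """
    diffs = [y - student_score for y in null_scores if y != student_score]
    n = len(diffs)
    if n == 0:
        raise ValueError("all null scores equal the student's score")
    # Rank the absolute differences, averaging the ranks of ties.
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    # Sum of the ranks attached to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return NormalDist().cdf((w_plus - mean) / sd)
```

A student is signaled when the returned p-value falls below the chosen significance level, for example 0.01.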
An alternative method is to use the interquartile range (IQR), or an alternative outlier statistic, inferred from the interval of scores computed from the real data. Students with outlier scores above the upper bound of the interval are noted as possible cheaters.
A third alternative is to rely on the intersection point of the score curves given by the null model of nonfraud and the null model of fraud, identifying as deviant those students whose scores are above the threshold defined by this point.
In the end, students are identified as fraudulent (against the reference null models) if $H_0$ in (16) is rejected for the chosen significance level; the student's score is above the upper bound of the IQR interval; and the score is above the threshold defined by the intersection point between null models in the absence and presence of collusion. Otherwise, they are identified as nonfraudulent.
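The combined decision rule can be illustrated as follows. The sketch assumes the common Tukey fence (third quartile plus 1.5 times the IQR) as the upper bound of the IQR interval, which the text does not specify, and takes the p-value and the intersection threshold as precomputed inputs. The function names are hypothetical.

```python
from statistics import quantiles

def iqr_upper_bound(scores, k=1.5):
    """Upper outlier fence Q3 + k * IQR (k = 1.5 is an assumption)."""
    q1, _, q3 = quantiles(scores, n=4)
    return q3 + k * (q3 - q1)

def flag_student(score, p_value, real_scores, intersection, alpha=0.01):
    """Evaluate the three criteria and their conjunction for one student."""
    criteria = {
        "wilcoxon": p_value < alpha,                     # H0 rejected
        "iqr": score > iqr_upper_bound(real_scores),     # outlier in real data
        "intersection": score > intersection,            # above null crossing
    }
    fraudulent = all(criteria.values())
    return {**criteria, "fraudulent": fraudulent}
```

Returning the individual criteria alongside the conjunction lets a tutor see which of the three signals triggered for a given student.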

B. Fraudulent Group Identification
Collusion fraud often involves more than two students who establish a channel of communication to inadvertently share and access information [30]. In online assessments, although collusion can occur via physical channels when students choose to occupy the same space, messaging platforms are believed to be the most common communication channel. Parks et al. [36] examined how social media can promote a collective movement toward cyber-cheating, identifying motivations and channels for group collusion.
The search for fraudulent groups may be framed as an unsupervised machine-learning problem, since it is unknown a priori who they are, but their characteristics may reveal some evidence. Clustering is suggested here to identify groups of users with dense connections. This is particularly relevant since students may organize in groups to facilitate collusion behaviors.
With this analysis, it is possible to study the number of students per cluster. This information can guide the parameterization of the null models in Section V-B. For simplicity, let $\text{weight}(s_i, s_j, p, Q) = \text{weight}(s_j, s_i, c, Q)$ (by definition) be denoted as $w_{s_i s_j Q}$ and $\text{weight}(s_i, s_j, c, Q) = \text{weight}(s_j, s_i, p, Q)$ (by definition) be denoted as $w_{s_j s_i Q}$.
To apply clustering algorithms to the data, a similarity measure should be defined. Here, we use
$$\text{sim}(s_i, s_j, Q) = w_{s_i s_j Q} + w_{s_j s_i Q} \tag{17}$$
where $s_i$ and $s_j$ are students in $S$, and $Q$ is a set of questions. It computes the amount of communication between these students by summing the weights of consumption and production.
As a result, we intend to group students with a potentially high transmission of contents, corresponding to high values of production, consumption, or, in accordance with the introduced similarity measure, their sum. On the other hand, cheating is frequently unidirectional, given that one student helps another. In this case, it is relevant to consider the maximum communication weight between them. In this context, the similarity could be alternatively defined as
$$\text{sim}(s_i, s_j, Q) = \max(w_{s_i s_j Q}, w_{s_j s_i Q}). \tag{18}$$
This way, the focus shifts to the one-way transmission of information: production or consumption. Either way, strong connections between candidate students or clusters of students are indicative of fraud predisposition, and therefore students can be grouped together.
The dissimilarity between two students can be defined as in (19), where $\text{sim\_max}(Q)$ is the maximum value in the similarity matrix. The proofs that the measures in (17) and (19) are valid (dis)similarities can be found in the Appendix.
With the aim of clustering fraudulent communities of students, similarity and dissimilarity matrices are produced using the described formulas. Then, agglomerative hierarchical clustering methods with single and average linkage are suggested to identify groups of students yielding either local or spread interactions with peers in a communication channel.
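A small, dependency-free sketch of this clustering step is given below. It uses the summed similarity in (17) and, as an assumption made only for this example, the normalized dissimilarity 1 − sim/sim_max; the dissimilarity actually adopted is the one given in (19). The function names, the student labels, and the weights in the usage note are hypothetical.

```python
from itertools import combinations

def similarity(w, a, b):
    """Summed directed weights between two students, as in (17)."""
    return w.get((a, b), 0.0) + w.get((b, a), 0.0)

def cluster_students(students, w, threshold, linkage="average"):
    """Agglomerative clustering, merging until the closest pair of
    clusters is farther apart than `threshold`.

    Assumption: dissimilarity = 1 - sim / sim_max (illustrative only).
    """
    if len(students) < 2:
        return [(s,) for s in students]
    sim = {frozenset(p): similarity(w, *p) for p in combinations(students, 2)}
    sim_max = max(sim.values()) or 1.0
    dis = {pair: 1.0 - s / sim_max for pair, s in sim.items()}

    def distance(c1, c2):
        values = [dis[frozenset((a, b))] for a in c1 for b in c2]
        if linkage == "single":
            return min(values)
        return sum(values) / len(values)  # average linkage

    clusters = [(s,) for s in students]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        if distance(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

For example, with made-up weights `{("s1", "s2"): 0.4, ("s2", "s1"): 0.3, ("s3", "s4"): 0.35, ("s4", "s3"): 0.3}` and a threshold of 0.2, students `s1` and `s2` form one cluster, `s3` and `s4` another, and any remaining student stays isolated.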

A. Case Study
We consider online remote tests performed on the Quizzes Tutor platform, developed at Técnico Lisboa, as a case study. In particular, we analyze quizzes from two courses: SA, lectured in the first semester of 2020/2021, and software engineering (SE), lectured in the second semester of 2020/2021. Due to space limits, SA results are primarily discussed. The design of the quizzes ensures that students receive the questions in a distinct, randomly chosen order, with shuffled options, and further prevents students from going back and editing responses to previous questions. The quizzes are designed to be part of the course's continuous evaluation, so their final contribution to the course grade is low, which is known to be associated with a lower tendency to cheat [10]. All the quizzes are performed at the end of each lecture, twice per week, with five MCQs and four options each. SA students had 6 min to complete each exam, whilst SE students had 5 min.

B. Fraud Detection Experiments
The principles laid out in Sections IV-VI form a methodology to assist tutors in detecting collusion events (see Fig. 3). Once the network with directional communication weights is computed, a natural subsequent step is the identification of collusion groups, that is, detecting whether students are organized in groups with the aim of sharing or accessing real-time information about an exam while it is taking place. To this end, the introduced clustering stance (see Section VI-B) is pursued. Major results on SA are presented.
Consider S to be the set of students taking the SA course and Q to be the set of questions of the quiz with id 14225. Fig. 4 illustrates the clustering case with average linkage. A color threshold of 0.2 is set to better separate clusters whose dissimilarity between their elements is below this value. It is possible to identify a cluster of students 810 and 19979, potentially involving student 19988 (dark green); and a cluster of students 13089 and 19867 (orange). Although fraud appears to occur in compact groups of two or three students, larger communication channels should not be excluded at this stage to study the different possibilities of collusion. We now assess whether these associations deviate from expectations. This is a subsequent step in the methodology. To this end, let us assume students organize themselves in groups of three, where two consume content from the remaining student, who acts as content producer. In this context, we can fix a null model of fraud (expectations of fraudulent behavior) assuming the size of communication channels and the number of content producers. Results on the presented null models were obtained using 30 simulations and considering the regularities found in Q, the set of questions of the SA quiz with id 14225.
Fig. 5(a) presents the distribution of the weight of associations in the respective null model of fraud, where the weight corresponds to the likelihood of fraudulent behavior. Lower values of weight (around 0.1) have higher density, which is expected since these weights correspond to interactions between students outside collusion groups. As such, their communication is reduced when compared to the communication between students of the same group, which yields higher weights (approximately 0.35) with lower density. Fig. 5(b) depicts the distribution of association weights in the null model where the unique influence on the answers is the grade. Finally, Fig. 5(c) presents the distribution of association weights on the real data. The computed weights combine both channels of communication: production and consumption. Peaks are observed at lower weights, around 0.05 and 0.15, which appears to indicate that the communication is, in general, spurious and no fraud has been committed; otherwise, we would observe higher density around 0.35, in accordance with the null model of fraud. Fig. 6 provides the distribution of the weight function within a fraudulent group against association weights outside a group, revealing significant differences between the weights of edges connecting students in the same group and edges connecting students of different groups, as expected given the pursued null definition of fraud.
In previous examples, where we consider the presence of groups of three students with one leader, the score of fraud of a student in this quiz is computed according to (10) with β = 1. If there is evidence of access to multiple content producers (leaders), β can be increased accordingly.
In the presence of expectation levels on what is likely a fraudulent behavior (e.g., scores above the intersection point between null distributions in Fig. 6), we can now move to the comprehensive network-based view of associations to assess collusion between pairs of students. Fig. 7(a) shows a representation of the communication between students in the null model of fraud with groups of three and one leader.
Here, the leaders and copycats are easily distinguished, as the former are represented by big purple circles (as they produce more than they consume) and the latter by big pink circles (higher consumption). In Fig. 7(b), representing the communication between students in the null model of nonfraud where grades are the unique influence on the responses, circles are smaller than the ones presented in the previous graph, as the existing communication between students is spurious. Understandably, the difference between producers and consumers is less evident. Fig. 8 provides the network model from the real answers to quiz 14225, SA. Nodes are generally smaller than in previous networks, and the differences between consumption and production are subtle, indicating that, if existent, the occurrences of fraud in quiz 14225 are scarce.
Fig. 2 (in Section V-B) showed the presence of statistically significant differences between the scores in the null models of fraud and nonfraud, as theoretically expected. Complementarily, we now assess how fraud intersection thresholds vary for communication channels with a higher number of consumers and producers (leaders). Tables II-V show that, for a fixed group size, the intersection point is lower when the number of leaders is higher. The mean and standard deviation of the curve corresponding to the null model of nonfraud for consumption edges are 0.3129 and 0.0696, respectively. For a fixed number of students in a group, the greater the number of leaders, the closer the scores to the null model of fraud, hampering the separation of behaviors, as further illustrated in Fig. 9(a) and (b).
Fraud detection is the final step of the proposed methodology. Decisions under a significance level of α = 0.1 are illustrated in Fig. 10 for the SA quiz 14225. Exploring consumption scores [see Fig. 10(a)], we observe the presence of one student with scores higher than the median score in the reference null model of fraud, and one student with a score above the upper bound of the IQR interval and the intersection of score curves. The analysis of production scores further indicates the presence of a student potentially involved in content sharing. Consider now Q to be a set of ten quizzes in the SA course. The sets $S$ and $S'$ are left unchanged. In Table VI, we present the number of students in each category of fraud for each quiz. The first column refers to the result of applying the statistical test in (16), using the null model of fraud with groups of three students and one leader; the second to the intersection point between the curves representing the null models of fraud and nonfraud; and the third to the IQR metric computed over the scores obtained using the real data. The fourth column contains the number of students who did not meet any of the previous criteria. The acquired results reveal quiz 14409 to be associated with the highest number of potential fraud acts against the null model of fraudulent behaviors (four occurrences). Quiz 14331 had the highest number of students, 7, with scores above the intersection point between the score curves of the fraud and nonfraud null models, followed by 14265. Taking into consideration outlier IQR statistics, the highest number of fraudulent students was identified in quiz 14333 (five occurrences). For all quizzes, the majority of the students were designated as nonfraudulent.
Table VII depicts, for each student, the number of quizzes with a fraud occurrence per criterion. A random sample of ten students is considered. Tables VI and VII refer to copy acts (consumption scores). Students 1012, 19930, and 19939 were the only ones signaled as fraudulent with respect to some fraud categories. In particular, students 1012 and 19930 were determined as fraudulent by the three introduced metrics. In the majority of the quizzes, fraudulent students were not encountered.

C. Computational Complexity
Given $n$ students and $g$ questions, the time complexity to compute the item selection probability for all questions is $O(ng)$, the answer precedence between two students from their timestamps is $O(g)$, the posterior weight calculus [see (7) and (8)] is $O(g)$, the network inference is then $O(ng + n^2 g) = O(n^2 g)$, and the subsequent scoring of all students in the network according to (10) is $O(n^2 \beta)$. Accordingly, the principled generation of quiz answers and subsequent description of null models is $O(kn^2(\beta + g))$, where $k$ is the number of simulations. The fraud detection step against the precomputed null model thresholds is linear in the number of students and null models; hence, the overall time complexity is $O(kn^2(\beta + g))$, with $k = 1$ for precomputed null models, and the memory complexity is $O(n^2)$.
To measure performance, we performed load tests using the deployed fraud detection system at the Quizzes Tutor platform (see Section VIII). Two tests were done on a server running Ubuntu 18.04.3 LTS with four cores and 16 GB of RAM. The data necessary to assess fraud was obtained for each one of the quizzes of the two courses, SA and SE.
Table VIII presents the results. The latency has an average between 1 and 1.7 s, where the difference is due to the number of students per quiz (average 45 and 81, respectively, for each one of the courses). We consider the values acceptable for the teacher to wait until she can start analyzing the results. Although the total number of quizzes significantly differs between the two courses (24-78), the analysis of the different values consistently shows that latency correlates with the number of quiz answers (number of students answering the quiz), in conformity with the aforementioned time complexity.

VIII. FRAUD DETECTION SYSTEM: DEPLOYMENT AND VALIDATION
The proposed fraud detection methodology, implemented in Python, is made available as an analytical module in the Quizzes Tutor platform. This platform is frequently used for online quiz assessments by several courses at Instituto Superior Técnico, Universidade de Lisboa. The provided fraud detection facilities have undergone successful deployment and validation stages and are available to the academic community with the necessary disclaimers on adequate use and limits of actionability.
The deployed instance generates the communication network considering all answers, both correct and incorrect, applying the real model. Production and consumption scores are calculated using pairwise communication channels by default, that is, β = 1.
Visualizations with strict usability guarantees are provided to aid the analysis of critical cases. Fig. 11 shows the consumption and production score violin charts for a five-question quiz. The teacher can interact with the graphs to obtain information about a particular student, for instance, an outlier student with deviant scores. In the example given in Fig. 11, the student name is anonymized.

IX. CONCLUSION
This work introduced a novel methodology to assess likely fraud communication acts in remote online MCQ exams based on the concordance of responses and answer times. Null models are produced to understand regular versus fraud dynamics and to identify collusion with strict guarantees of statistical significance. Complementarily, clustering algorithms are applied to unravel communication channels between students. Considering matched answers, choice probability, response times (directionality), and recurrence, we show that it is possible to create a network of potential communication acts between students. Having constructed the network for null models representing fraudulent and honest behavior, we obtain insights into how to separate spurious communication from the actual interchange of information. Finally, employing these insights on the real data, and making use of scoring techniques, we are able to categorize each student with respect to their fraud likelihood and thus understand inadvertent communication pathways and promote the actionability of recommendations, supporting the course's tutor with the subsequent inquiry or warning initiatives.
The application of the proposed principles in the context of the SA course reveals students with a higher fraud likelihood and already proves to be a solid criterion to guide tutors in ascertaining collusion and discouraging communication.
In this work, fraudulent behavior analysis was primarily pursued in the context of a single quiz. However, if deviant behavior is detected in more than one quiz, the chances of fraudulent behavior considerably increase. In this context, binomial testing can be straightforwardly applied to identify the probability of observing a given number of potential fraud acts.
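As a sketch of such a binomial test, the function below computes the probability that a student is signaled in at least a given number of quizzes by chance alone. It assumes that signals are independent across quizzes and that an honest student is signaled in any single quiz with a fixed probability `p_flag` (for instance, the significance level of the per-quiz test); both are assumptions introduced for this illustration, as is the function name.

```python
from math import comb

def binomial_tail(n_quizzes, n_flags, p_flag):
    """P(X >= n_flags) for X ~ Binomial(n_quizzes, p_flag)."""
    return sum(
        comb(n_quizzes, k) * p_flag ** k * (1 - p_flag) ** (n_quizzes - k)
        for k in range(n_flags, n_quizzes + 1)
    )
```

A small tail probability, for example for a student signaled in three of ten quizzes with `p_flag = 0.01`, indicates that the recurrence is unlikely to be explained by chance under these assumptions.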
The reported findings open guidelines to establish both preventive and reactive policies for fraud control. The disclosure of the proposed fraud detection model preventively discourages collusion acts. The acquired results further support the role of assigning distinct orders of questions, shuffling item options, and preventing students from editing earlier responses. Complementary strategies should be considered, including continuous authentication to prevent impersonation, online proctoring (whether human or automated) to promote academic integrity [22], stratified exam contents (e.g., pools of alternative questions), attitude formation (e.g., emphasis on learning, formative assessments), and cheat-resistant software facilities, including browser tab lockers [37], IP change detectors [33], and wireless jammers [38].

Fig. 1. Null distribution of consumption scores. (a) Students answer randomly. (b) Grade is the unique answer influence. (c) Students answer in a predefined order. (d) Students receive the same question sequence. (e) Students correctly answer all questions. (f) Students correctly answer questions in a predefined order.

Fig. 1(a) describes this null model, showing density peaks and communication weights likely contained in [0.3, 0.45].

Fig. 2. Distribution of consumption scores: the null model of nonfraud versus the null model of fraud with groups of three and one leader.

Fig. 3. Major steps of the proposed fraud detection methodology.

Fig. 5. Distribution of association weights (potential acts of fraud) considering the regularities of SA quiz 14225. (a) Null model of fraud with groups of three students and one leader. (b) Null model where grades are the unique response influence (no fraud). (c) Real student data.

Fig. 6. Distribution of weights in the null model of fraud with groups of three students and one leader, within and outside a group.

Fig. 7. Communication graphs. (a) Students in the null model of fraud (groups of three and one leader). (b) Students in the null model where grades are the unique influence on the responses (no fraud).

Fig. 9. Distribution of scores in the null models of nonfraud and fraud with groups of six and a distinct number of leaders. (a) Consumption scores. (b) Production scores.

Fig. 11. Interactive violin charts for inspecting students with deviant consumption and production scores, available at Quizzes Tutor.

TABLE II
INTERSECTION POINT OF SCORE CURVES OF NONFRAUD (GRADES CONSIDERED) AND FRAUD FOR CONSUMPTION EDGES

TABLE III
AND DEVIATION OF CONSUMPTION SCORES (NULL MODEL OF FRAUD)

TABLE IV
INTERSECTION POINT OF SCORE CURVES OF NONFRAUD (GRADES CONSIDERED) AND FRAUD FOR PRODUCTION EDGES

TABLE VI
DETECTED FRAUDS PER QUIZ (CONSUMPTION MODE), SA

TABLE VII
DETECTED FRAUDS PER RANDOMLY SELECTED STUDENTS (CONSUMPTION MODE), SA

TABLE VIII
PERFORMANCE MEASUREMENT FOR FIVE QUESTION QUIZZES