Distributed clustering algorithm for adaptive pandemic control

The COVID-19 pandemic has had severe consequences on the global economy, mainly due to indiscriminate geographical lockdowns. Moreover, the digital tracking tools developed to survey the spread of the virus have generated serious privacy concerns. In this paper, we present an algorithm that adaptively groups individuals according to their social contacts and their risk level of severe illness from COVID-19, instead of geographical criteria. The algorithm is fully distributed and, therefore, individuals do not learn any information about the other members of the group they belong to. Thus, we present a distributed clustering algorithm for adaptive pandemic control.


I. INTRODUCTION
COVID-19 [1] is a disease caused by the new coronavirus SARS-CoV-2. It was declared a pandemic by the World Health Organization (WHO) in March 2020. The first cases were reported in Wuhan, People's Republic of China, to the WHO on December 31st, 2019. Since then, 117.7 million cases have been reported, with more than 2.6 million deaths, as of March 10th, 2021 [2]. Those at a higher risk of severe illness from COVID-19 include those aged 60 or over, or with underlying medical problems such as diabetes, cancer, or high blood pressure. Nevertheless, this highly infectious disease can affect anyone, and can become deadly at any age. Personal health precautions are strongly advised, mainly wearing a mask, physical distancing and handwashing [1].
In response to the pandemic, governments all over the world have implemented non-pharmaceutical measures in order to stop the spread of the virus, or flatten the curve. Social distancing interventions, such as isolation and quarantine of infected patients and their contacts, external and internal border restrictions, workplace distancing, closure of schools, and complete quarantine or lockdown have been the most common [3], [4]. FluTE, a stochastic influenza pandemic simulation model [5], was used to assess the potential effect of different social distancing interventions using Singapore as a study case [6], since it was among the first to report infections. The model predicted quarantine or lockdown to be the most effective measures, particularly combined with school closures and workplace distancing. In fact, Singapore successfully implemented these measures, preventing community spread [6]. It is important to point out that these measures are targeted geographically [7]. This geographical approach affects large population groups, regardless of their economic sector or activity. Therefore, these measures have severe consequences on the regional, national and global economy: they pose a risk of reduced income or even job loss, affecting the most disadvantaged populations [8], and results show an average 2.5-3% global GDP drop per month of complete lockdown [9]. This shows that, despite lockdown and quarantine being the most effective measures, a different non-geographical approach should be taken in order to overcome the aforementioned negative impacts. Furthermore, these measures are most efficient when applied to individuals that belong to groups where transmission is most likely to occur [10]. Hence, individuals should be grouped according to their social contacts, which might not necessarily coincide with geographical areas. However, if the criteria are not geographical, it is more difficult for individuals to know which group they belong to. 
Furthermore, such groups may change with time and adaptive grouping strategies are needed.
Public health experts across institutions and countries have identified digital tracking measures as useful tools to survey and slow down the spread of the virus. Numerous technologies have been developed for this purpose, such as digital health certificates, which assign a color-coded COVID status to their users, physical surveillance initiatives [11], symptom checkers, or flow modelling tools, which quantify and track people's movements in specified geographical regions [12]. These technologies, however, raise severe ethical concerns about putting users' privacy and security at risk. For instance, out of the 65 digital health certificate applications that are currently in operation globally, 82% are considered to have inadequate privacy policies [11].
Among the most common digital tracking measures are proximity or contact tracing tools, mainly implemented as mobile applications. In particular, studies have predicted them to be beneficial in mitigating the spread of the virus, specifically during the de-escalation of physical distancing [13]. There are over 120 contact tracing applications currently available in over 70 countries [11]. These contact tracing tools gather data from their users, such as their location, their health records or contact information. This has raised ethical concerns surrounding the privacy of users and their data.
For instance, one of the earlier contact tracing tools developed was Singapore's TraceTogether [14], a mobile application which operates via Bluetooth connection. Nearby phones, with Bluetooth and TraceTogether open in the background, exchange tokens, which are stored encrypted in each phone and in a central server [15]. If a user tests positive for COVID-19, contact tracers can easily use the tokens to identify those at high risk of infection. TraceTogether does not gather more than the necessary information, only the users' contact/mobile number, identification details and random ID. The tokens sent via Bluetooth are time-varying random strings, and this way, privacy between users is kept. However, when a user is infected, the government can retrieve all mobile numbers of the individuals the infected user has been in contact with [15]. Having this centralized approach leaves no privacy for users from authorities.
To overcome the privacy concerns of a centralized approach, Apple and Google developed, in an unprecedented joint effort, a contact tracing platform based on Bluetooth [16]. Specifically, they developed an application programming interface (API) that allows interoperability between Android and iOS. This API requires contact tracing applications to take a decentralized approach. The contact matching analysis is performed at a local level, which also protects users' privacy, maintaining their anonymity. Over 37% of contact tracing applications now use Apple and Google's API [11].
In this paper, we propose a distributed algorithm that adaptively groups individuals (i.e., creates clusters) according to their social contacts and their risk level of severe illness from COVID-19. The population and its interactions will be modelled as a doubly-weighted undirected graph. Moreover, by combining our algorithm with a distributed consensus algorithm, each individual can know the epidemiological situation of the group they belong to and can take the social distancing measures that correspond to the epidemiological situation of their group.
There exist many algorithms to create clusters and, in particular, many works on privacy-preserving clustering have been conducted (see, e.g., [17]-[21]). These works are based on statistical or cryptographic techniques to protect data. Our algorithm can use some of the above-mentioned techniques to become privacy-preserving between nearby users but, since it is fully distributed, individuals do not share any information about the cluster they belong to even if no cryptographic methods are used. Therefore, privacy from authorities is kept. That is, only the individuals themselves know which group (cluster) they belong to without having knowledge of the rest of the members of the group.
In the literature, many works deal with distributed clustering of data using a wide variety of techniques and applying the results to different fields (see, e.g., [22]-[29]). In this paper, we focus on spectral clustering techniques because they are easy to implement and have been shown to be more effective in finding clusters than some traditional algorithms such as k-means [30]. Among the previously cited works, [27]-[29] present a similar approach to the one considered in this paper. Specifically, in [27] the authors propose a distributed spectral clustering algorithm but they consider weights neither in the nodes nor in the edges. In [28], the authors propose a distributed spectral clustering algorithm but they only consider an edge-weighted graph. Finally, in [29] a spectral clustering algorithm for doubly-weighted graphs is proposed but, unlike here, the algorithm is not distributed.
The remainder of this paper is organized as follows. Section II states preliminary considerations regarding distributed computation and spectral clustering. Section III presents the distributed clustering algorithm for adaptive pandemic control, its convergence speed, and its computational complexity. Finally, two illustrative examples and some conclusions are given in Sections IV and V, respectively.

A. DISTRIBUTED COMPUTATION USING A LINEAR ITERATIVE ALGORITHM
Consider a network composed of $n$ nodes, where each node represents the mobile phone (or similar device) of one person. The entire population and the interactions among them can be viewed as an undirected graph $G = (V, E)$ with no loops, where $V = \{1, 2, \ldots, n\}$ is the set of nodes (vertices) and $E$ is the set of edges. If two nodes $i, j \in V$ interact, then $\{i, j\} \in E$. We say that these nodes are connected, and can therefore interchange information. Conversely, if $\{i, j\} \notin E$, nodes $i, j \in V$ are not connected and cannot interchange information.
We assume that each node $i \in V$ has an initial value $x_i(0) \in \mathbb{R}$, where $\mathbb{R}$ denotes the set of real numbers. In distributed computation each node computes its target value by interchanging information with its neighbouring nodes. The approach that will be considered here for distributed computation is to use a linear iterative algorithm of the form
$$x_i(t+1) = w_{i,i}\, x_i(t) + \sum_{j : \{i,j\} \in E} w_{i,j}\, x_j(t), \quad i \in V, \tag{1}$$
where $w_{i,j} \in \mathbb{R}$ and time $t \in \{0, 1, 2, \ldots\}$ is assumed to be discrete (see [31]). Let $x(t) = (x_1(t), x_2(t), \ldots, x_n(t))^\top$ be the column vector with the values of the nodes at time instant $t$, where $\top$ denotes transpose. The linear iterative algorithm (1) can then be written as
$$x(t+1) = W x(t), \tag{2}$$
where $W$ is the $n \times n$ matrix defined as
$$[W]_{i,j} = \begin{cases} w_{i,j} & \text{if } i = j \text{ or } \{i,j\} \in E, \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$
for $i, j \in V$.
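As a minimal sketch of the linear iteration described above, the following Python fragment performs one synchronous step in which every node combines its own value with the values of its neighbours only. The function name `linear_iteration_step`, the dictionary layout, and the averaging weights on a three-node path graph are illustrative assumptions, not part of the original algorithm description.

```python
from typing import Dict, List, Tuple


def linear_iteration_step(
    x: List[float],
    weights: Dict[Tuple[int, int], float],
    neighbours: Dict[int, List[int]],
) -> List[float]:
    """One step x_i(t+1) = w_ii * x_i(t) + sum over neighbours j of w_ij * x_j(t)."""
    new_x = []
    for i in range(len(x)):
        value = weights[(i, i)] * x[i]      # own contribution
        for j in neighbours[i]:             # only direct neighbours are used
            value += weights[(i, j)] * x[j]
        new_x.append(value)
    return new_x


# Hypothetical path graph 0 - 1 - 2 with illustrative averaging weights.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
weights = {
    (0, 0): 0.5, (0, 1): 0.5,
    (1, 0): 0.25, (1, 1): 0.5, (1, 2): 0.25,
    (2, 1): 0.5, (2, 2): 0.5,
}
x0 = [1.0, 0.0, 0.0]
x1 = linear_iteration_step(x0, weights, neighbours)
```

Each node reads only its own entry and those of its neighbours, which is what allows the update to run without any central coordinator.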

B. SPECTRAL CLUSTERING
Clustering a graph consists in separating the nodes of the graph into disjoint groups (clusters). There exist many algorithms for graph clustering. The approach that will be considered here is the so-called spectral clustering (see, e.g., [32]-[34]). Spectral clustering is based on the information provided by the eigenvectors of the Laplacian matrix of the graph [35], mainly by an eigenvector corresponding to the smallest nonzero eigenvalue of that matrix, known as the Fiedler vector [36].
In this paper $G$ is assumed to be a doubly-weighted graph, that is, a graph with weights both in the nodes and in the edges. We denote by $\delta_i > 0$ the weight of node $i$, for $i \in V$, and whenever $\{i, j\} \in E$ we denote by $\sigma_{i,j} > 0$ the weight of that edge.
In [29, Lemma 1], in the context of doubly-weighted graphs, the notion of weighted Laplacian matrix was presented. The weighted Laplacian matrix $L$ of the graph is the $n \times n$ matrix given by (4). Let $L = U \operatorname{diag}(\lambda_1, \ldots, \lambda_n) U^{-1}$ be an eigenvalue decomposition of $L$, where the eigenvalues are arranged in non-decreasing order and the eigenvector matrix $U = [u_1 | u_2 | \ldots | u_n]$ is real and orthogonal. Assume that $G$ has $k$ components. Then, $\lambda_1 = \cdots = \lambda_k = 0$. In [37, Section 5.1] it is shown that $[u_{k+1}]_i$ indicates which cluster the node $i$ belongs to.
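To make the role of the node and edge weights concrete, the sketch below builds a weighted Laplacian for a small doubly-weighted graph. It assumes, purely for illustration, the symmetric form with diagonal entries $\sum_j \sigma_{i,j} / \delta_i$ and off-diagonal entries $-\sigma_{i,j} / \sqrt{\delta_i \delta_j}$; the definition used in [29] may differ. The helper name `weighted_laplacian` and the example weights are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def weighted_laplacian(
    delta: List[float], sigma: Dict[Tuple[int, int], float]
) -> List[List[float]]:
    """Assumed symmetric form: L_ii = sum_j sigma_ij / delta_i,
    L_ij = -sigma_ij / sqrt(delta_i * delta_j) for each edge {i, j}."""
    n = len(delta)
    L = [[0.0] * n for _ in range(n)]
    for (i, j), s in sigma.items():
        L[i][i] += s / delta[i]
        L[j][j] += s / delta[j]
        L[i][j] = L[j][i] = -s / math.sqrt(delta[i] * delta[j])
    return L


# Hypothetical 4-node path graph: two strong contacts joined by a weak one.
delta = [1.0, 4.0, 1.0, 4.0]
sigma = {(0, 1): 2.0, (1, 2): 0.1, (2, 3): 2.0}
L = weighted_laplacian(delta, sigma)

# Under this assumed form, (sqrt(delta_1), ..., sqrt(delta_n)) lies in the
# null space of L for a connected graph, i.e. it belongs to eigenvalue 0.
null_vector = [math.sqrt(d) for d in delta]
residual = [sum(L[i][j] * null_vector[j] for j in range(4)) for i in range(4)]
```

Under this assumption every entry in row $i$ depends only on $\delta_i$, the weights of the neighbours of $i$, and the weights of the edges incident to $i$, so each node can build its own row locally.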

A. PROPOSED ALGORITHM
Consider a set of $n$ individuals that interact in a certain geographical region. The entire population and the interactions among them will be modelled with a doubly-weighted undirected graph $G$ with no loops. Node $i$ of the graph represents the $i$-th individual, and the weight of node $i$, $\delta_i$, represents the individual's risk level of severe illness from COVID-19. The edge $\{i, j\}$ of the graph represents that there exists an interaction between individuals $i$ and $j$, and the weight of the edge, $\sigma_{i,j}$, represents the time frame of the social contact between them.
In this section we present an algorithm that adaptively groups individuals according to their social contacts and their risk level of severe illness from COVID-19, that is, we present an algorithm for clustering the doubly-weighted graph $G$. Since the goal is to keep privacy from authorities, the algorithm presented here is fully distributed. Specifically, it computes the eigenvector $u_{k+1}$ of the Laplacian matrix $L$ of the graph $G$ in a distributed way (see Algorithm 1).
The rest of this section is devoted to proving that $u_{k+1}$ can be computed in a distributed way (Theorem 1). Theorem 1 directly provides the steps of Algorithm 1.
Theorem 1: Consider a doubly-weighted undirected graph $G$ with no loops, $n$ nodes, and $k$ components. Let the Laplacian matrix $L$ of the graph $G$ be as in (4) with $\lambda_{k+1} < \lambda_{k+2}$. Then, for almost every column vector $x(0)$, the limit in (5) equals $C u_{k+1}$, where $x(t)$ follows the iteration (6), $x(t+1) = \left(I_n - \frac{1}{\lambda_n} L\right) x(t)$, $C$ is a non-zero constant, and $I_n$ denotes the $n \times n$ identity matrix.
Proof: See Appendix A.
In the rare case in which $\lambda_{k+1} = \lambda_{k+2}$ the Fiedler vector would not be unique, meaning that it might be any vector in a subspace of dimension larger than one. In this rare case, Algorithm 1 would still work because it would converge to one such vector.
Observe that the iterative equation (6) can be computed in a distributed way since it is of the form of (2), and $I_n - \frac{1}{\lambda_n} L$ satisfies (3). Therefore, from (5) each node $i \in V$ can know the $i$-th entry of an eigenvector associated to $\lambda_{k+1}$. However, in order to compute (6) in a distributed way, each node needs to know $\lambda_n$. Lemma 1 shows that $\lambda_n$ can also be computed in a distributed way.
Lemma 1: Consider a doubly-weighted undirected graph $G$ with no loops, $n$ nodes, and $k$ components. Let the Laplacian matrix $L$ of the graph $G$ be as in (4). Then, for almost every real $n$-dimensional column vector $y(0)$, $\lambda_n$ is given by the limit in (7), where $y(t)$ follows the iteration (8), $y(t+1) = L\, y(t)$.
Observe that the iterative equation (8) can be computed in a distributed way since it is of the form of (2), and $L$ satisfies (3). Therefore, from (7) each node $i \in V$ can know $\lambda_n$.
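The idea behind Lemma 1 is the classical power iteration: repeatedly applying $L$ makes the component along the eigenvector of the largest eigenvalue dominate. The sketch below assumes that each node estimates $\lambda_n$ as the ratio of two consecutive values of its own entry; this particular estimator, the function names, and the unweighted three-node path graph (whose Laplacian eigenvalues are 0, 1 and 3) are illustrative assumptions and not necessarily the exact expression used in the lemma.

```python
from typing import List


def mat_vec(L: List[List[float]], y: List[float]) -> List[float]:
    """Matrix-vector product; row i only touches node i and its neighbours."""
    return [sum(L[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]


def estimate_lambda_max(
    L: List[List[float]], y0: List[float], t0: int
) -> List[float]:
    """Run y(t+1) = L y(t) for t0 steps and return, per node, the ratio
    [y(t0+1)]_i / [y(t0)]_i as that node's estimate of the largest eigenvalue."""
    y = list(y0)
    for _ in range(t0):
        y = mat_vec(L, y)
    y_next = mat_vec(L, y)
    return [y_next[i] / y[i] for i in range(len(y))]


# Unweighted path graph 0 - 1 - 2 (all node and edge weights equal to 1).
L_path = [
    [1.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 1.0],
]
estimates = estimate_lambda_max(L_path, [1.0, 0.0, 0.0], t0=30)
```

The iterates grow geometrically, so a practical implementation would rescale them periodically; the sketch omits this to keep the update identical to a plain linear iteration.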
It should be mentioned that the distributed computation of $u_{k+1}$ can be found in [28], but only for an edge-weighted graph, that is, for the particular case in which $\delta_i = 1$ for all $i \in V$.
We finish this section by describing Algorithm 1. For ease of notation, we denote by $f(x, t)$ the $t$-th iteration of (1), which can clearly be computed in a distributed way. As for Algorithm 1, we fix $t_0$ to be the number of iterations of (1) required for a desired precision. Table 1 describes Algorithm 1 and relates it with the theoretical aspects shown in this section. Observe that Algorithm 1 separates the nodes of the graph into two clusters. However, if the algorithm is used recursively within each cluster, we can separate the nodes of the original graph into as many clusters as desired.
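The two stages described in this section can be sketched end to end as follows. This is a centralized simulation under stated assumptions, not the authors' Algorithm 1: the Laplacian uses the assumed symmetric form $L_{i,i} = \sum_j \sigma_{i,j}/\delta_i$, $L_{i,j} = -\sigma_{i,j}/\sqrt{\delta_i \delta_j}$; $\lambda_n$ is estimated with a normalized power iteration; the components with eigenvalue zero are removed by taking the difference of two consecutive iterates; and each node picks its cluster from the sign of its own entry. All names and the six-node example are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def weighted_laplacian(delta: List[float], sigma: Dict[Tuple[int, int], float]):
    n = len(delta)
    L = [[0.0] * n for _ in range(n)]
    for (i, j), s in sigma.items():
        L[i][i] += s / delta[i]
        L[j][j] += s / delta[j]
        L[i][j] = L[j][i] = -s / math.sqrt(delta[i] * delta[j])
    return L


def mat_vec(L, x):
    return [sum(L[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]


def largest_eigenvalue(L, y, t0: int) -> float:
    """Normalized power iteration on L (the norm is a centralized shortcut)."""
    lam = 0.0
    for _ in range(t0):
        y = mat_vec(L, y)
        lam = math.sqrt(sum(v * v for v in y))
        y = [v / lam for v in y]
    return lam


def two_clusters(L, x, y, t0: int) -> List[bool]:
    lam_n = largest_eigenvalue(L, y, t0)
    for _ in range(t0):  # x(t+1) = (I - L / lam_n) x(t)
        Lx = mat_vec(L, x)
        x = [xi - li / lam_n for xi, li in zip(x, Lx)]
    # x(t0+1) - x(t0) = -L x(t0) / lam_n has no eigenvalue-zero component.
    diff = [-li / lam_n for li in mat_vec(L, x)]
    return [d > 0 for d in diff]  # each node only inspects its own sign


# Two triangles joined by one short contact; the second group has higher risk.
delta = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
sigma = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0, (2, 3): 0.1}
L = weighted_laplacian(delta, sigma)
labels = two_clusters(L, [0.3, -0.7, 0.5, 0.9, -0.2, 0.4],
                      [1.0, 0.5, -0.75, 0.8, -0.3, 0.6], t0=200)
```

In this example nodes 0 to 2 end up with one label and nodes 3 to 5 with the other. Running the same procedure again inside each cluster would give the recursive refinement mentioned above.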

B. CONVERGENCE SPEED
In this subsection we study the convergence speed of the proposed algorithm. Specifically, we show that the convergence of the sequences considered in Theorem 1 and Lemma 1 is linear. We recall that the convergence of a sequence $a_0, a_1, a_2, \ldots$, which converges to $\ell$, is said to be linear if the limit $\lim_{t \to \infty} \frac{|a_{t+1} - \ell|}{|a_t - \ell|}$ is a nonzero constant (see [38, p. 224]).
The following theorem deals with the convergence speed of the sequence considered in Theorem 1.
Theorem 2: Let $x(t)$ be as in Theorem 1. Then, the convergence of the sequence is linear for all $i \in V$.
Since the convergence of the sequence considered in Lemma 1 is also linear (see [38, Section 5.8.1]), we conclude that the overall convergence of Algorithm 1 is linear.
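Linear convergence can be observed numerically on a small case. The sketch below runs a power iteration on the Laplacian of an unweighted three-node path graph, whose eigenvalues are 0, 1 and 3, and tracks the error of the eigenvalue estimate seen by node 0. The per-node ratio estimator and all names are assumptions made for illustration.

```python
from typing import List


def mat_vec(L: List[List[float]], y: List[float]) -> List[float]:
    return [sum(L[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]


# Unweighted path graph 0 - 1 - 2; its largest Laplacian eigenvalue is 3.
L_path = [
    [1.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 1.0],
]
LIMIT = 3.0

y = [1.0, 0.0, 0.0]
estimates = []
for _ in range(12):
    y_next = mat_vec(L_path, y)
    estimates.append(y_next[0] / y[0])  # node 0's current estimate a_t
    y = y_next

# Ratio of consecutive errors |a_{t+1} - limit| / |a_t - limit|.
errors = [abs(a - LIMIT) for a in estimates]
ratios = [errors[t + 1] / errors[t] for t in range(len(errors) - 1)]
```

The ratios settle near 1/3, the quotient of the two largest eigenvalues (1 and 3), so the error shrinks by a roughly constant factor per iteration, which is the linear behaviour defined above.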

C. COMPUTATIONAL COMPLEXITY
The computational bottleneck in spectral clustering is the computation of the eigenvectors of the Laplacian matrix. To speed up the computation of such eigenvectors, the power iteration method is usually used [40].
In this subsection we study the computational complexity of Algorithm 1 for each node. The computational complexity of Algorithm 1 is essentially determined by the complexity of running the power iteration method twice. In particular, the power iteration method is used to compute the largest eigenvalue of $L$ (see line 11 of Algorithm 1) and to compute an eigenvector associated to the largest eigenvalue less than one of $I_n - \frac{1}{\lambda_n} L$ (see lines 21-22 of Algorithm 1). The power iteration method is computationally expensive for large matrices, but $L$ and $I_n - \frac{1}{\lambda_n} L$ are sparse matrices with only a few non-zero entries. This reduces the computational difficulties, as subsequently explained.
Let $c_i$ be the number of contacts the $i$-th individual has. It is important to remark that $c_i$ does not depend on $n$. Consequently, regardless of the value of $n$, the $i$-th row of $L$ will have at most $c_i + 1$ non-zero entries. Therefore, the computation of $[f(y, t_0)]_i$ needed in line 11 requires no more than $t_0 (c_i + 1)$ multiplications (see Equation (1)). Similarly, the computation of $[f(x, t_0)]_i$ needed in lines 21-22 requires no more than $t_0 (c_i + 1)$ multiplications.
Observe that $t_0$ controls the precision of the power iteration method and is usually not larger than 100 even for a very large $n$. Moreover, in [41] it is shown that even if $n$ increases, $t_0$ does not need to increase faster than $O(\log n)$ to keep the same precision. Consequently, in the worst-case scenario, the computational complexity of Algorithm 1 for each node is $O(\log n)$, which makes it suitable for a large $n$.
Finally, observe that, regarding the memory usage of the algorithm, node $i$ only needs to store $c_i + 1$ values (the $i$-th row of $L$) and therefore the storage requirement of each node does not increase with $n$.
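The per-node cost can be checked with a small counting experiment. The sketch below simulates the iteration on an unweighted ring, where every individual has $c_i = 2$ contacts, stores each row of $L$ as a dictionary holding only its $c_i + 1$ non-zero entries, and counts the multiplications performed by one node. The ring topology and the helper names are assumptions chosen only to keep $c_i$ fixed while $n$ varies.

```python
from typing import Dict, List


def ring_rows(n: int) -> List[Dict[int, float]]:
    """Sparse rows of the Laplacian of an unweighted ring (n >= 3):
    node i stores only its own entry and those of its two neighbours."""
    return [{i: 2.0, (i - 1) % n: -1.0, (i + 1) % n: -1.0} for i in range(n)]


def multiplications_at_node(n: int, t0: int, node: int = 0) -> int:
    """Simulate t0 iterations and count the multiplications done by `node`."""
    rows = ring_rows(n)
    x = [float(i % 3) for i in range(n)]  # arbitrary initial values
    count = 0
    for _ in range(t0):
        new_x = []
        for i, row in enumerate(rows):
            new_x.append(sum(w * x[j] for j, w in row.items()))
            if i == node:
                count += len(row)  # one multiplication per stored entry
        x = new_x
    return count


small = multiplications_at_node(n=10, t0=20)
large = multiplications_at_node(n=1000, t0=20)
```

Both counts equal $t_0 (c_i + 1) = 20 \cdot 3 = 60$, so the work done by a single node does not grow with the population size for a fixed $t_0$.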

IV. ILLUSTRATIVE EXAMPLES
In this section we present two examples to illustrate how Algorithm 1 works.

A. EXAMPLE WITH RANDOMLY GENERATED DATA
In this example, we randomly generate a graph $G$ that models a set of $n = 20$ individuals and their interactions. We consider two scenarios. In Scenario 1 (see Figure 1a), we assume that there is no information available about the risk level of severe illness from COVID-19 of each individual, nor about the time frames of their social contacts. Hence, we fix the weight of node $i$, $\delta_i = 1$, for all $i \in V$. We also assume that all the social contacts have equal time frames and therefore we fix the weight of the edge $\{i, j\}$, $\sigma_{i,j} = 1$, for all $\{i, j\} \in E$. In Scenario 2 (see Figure 2a), we consider the same graph $G$, yet we assume that there is information available about the individual's risk level and time frames of the social contacts. Such information is randomly generated both for the nodes and for the edges. In particular, all the weights are drawn from a uniform distribution between 0 and 1. Figures 1b and 2b show the two clusters created by a single run of Algorithm 1 for Scenario 1 and Scenario 2, respectively.
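The input data for the two scenarios could be generated along the following lines. The text does not state how the random graph was drawn, so the edge probability `p = 0.2`, the seed, and the helper name `random_edges` are assumptions; weights are drawn as `1 - random()` so that they fall in the interval (0, 1] and stay strictly positive.

```python
import random
from typing import List, Tuple


def random_edges(n: int, p: float, rng: random.Random) -> List[Tuple[int, int]]:
    """Include each possible undirected edge {i, j} independently with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


rng = random.Random(2021)  # fixed seed so the example is reproducible
n = 20
edges = random_edges(n, 0.2, rng)

# Scenario 1: no information available, so every weight is set to 1.
delta_1 = [1.0] * n
sigma_1 = {edge: 1.0 for edge in edges}

# Scenario 2: same graph, weights drawn uniformly from (0, 1].
delta_2 = [1.0 - rng.random() for _ in range(n)]
sigma_2 = {edge: 1.0 - rng.random() for edge in edges}
```

Both scenarios share the same edge set, so any difference between the resulting clusters comes from the weights alone.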
Observe that the algorithm does not strictly separate the higher and the lower risk individuals. The clusters made by our algorithm depend on the risk of severe illness but also on the social interaction among individuals.

B. EXAMPLE WITH REAL DATA
In this example, we use data from the CoMix study [42] to generate a doubly-weighted graph $G$ that models a set of $n = 35$ individuals. This study follows households all over Europe, collecting information about their behavioural patterns, measures, and proximity contacts, and how these have varied over time during the course of the COVID-19 pandemic. These results are published for an easier assessment of the spread of the virus, and they maintain the anonymity of the participants. For this example, CoMix social contact data from Spain were used [43].
From these data, n random participants are selected. CoMix social contact data provides for each participant their number of contacts and the time frame of such contacts. We have further assumed that all the contacts of the selected individuals are within the considered population. We fix the weights of the nodes and the weights of the edges using the information provided by CoMix social contact data as shown in Tables 2 and 3, respectively. Figure 3b shows the 2 clusters created by a single run of Algorithm 1 for the considered example.

V. CONCLUSION
In this paper, we have presented a distributed clustering algorithm that groups individuals according to their social contacts and the risk level of severe illness from COVID-19. Once the clusters are made, using a distributed consensus algorithm in each cluster, each individual can know the epidemiological situation of the group they belong to. Such knowledge allows them to take the social distancing measures that correspond to the epidemiological situation of their group. By using this algorithm, the social distancing measures would only affect groups with high risk of infection instead of entire geographical regions, thus reducing the economic damage. The algorithm is designed so that individuals could know which group they belong to without having knowledge of the rest of the members of the group. Furthermore, there is no central entity with information about the groups because the algorithm only runs at a local level. Groups are created taking into account social contacts and the risk level of severe illness. Since social contacts change continuously and the risk level of severe illness also changes with the vaccination progress, our adaptive algorithm enables the creation of groups according to the information available at the time it is run. Finally, after the computational complexity analysis, we have concluded that our algorithm is sublinear with respect to the population size, which makes it very efficient.