Accurate and Privacy-Preserving Person Localization Using Federated-Learning and the Camera Surveillance Systems of Public Places

In this paper, we propose an accurate and privacy-preserving scheme that enables a law enforcement agency to locate persons of interest using the camera surveillance systems of public places. Compared to the existing schemes that measure the Euclidean distance between embedding vectors storing facial features to locate persons, we use a more accurate approach by training a machine learning model. Moreover, to avoid leaking sensitive information by sharing the images of the public places’ visitors to train the model, we use a federated learning technique to compute the model in a privacy-preserving way. The model is designed so that it can be executed efficiently over encrypted data. Specifically, the model is executed by three parties as follows. Each public place computes an embedding vector for each visitor’s image, inputs it to a neural network, encrypts the output using a modified inner product encryption scheme, and sends the ciphertext to a cloud server. The law enforcement agency performs the same steps on the images of persons of interest. Finally, the server uses these ciphertexts to evaluate the last layer of the model by computing the inner product of the two vectors over encrypted data. The cryptosystem enables the server to compute the inner product of two vectors using their ciphertexts without being able to learn the vectors. We have modified a cryptosystem that is designed for a single public place and a single law enforcement agency to make it more efficient in our application, which has multiple public places. To evaluate our scheme, we have conducted extensive experiments, and the results confirm that our model is accurate in locating persons of interest with low communication and computation overhead. A formal proof and analysis are used to demonstrate the ability of our scheme to preserve privacy.


I. INTRODUCTION
Person localization is important for several applications, such as locating persons of interest by a law enforcement agency. Face recognition is the most popular technology for person identification because it is non-invasive and does not need the cooperation of the persons, compared to other identification technologies such as iris and fingerprint recognition [1]. Recently, deep-learning-based face recognition approaches, such as [2], [3], [4], [5], [6], [7], [8], [9], [10], have been used. In these approaches, machine learning models are trained on a massive amount of data, which allows them to learn the characteristics of expression, illumination, and angle. These approaches first localize the area of the face in the input image and then determine the locations of the landmarks in the face to produce an embedding vector which represents the face's landmark points of the brows, mouth, eyes, jawline, and nose. Finally, for person identification, the distances between an embedding vector and the embedding vectors of known persons stored in a database are measured (e.g., using the Euclidean distance), and two vectors are deemed to be for the same person if the distance is below a predefined threshold [11], [12], [13]. However, this simple approach does not give accurate results, especially when the images are taken by different cameras. In this paper, we conduct experiments to evaluate the accuracy of using the Euclidean distance to match the embedding vectors of images taken by different cameras. Our results indicate that this approach is not accurate and that it is hard to find one threshold that gives good results for the images of all sources.
The associate editor coordinating the review of this manuscript and approving it for publication was Jerry Chun-Wei Lin.
Moreover, the closed-circuit television (CCTV) surveillance systems are currently used in almost all public places. Using face recognition technology with the CCTV systems of public places to localize persons of interest is a low-cost and effective approach to fight crime by locating wanted or suspected individuals. However, the use of face recognition technology with the CCTV systems raises serious privacy concerns [14], [15] because the system can be misused to monitor people's daily activities by collecting information on the locations visited by them. The recent privacy breaches in several systems made the public worried about their privacy. Examples of these breaches include exposing the personal information (such as face photos, addresses, age, etc.) of millions of people collected by major Chinese surveillance service providers [16], [17], and charging Facebook $550 million for collecting facial data without authorization [18]. Due to privacy concerns, some legislators proposed bills to ensure that the existing systems preserve the privacy of the people [19], [20], [21].
In this paper, we investigate an efficient and accurate person localization scheme with privacy preservation using federated learning and the camera surveillance systems of public places. With the proposed scheme, a law enforcement entity can locate persons of interest using the surveillance cameras of the public places without revealing the images of the visitors or the persons of interest to preserve privacy. Unlike most of the existing techniques that measure the distance between two embedding vectors and compare the result to a predefined threshold to determine whether the vectors are for the same person, we use a more accurate technique by training a machine learning model using federated learning where the inputs of the model include two embedding vectors and output is either zero or one to indicate whether the two vectors are for the same person or not. The idea is that, instead of using a simple threshold to determine whether two vectors are for the same person, a machine learning model can make accurate decisions because it can learn the characteristics of the embedding vectors of the same persons.
To train the model, we have created a dataset where each sample has two embedding vectors and a label which is one in case that the two vectors are for the same person and zero otherwise. Then, to preserve privacy, we investigate an efficient cryptosystem to enable a cloud server to evaluate the model over encrypted data and report the visited locations to the law enforcement agency without being able to access the images or the embedding vectors of the visitors and the persons of interest. The architecture of the machine learning model is determined in such a way that makes executing it over encrypted data efficient. Specifically, the model is executed by three parties as follows. Each public place computes an embedding vector for each visitor's image and inputs it to a neural network (a part of the model) and encrypts the output using a modified inner product encryption scheme and sends the ciphertext to a cloud server. The law enforcement agency does the same steps on the images of persons of interest. Finally, the server computes the inner product of the two vectors over encrypted data and executes the last layer in the model to determine whether the two vectors are for the same person without learning the images or the embedding vectors of the persons of interest or the visitors of the public places.
We have modified the cryptosystem in [22] that computes the inner product of two vectors using their ciphertexts. This cryptosystem is designed to be run by two parties (a single public place and a single law enforcement agency in our application) using a pairwise key, so we have modified it to be more efficient in our application, which has multiple public places, so that each public place and the law enforcement agency use only one key for encryption. Six datasets are used to evaluate our proposal. The results demonstrate that our model achieves higher localization accuracy than the Euclidean distance when several camera surveillance systems with different image quality are used. Our evaluations also demonstrate that the computation and communication overhead of our scheme is acceptable. A formal proof and analysis are used to demonstrate the ability of our scheme to preserve privacy.
To the best of our knowledge, this is the first work that uses a combination of a deep learning model, federated learning, and an efficient cryptosystem to evaluate the model using encrypted data to create an accurate and privacy-preserving person localization scheme for multiple camera surveillance systems. Specifically, this paper makes the following contributions:
• Most of the existing techniques depend on a simple approach that measures the distance between two embedding vectors and compares the result to a threshold value to determine whether the vectors are for the same person. In this paper, we use a more accurate approach that is suitable for several camera surveillance systems using a pre-trained machine learning model.
• We evaluate our model over encrypted data to preserve the privacy of the visitors of the public places and the persons of interest. To do that efficiently, the model is executed by three parties, where layers of the model are executed using plaintext data at the public places and the law enforcement agency, and only the last layer is executed over encrypted data by the server. We have also modified the inner product cryptosystem in [22] to make it more efficient in our application.
• We use federated learning to train our model on the data of different camera surveillance systems with varying image quality without revealing the data to preserve privacy.
• To evaluate the privacy preservation capability of our scheme, we use a formal proof and analysis, and to evaluate the accuracy and the communication/computation overhead of our scheme, we have conducted extensive experiments.
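The federated training mentioned in the contributions above can be illustrated with a weighted parameter average in the style of federated averaging. This is a minimal sketch under stated assumptions: the aggregation rule, the function name `federated_average`, the layer shapes, and the sample counts are all hypothetical and are not taken from the paper. The point it shows is that only model parameters, never images or embedding vectors, leave each public place.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Average per-client parameter lists, weighted by local dataset size.

    Assumption: a FedAvg-style rule; the paper does not specify its aggregation.
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    averaged = []
    for layer in range(n_layers):
        acc = np.zeros_like(client_weights[0][layer], dtype=float)
        for weights, size in zip(client_weights, client_sizes):
            acc += (size / total) * weights[layer]
        averaged.append(acc)
    return averaged

# Hypothetical example: three public places with different amounts of local data.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(4, 3)), rng.normal(size=3)] for _ in range(3)]
sizes = [100, 300, 600]
global_weights = federated_average(clients, sizes)
```

In each round of such a scheme, the averaged parameters would be sent back to the clients for further local training.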
We organize the rest of this paper as follows. The related works are discussed in Section II. The system models including the network and threat models and the important requirements that should be achieved by our scheme are discussed in Section III. Section IV explains our scheme. The results of the evaluations are discussed in Section VI. Finally, Section VII draws the conclusions.

II. RELATED WORK
Content-based image retrieval and face-recognition based authentication schemes are the closest research works to this paper. In this section, we first explain these schemes, and then discuss the research gap and our motivations.

A. CONTENT-BASED IMAGE RETRIEVAL
In image retrieval applications, large image datasets are outsourced to a cloud server, and an image is sent to the server to search for similar images and return them. To ensure efficiency, especially in the case of large image datasets, instead of sending an image, the features of the query image are sent to the server, which matches them to the features of the stored images and returns the images with close features. Image retrieval approaches are needed in many applications. Examples of these applications include medical diagnosis [23] and searching for similar clothes online [24].
Since revealing the images or their features to the cloud server may raise privacy concerns in some applications, various privacy-preserving image retrieval schemes have been proposed [11], [12], [25], [26]. In these schemes, the cloud server stores the encrypted features of the images, and it receives a query containing the features of an image of interest. Then, it searches the stored images to find the closest image to the query, i.e., the image that has close features to the queried image, without being able to learn the stored images, the queried image, or their features.
In [11], a privacy-preserving hierarchical image retrieval system, called CASHEIRS, is proposed. CASHEIRS aims to address two main issues. The first issue is the low image retrieval accuracy and the long time needed to search all stored images. The second issue is the privacy concerns raised when the images contain sensitive information. For efficient search, CASHEIRS clusters the images stored by the server and builds a hierarchical index tree, which allows searching over subsets of categories rather than the whole set. To improve the image retrieval accuracy, CASHEIRS uses a Convolutional Neural Network (CNN) to extract the features of the images. To preserve privacy, the features of the images are encrypted, while the server can still measure the similarity score of the features without decrypting the ciphertexts or learning the features.
The proposed scheme in [25] considers two privacy threats: a cloud server that aims to infer sensitive information about the images, and a dishonest query user who illegally distributes the images he retrieved from the server. To protect against the first threat, the feature vectors are encrypted by the kNN searchable encryption algorithm. For the second threat, a watermark-based protocol is developed to deter the distribution of images. The cloud server uses this protocol to embed a unique watermark into each encrypted image retrieved by the user. When an illegal copy of the image is found, the watermark can be extracted and the user can be traced.
In [12], a large-scale content-based image retrieval scheme is proposed. Two different layers are used to preserve privacy. The first layer uses hash values for queries to hide the features because hash functions are one way. The server returns the hash values of all possible candidates and the user selects the best match for his query. In the second layer, the user deletes some bits in a hash value to make it computationally difficult for the server to learn the interest of the user. The paper also introduces the concept of tunable privacy, where the privacy protection level can be adjusted by dividing a feature vector into subsets and indexing every subset with a hash value which is associated with an inverted index list.
The existing content-based image retrieval schemes retrieve images based on the similarity of their visual features, but to improve the retrieval accuracy, an interactive mechanism, namely relevance feedback, is integrated with these schemes to retrieve images based on both visual features and semantic concepts. The research work in [26] proposes a privacy-preserving relevance-feedback image retrieval scheme. The scheme has three main stages: private query, private feedback, and local retrieval. The initial query with a privacy-controllable feature vector is conducted in the private query stage, and the private feedback stage introduces confusing classes that adhere to K-anonymity in the creation of the feedback image set to preserve privacy. Finally, in the local retrieval stage, images are ranked at the user side.

B. FACE RECOGNITION-BASED AUTHENTICATION
The use of face recognition in biometrics-based authentication is an interesting approach because it is non-invasive and does not need the cooperation of users for taking their face images compared to other biometrics-based approaches that use iris and fingerprint. In the literature, several privacy-preserving face recognition-based authentication schemes have been proposed [13], [27], [28], [29].
In [27], an efficient and privacy-preserving face representation scheme that can be used for authentication by IoT devices is proposed. The scheme can satisfy the resource limits of the IoT devices. A Bloom filter is used to ensure the privacy of the scheme. The idea is that the face data is stored in a Bloom filter, which can be analyzed to perform classification operations. To preserve privacy, no raw face data are stored by distrusted servers, which store only Bloom filter representations.
In [28], a privacy-preserving identity authentication scheme based on face recognition technology is proposed. A CNN model is used to extract the facial features from face images. To preserve privacy, the feature vectors are encrypted using a nearest neighbor approach, and to match images and identify persons, the cosine similarity is computed over the encrypted vectors. Moreover, for high authentication efficiency, the paper adopts edge computing, where some operations are transferred from the cloud to the edge nodes.
In [29], an efficient privacy-preserving face identification system is proposed. For indexing and retrieval of faces, a hash generation scheme based on product quantization is developed for computing hash codes from faces and creating a hash look-up table. Fully homomorphic encryption (FHE) is used to encrypt the face templates to preserve privacy. For authentication, face hashes are used for fast retrieval, i.e., returning a short list of candidates. Two main approaches are used to ensure efficiency. First, the use of a look-up table avoids a one-to-many search, since the search results are obtained directly. Second, FHE-based comparisons are executed for only a small fraction of the facial references.
In [13], an efficient privacy-preserving person reidentification scheme is proposed. To extract the persons' features, convolutional neural network (CNN) and kernels based supervised hashing (KSH) are used. To calculate the similarity of the images' features by the cloud server, a secret sharing based Hamming distance computation protocol is developed. Moreover, to allow users to validate the correctness of the matching results, a dual Merkle hash trees based verification is developed.

C. RESEARCH GAP AND MOTIVATIONS
Most of the existing papers in the literature measure the distance between two embedding vectors and use a threshold to decide whether the vectors are for the same image. However, this approach may not give good results when the data is large and obtained from different camera sources with different image quality and resolutions. Our experiments confirm this, and the results are consistent with other works such as [30], [31], [32], which confirm that the Euclidean distance metric is not preferable for high-dimensional data mining applications. To address this issue, we propose a more accurate approach for multiple-camera surveillance systems using a pre-trained machine learning model. The model is designed so that it can be evaluated efficiently over encrypted data to preserve privacy. We also train the model using federated learning to avoid sharing the images of the visitors and the persons of interest, and thus preserve privacy. Moreover, most of the existing encryption schemes that are used in the literature to match the images' vectors over encrypted data are designed for one data source (i.e., a single public place), and thus, it is inefficient to use them in a multi-data-source system (i.e., multiple public places). To address this issue, we have modified a cryptosystem [22] that performs inner product operations over encrypted data to suit a multi-data-source system, where one vector can be encrypted by any of multiple public places, each using a different key, and the other vector is encrypted by a single entity (the law enforcement agency).
To the best of our knowledge, this is the first work that uses a combination of a deep learning model, federated learning, and efficient cryptosystem to evaluate the model using encrypted data to create an accurate and privacy-preserving person localization for multi-camera surveillance system.

III. SYSTEM MODELS AND DESIGN GOALS
In this section, we first discuss the network and threat models considered in this paper, and then, we discuss the design goals that should be realized in our scheme.
A. NETWORK MODEL
Figure 1 depicts the network model considered in this paper. It can be seen that the model has four main parties: an offline key distribution center (KDC), a law enforcement agency, public places, and a cloud server. The role of each party and the communication between them are explained as follows.
• Offline Key Distribution Center (KDC): The KDC distributes the secret keys needed to execute our scheme to the different parties in the system. This process is offline in the sense that once the KDC distributes the keys, it is not involved in the execution of the scheme.
• Law Enforcement Agency: This agency is the entity that has images of persons of interest, and it needs to know the locations visited by these persons without knowing the images or the embedding vectors of the public places' visitors. For each image, it uses machine learning models to compute the embedding vector of the facial features of the person of interest, and then inputs the vector to a neural network (a part of the machine learning model we propose in this paper), encrypts the output of the network with an inner product encryption cryptosystem, and sends the ciphertext (called a trapdoor) to the cloud server. The details of the neural networks and the cryptosystem are discussed in Section IV.
• Public Places: Examples of public places include grocery stores, banks, gyms, gas stations, etc. Surveillance cameras are installed at the public places. The cameras take pictures of the visitors, and machine learning models are used to compute embedding vectors containing the visitors' facial features. Then, each public place passes each visitor's vector to a neural network (a part of the machine learning model we propose in this paper), encrypts the output of the network with an inner product encryption cryptosystem, and sends the ciphertext (called an index) to the cloud server. The details of the neural networks and the cryptosystem are discussed in Section IV.
• Cloud Server: The cloud server is an independent entity that is managed and operated by a third party. Using the indices and trapdoors sent by the public places and the law enforcement agency, the server computes the output of our model to learn the locations visited by the persons of interest and communicate this information to the law enforcement agency without knowing the images or the feature vectors of the visitors or the persons of interest.

B. THREAT MODEL
The attackers can be external eavesdroppers or internal entities such as the cloud server, the law enforcement agency, and the public places. The attackers can eavesdrop on all communications in the system. The paper focuses on the honest-but-curious threat model, where attackers do not want to disrupt the system, but they want to infer sensitive information including the images and the embedding vectors of the visitors and the persons of interest.

C. DESIGN GOALS
We aim to achieve the following important requirements in our scheme.
1) Privacy Preservation. Our scheme should enable the law enforcement agency to locate persons of interest while preserving the privacy of the public places' visitors, i.e., attackers should not be able to identify the visitors by revealing their images or feature vectors.
2) Accurate Localization. The accuracy of the person localization should be high under the setting of different public places' camera surveillance systems. To do that, instead of using a simple approach that measures the distance between two embedding vectors and uses a threshold to determine whether the two vectors are for the same person, we train a machine learning model that can better learn the characteristics of the embedding vectors of the same persons to make accurate decisions.
3) Scalability and Efficiency. The system must scale to many public places and visitors, so our scheme should have low communication and computation overhead, and the server should be able to quickly compute the output of the model using the indices of the visitors and the trapdoors of the persons of interest. To achieve this requirement, we modify an inner product encryption cryptosystem that requires a pairwise shared key between each public place and the law enforcement agency (single-single setting) so that the law enforcement agency uses only one key and computes only one trapdoor for each person of interest, while this trapdoor can be matched to the indices computed with different keys by the public places.

IV. PROPOSED SYSTEM
In a typical face recognition system, the facial features within an image are encoded as a set of real numbers called an embedding vector. The embedding vector of a person of interest is compared against a set of candidate embedding vectors, and a hit is reported if the distance between two vectors (e.g., the Euclidean distance) is below a predefined threshold value. This paper, instead, uses a deep-learning model to decide whether two input vectors are for the same person. In the case of multiple camera surveillance systems, we will demonstrate that this approach is more accurate than conventional techniques. Our scheme executes three main stages: generation of embedding vectors, training of a similarity check model, and encryption and localization.
In the first stage, the law enforcement agency and the public places encode the facial features of the image of each person of interest and visitor into an embedding vector and then encrypt these vectors. A machine learning model is used to locate the face in the input image, and then another model is used to estimate the locations of the face landmarks. Specifically, a face detection model is used to locate the image areas containing faces. Because of its superior results in similar tasks, we use a pre-trained Convolutional Neural Network (CNN) sliding window model, called Dlib [33]. Dlib takes into account cases where a person's face might change depending on his/her posture and emotion. After face detection, face landmark localization is used to locate the important features of the face. A set of 68 landmark points on the human face is used to define these features. The points cover the mouth, right and left eyes, right and left eyebrows, jawline, and nose. The coordinates of the 68 landmark points are used to generate a 128-dimensional real-valued embedding vector that encodes the input face. The details of this stage will be discussed further in subsection IV-A.
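The first stage can be sketched with the Python bindings of Dlib. This is an illustrative sketch only: the function name `embed_faces` and its parameters are hypothetical, the three model files must be supplied by the caller (the pre-trained CNN face detector, the 68-point shape predictor, and the face recognition network distributed with Dlib are assumed), and the paper does not state that it uses exactly these calls.

```python
import numpy as np

def embed_faces(image_path, detector_path, predictor_path, encoder_path):
    """Return one 128-d embedding per face found in the image.

    All four paths are caller-supplied; the model files are assumed to be
    Dlib's pre-trained detector, 68-point predictor, and recognition network.
    """
    import dlib  # imported lazily so this module loads without dlib installed

    detector = dlib.cnn_face_detection_model_v1(detector_path)
    predictor = dlib.shape_predictor(predictor_path)
    encoder = dlib.face_recognition_model_v1(encoder_path)

    image = dlib.load_rgb_image(image_path)
    embeddings = []
    for detection in detector(image, 1):  # 1 = upsample the image once
        shape = predictor(image, detection.rect)  # landmark points of this face
        descriptor = encoder.compute_face_descriptor(image, shape)
        embeddings.append(np.array(descriptor))  # 128-d real-valued vector
    return embeddings
```

Each returned vector would then be passed to the party's branch of the similarity model before encryption.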
In the second stage, a federated deep-learning model is computed to decide whether two embedding vectors are for the same person instead of depending on threshold-based distance metrics. To train the model, a dataset that resembles the images from various public places and a law enforcement agency is created. Each row in the dataset contains two embedding vectors and a label, where one vector is from the dataset of the public places and the other vector is from the dataset of the law enforcement agency. The label indicates whether the two vectors are for different persons (i.e., belong to the negative class) or for the same person (i.e., belong to the positive class). The model is designed in such a way that makes it efficient to be evaluated by the cloud server using encrypted vectors. To do that, each of the two input vectors goes through a distinct set of learning layers. This part of the model is executed by the law enforcement agency and public places. Then, the output vectors of these layers are multiplied by the cloud server (over encrypted data) to compute the classification of the model, i.e., whether the two vectors are for the same person. This stage is demonstrated in detail in subsection IV-B.
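The split of the model described above, with one branch per party and a final inner-product layer at the server, can be sketched in plaintext as follows. This is a minimal sketch under stated assumptions: the layer sizes, the ReLU activation, the sigmoid output, and every name (`init_branch`, `branch_forward`, `last_layer`) are hypothetical, the weights are random rather than trained, and no encryption is applied here.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_branch(in_dim, hidden_dim, out_dim):
    """Randomly initialised two-layer branch (untrained, for illustration only)."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def branch_forward(params, x):
    """Layers run locally, on plaintext, by a public place or by the agency."""
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU (assumed)
    return hidden @ params["W2"] + params["b2"]

def last_layer(u, v, scale=1.0, bias=0.0):
    """Server-side step: inner product of the branch outputs, then a sigmoid (assumed)."""
    score = scale * np.dot(u, v) + bias
    return 1.0 / (1.0 + np.exp(-score))

place_branch = init_branch(128, 64, 32)    # branch for visitors' vectors
agency_branch = init_branch(128, 64, 32)   # distinct branch for persons of interest

visitor = rng.normal(size=128)   # stand-in for a visitor's embedding vector
suspect = rng.normal(size=128)   # stand-in for a person of interest's embedding vector

u = branch_forward(place_branch, visitor)
v = branch_forward(agency_branch, suspect)
p_same = last_layer(u, v)
is_same = int(p_same >= 0.5)     # 1 = same person, 0 = different persons
```

Because the only operation that combines the two inputs is an inner product, the server needs nothing more than an encryption scheme that supports inner products.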
Lastly, in the third stage, we investigate an efficient cryptosystem to enable the cloud server to determine the locations visited by a person of interest by executing the last layer in the model over the encrypted data without inferring the vectors or the images of the persons of interest and the visitors of the public places. This stage will be explained in subsection IV-C.
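The property that the third stage relies on, namely that a server can obtain an inner product from two ciphertexts without seeing either vector, can be illustrated with a toy construction based on a secret invertible matrix. This is not the cryptosystem of [22] nor the modification proposed in this paper, it covers only a single key rather than the multi-place setting, and it offers no real security guarantees. All names (`make_index`, `make_trapdoor`, `server_inner_product`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# Toy secret key: a random Gaussian matrix is invertible with probability 1.
M = rng.normal(size=(dim, dim))
M_inv = np.linalg.inv(M)

def make_index(u):
    """Public-place side: hide the branch output u as M^T u."""
    return M.T @ u

def make_trapdoor(v):
    """Agency side: hide the branch output v as M^{-1} v."""
    return M_inv @ v

def server_inner_product(index, trapdoor):
    """Server side: (M^T u) . (M^{-1} v) = u^T M M^{-1} v = u . v"""
    return float(index @ trapdoor)

u = rng.normal(size=dim)
v = rng.normal(size=dim)
result = server_inner_product(make_index(u), make_trapdoor(v))
```

The server holds only the transformed vectors, yet `result` matches the plaintext inner product up to floating-point error, which is the quantity the last layer of the model needs.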

A. GENERATION OF EMBEDDING VECTORS
This subsection describes the generation of the embedding vectors from the images of persons' faces. Generating an embedding vector requires three stages: face detection, face landmark localization, and embedding vector computation. The details of these stages are as follows.
Face Detection: This step locates the areas of human faces in an image [34]. Viola-Jones' face detection system is developed for low-cost cameras [35]. The system is an object detection framework that integrates concepts such as Haar-like features, integral images, and cascade classifiers to provide fast and accurate object detection system. Recently, more accurate and low-cost solutions were developed. Deep learning approaches such as [33], [36], [37], [38], [39] are regarded to have the best detection performance. As a result, we use Dlib [33], the pre-trained CNN sliding window model, because of its high performance. The input of the model is a window taken from the image of interest and the output indicates whether there is a face in the window or not. The model is trained on the iBUG 300-Faces-In-the-Wild landmark dataset [40], and it has three down-sampling layers, four convolution layers, and a feed-forward layer.
Landmarks Localization: The features of a person's face may change depending on his or her posture, lighting conditions, and facial expression. Consequently, even images of the same person can result in a low similarity score under different conditions. Thus, face landmark localization is adopted to extract a variation-independent set of features that can boost the similarity score under different conditions. A set of 68 landmark points that define the mouth, right and left eyebrows, nose, right and left eyes, and jawline is used as the variation-independent features. Table 1 gives the number of landmark points belonging to each facial region. Our approach relies on the Dlib [33] face detection and the face alignment technique proposed in [41]. Dlib is trained on a dataset of facial landmarks of different persons, where each landmark is defined by its (x, y) coordinates.
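For reference, the grouping of the 68 landmark points into facial regions can be written as a small lookup. The zero-based index ranges below follow the widely used iBUG 68-point annotation convention and are given here as an assumption; Table 1 of the paper remains the authoritative source. The names `LANDMARK_REGIONS` and `region_of` are hypothetical.

```python
# Zero-based, inclusive index ranges (assumed iBUG 68-point convention).
LANDMARK_REGIONS = {
    "jawline": (0, 16),
    "right_eyebrow": (17, 21),
    "left_eyebrow": (22, 26),
    "nose": (27, 35),
    "right_eye": (36, 41),
    "left_eye": (42, 47),
    "mouth": (48, 67),
}

def points_per_region():
    """Number of landmark points in each facial region."""
    return {name: end - start + 1 for name, (start, end) in LANDMARK_REGIONS.items()}

def region_of(index):
    """Name of the facial region that a landmark index belongs to."""
    for name, (start, end) in LANDMARK_REGIONS.items():
        if start <= index <= end:
            return name
    raise ValueError(f"landmark index {index} is outside 0..67")

counts = points_per_region()
```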
The intensity of the pixels at each landmark is used to train a set of t tree-based regression predictor functions $\{r_0, r_1, \ldots, r_{t-1}\}$. For an input image, each predictor function estimates the positions of the facial landmarks. The gradient boosting tree algorithm [40] is used to train the predictor functions on the iBUG 300-Faces-In-the-Wild landmark dataset.
Computation of Embedding Vector: This stage converts the detected landmark positions into an embedding vector that expresses the facial features. A deep learning model is used to generate the embedding vector where the input is three images called triplets. Two images of the triplets are for the same person but under different conditions. These two images are referred as the anchor, and the positive. The third image is called the negative input where a different random face is used.
The loss function used in the training has two components. The first component aims to reduce the difference between the vectors generated for the positive and the anchor inputs. The second component aims to increase the difference between the vectors generated for the negative and anchor inputs. The triplet loss function is defined over Pos, Neg, and An, which denote the positive, negative, and anchor inputs, respectively, together with γ, a small margin between the negative and positive inputs. The training optimization is described by the cost over a batch of M training triplets, and θ denotes the optimal parameters of the deep neural network.
The architecture of the network relies on the implementation of the ResNet34 network as discussed in [42]. However, fewer layers and filters are used to speed up the training. A dataset collected from a variety of sources is used for training [33]. The dataset size is approximately three million images. This network outperforms the existing image recognition approaches in accuracy [33].
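The two-component triplet loss described above can be sketched numerically. The form used here, a hinge on the difference of squared Euclidean distances plus a margin, is the commonly used triplet loss and is an assumption about the exact expression in the paper. The margin value 0.2, the summed batch cost, and the function names are likewise hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Per-triplet loss: pull the anchor towards the positive, push it from the negative."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)  # first component
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)  # second component
    return np.maximum(pos_dist - neg_dist + margin, 0.0)

def batch_cost(anchors, positives, negatives, margin=0.2):
    """Cost over a batch of M triplets (assumed here to be the sum of per-triplet losses)."""
    return float(np.sum(triplet_loss(anchors, positives, negatives, margin)))

# Toy embeddings: the first triplet is already well separated, the second is not.
anchors = np.array([[0.0, 0.0], [0.0, 0.0]])
positives = np.array([[0.0, 0.0], [0.0, 0.0]])
negatives = np.array([[1.0, 0.0], [0.0, 0.0]])
losses = triplet_loss(anchors, positives, negatives)
cost = batch_cost(anchors, positives, negatives)
```

A triplet contributes zero loss once the negative is farther from the anchor than the positive by at least the margin, so training effort concentrates on the triplets that are still confusable.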

B. TRAINING OF A SIMILARITY CHECK MODEL
Using a standard distance metric (e.g., Euclidean distance) to measure the similarity between embedding vectors is often not preferable because of its low performance in many practical applications [30]. Such metrics require a threshold on the maximum distance allowed between two embedding vectors for them to be deemed to belong to the same person. In addition, standard distance metrics depend heavily on the image source from which the vectors are computed.
As discussed in our network model, various kinds of surveillance cameras may be deployed in public places, with cameras of varying models, resolutions, and image quality. Thus, the optimal distance threshold for a dataset generated by one camera might not be optimal for another camera. To demonstrate this claim, we have used six publicly available datasets to conduct experiments that identify the optimal threshold of each dataset. The datasets include the IRIS Dataset [43], the Head Pose Image Dataset (HPID) [44], the Extended Yale Face Dataset B (EYaleB) [45], the FEI Face Database [46], the Georgia Tech Face Dataset (GTech) [44], and the Yale Face Dataset [47]. More details about the datasets are given in Section VI. Figure 3 gives the relation between the accuracy, in terms of the percentage of correctly identified images, and the threshold value for the six datasets. It can be seen that FEI's optimal threshold is around 0.1, which means that if the distance between two vectors is less than 0.1, they are deemed to belong to the same person. When the threshold value is below 0.1, the accuracy starts to degrade because the Euclidean distance between two vectors of the same person can exceed this low threshold, and thus they are deemed to belong to different persons. The accuracy also degrades when the threshold value is set greater than 0.1, because a person's image is mistakenly matched to other persons whose Euclidean distances are less than this large threshold value. The same conclusions can be drawn from the results of the other datasets, but with different optimal threshold values. Consequently, it is not possible to find a single threshold value that provides optimal performance for all datasets. To address this issue, we train a deep learning model that can accurately determine whether two embedding vectors belong to the same person instead of depending on threshold-based distance metrics.
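The dataset dependence of the threshold can be illustrated with a small sketch. The pairs, distances, and candidate thresholds below are invented toy values, not measurements from the six datasets, and the helper names are hypothetical. The sketch shows only the mechanism: the threshold that is best for one image source can perform at chance level on another.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def accuracy_at_threshold(pairs, threshold):
    """Fraction of pairs classified correctly.

    pairs: list of (vec_a, vec_b, same_person). A pair is predicted to be
    the same person when its distance is below the threshold.
    """
    correct = sum((euclidean(a, b) < threshold) == same for a, b, same in pairs)
    return correct / len(pairs)


def best_threshold(pairs, candidates):
    """Candidate threshold with the highest accuracy (first one on ties)."""
    return max(candidates, key=lambda t: accuracy_at_threshold(pairs, t))


# Toy 1-D "embeddings" from two hypothetical cameras with different scales.
pairs_a = [([0.0], [0.05], True), ([0.0], [0.08], True),
           ([0.0], [0.30], False), ([0.0], [0.35], False)]
pairs_b = [([0.0], [0.40], True), ([0.0], [0.45], True),
           ([0.0], [0.90], False), ([0.0], [0.95], False)]
candidates = [0.1, 0.2, 0.5, 0.7]

t_a = best_threshold(pairs_a, candidates)
t_b = best_threshold(pairs_b, candidates)
cross = accuracy_at_threshold(pairs_b, t_a)  # camera A's threshold on camera B
```

With these toy values the threshold chosen for the first camera rejects every same-person pair of the second camera, so half of its decisions are wrong.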
We design the model so that it can be evaluated efficiently over encrypted data to preserve privacy, as will be explained later.
The design of our deep learning model is shown in Figure 4. The model is executed by three parties as follows. The public place evaluates the first set of layers of the model using the embedding vector of each visitor, while the law enforcement agency evaluates the second set of layers using the embedding vector of each person of interest. The outputs of these two sets of layers are encrypted using the cryptosystem that will be discussed in subsection IV-C, and then the ciphertexts are sent to the cloud server. Finally, the server evaluates the last layer of the model by computing the inner product of the two vectors using their ciphertexts and applies a sigmoid activation function to the output to classify the two vectors as belonging to the same person or not. The key reason for this design is that most of the computations can be done in the plaintext domain, so only an efficient inner-product cryptosystem is needed to enable the server to execute the last layer and learn the locations visited by the persons of interest without being able to obtain the images or the embedding vectors of the persons of interest or the public places' visitors. The details of the cryptosystem and the secure inner product evaluation are discussed in subsection IV-C.
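The three-party split can be sketched in plaintext as follows. This is only an illustration of the data flow: the encryption step is omitted entirely, the layer sizes are deliberately small, the weights are random rather than trained, and every function name is hypothetical.

```python
import math
import random


def dense_tanh(x, weights, biases):
    """One fully connected layer with a tanh activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def branch(x, layers):
    """Run an embedding vector through one party's stack of layers."""
    for weights, biases in layers:
        x = dense_tanh(x, weights, biases)
    return x


def server_last_layer(u, v):
    """Inner product followed by a sigmoid.

    In the scheme this step runs over ciphertexts; here it is plaintext.
    """
    z = sum(a * b for a, b in zip(u, v))
    return 1.0 / (1.0 + math.exp(-z))


def random_layers(sizes, rng):
    """Hypothetical random initialization, only to make the sketch runnable."""
    return [([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]


rng = random.Random(0)
place_layers = random_layers([8, 6, 4], rng)  # evaluated by the public place
lea_layers = random_layers([8, 6, 4], rng)    # evaluated by the agency

visitor = [rng.uniform(-1, 1) for _ in range(8)]
suspect = [rng.uniform(-1, 1) for _ in range(8)]
score = server_last_layer(branch(visitor, place_layers),
                          branch(suspect, lea_layers))
```

Because the two branches meet only in a single inner product, the server needs no operation on protected data other than that product, which is what keeps the encrypted part of the evaluation small.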
To train the model, the federated learning algorithm FederatedAveraging [48] is employed. The idea is that each public place and the law enforcement agency create a local dataset in which each row contains two embedding vectors and a label that indicates whether the two vectors belong to the same person (the positive class) or to different persons (the negative class). Once the local datasets are created, the public places and the law enforcement agency participate in the training of the model by first training local models on their local datasets and then sharing their models' updates with the cloud server, as illustrated in Algorithm 1.
For every communication round t, the global model $w_t$ computed by the server is downloaded by each participant. Then, each participant computes the weight updates on its local dataset $p_k$ using the current version of the global model and sends these ephemeral, focused updates to the server. The cloud server combines the updates of the different participants by weighted averaging to create a more accurate global model $w_{t+1}$. Note that the initial global model $w_0$ is either initialized randomly or taken from a model pre-trained on a public dataset. The security of federated learning can be strengthened further by approaches such as privacy-preserving data aggregation [49] and differential privacy [50].
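One communication round of weighted averaging can be sketched as follows. The local training step is reduced to a single gradient-descent update with made-up gradients and a made-up learning rate, so only the aggregation logic should be read as representative. The function names are hypothetical.

```python
def federated_average(updates):
    """Sample-weighted average of the participants' weight vectors.

    updates: list of (weights, num_samples) pairs, one per participant.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


def local_update(global_weights, gradient, lr=0.1):
    """Hypothetical local step: one gradient-descent update from the global model."""
    return [w - lr * g for w, g in zip(global_weights, gradient)]


w_t = [0.0, 0.0]  # current global model
updates = [
    (local_update(w_t, [1.0, -2.0]), 100),  # participant with 100 samples
    (local_update(w_t, [3.0, 2.0]), 300),   # participant with 300 samples
]
w_next = federated_average(updates)  # next global model
```

Weighting by the number of local samples means that a participant with a larger dataset pulls the global model further toward its own update.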
Federated learning offers two main advantages. First, it preserves privacy because the participants do not need to reveal their sensitive data. Second, it is efficient because the participants share only the updates of the local models, whose size is much smaller than the size of the dataset.

V. PRIVACY AND SECURITY ANALYSIS
Our scheme satisfies the following propositions.

Proposition 1: The cloud server can learn the locations visited by a person of interest without being able to identify the visitors or the persons of interest.
Proof: We use the following notations to prove this proposition.
History. We denote the group of encrypted vectors resulting from the public places' evaluations of the neural networks on the embedding vectors of the visitors' images as a set of indices $I_P = \{I_1, I_2, \ldots, I_k\}$, which correspond to the vectors $W = \{w_1, \ldots, w_k\}$. In addition, we denote the group of encrypted vectors resulting from the evaluations of the law enforcement agency's neural networks on the embedding vectors of the person-of-interest images as trapdoors $T_{LEA} = \{T_1, T_2, \ldots, T_j\}$, which correspond to $V = \{v_1, \ldots, v_j\}$. The history is denoted as $Hst = \{I_P, T_{LEA}\}$.
Trace. It represents the information that the cloud server can deduce by analyzing the history $Hst$, denoted as $Tra(Hst)$, where $Tra(Hst)$ is defined over all the trapdoors, i.e., $Tra(Hst) = \{Tra(T_1), \ldots, Tra(T_j)\}$. The search pattern is an example of a trace.
View. It represents what the cloud server observes, namely the encrypted history and its trace, denoted by $View(I_P, T_{LEA}, Tra(Hst))$.
A simulator SIM wants to compute a false view $View'$ that is indistinguishable from the true view $View$ using the following steps.
1) SIM executes the oracle SystemSetup() to get a secret key $SK'$.
2) SIM computes a set of visitors' embedding vectors $W' = \{w'_1, \ldots, w'_k\}$ such that $|w'_i| = |w_i|$, $1 \le i \le k$, where $W'$ is a random copy of $W$.
3) SIM computes a set of embedding vectors for random images $V' = \{v'_1, \ldots, v'_j\}$ such that $|v'_i| = |v_i|$, $1 \le i \le j$. Note that $V'$ is a random copy of $V$.
4) SIM computes an index $I'_P$ and a trapdoor $T'_{LEA}$ using $SK'$, $W'$, and $V'$.
Based on the above construction, our scheme achieves adaptive indistinguishability if, for any SIM with a history $Hst' = \{I'_P, T'_{LEA}\}$ and a trace $Tra(Hst')$ similar to $Tra(Hst)$, an adversary cannot distinguish between the two views $View(I_P, T_{LEA}, Tra(Hst))$ and $View'(I'_P, T'_{LEA}, Tra(Hst'))$.

Proposition 2: Adversaries cannot obtain the embedding vectors underlying the outsourced indices of visitors and the trapdoors of persons of interest.
Proof: In our scheme, an inner product encryption cryptosystem is used to encrypt the vector that results from inputting the embedding vector of a visitor or a person of interest to a neural network. Without knowledge of the secret keys, decrypting the indices and trapdoors is infeasible. Because each public place uses a unique key, the indices of a public place cannot be decrypted with the secret keys of the other places. We conclude that our scheme is secure in the known-ciphertext model, where attackers cannot obtain the secret keys or the plaintext vectors using the trapdoors and indices.

Proposition 3: The indices of the same person are not linkable under the known-ciphertext model.
Proof: Our cryptosystem uses random numbers in the encryption process to ensure that the ciphertexts of the same embedding vector look different and are unlinkable. Specifically, for each visitor's embedding vector, the public place picks a random element $\alpha \leftarrow Z_q$ and uses it to compute the index, and the law enforcement agency uses $\beta \leftarrow Z_q$ in the computation of the trapdoor. As a result, every time an index or trapdoor is computed for the same image's embedding vector, it looks different. This feature is important to prevent linking the indices of the same person who visits different public places, because tracing the locations of a person for a long time may lead to the identification of the person from the visited locations.
Proposition 4: Each public place cannot decrypt the ciphertexts of other places because a shared key is not used, i.e., each public place has a unique secret key.
Proof: If a public place could decrypt the ciphertexts of other places, it could track the locations visited by the visitors because the plaintext embedding vectors of the same person are close, and then it could identify the persons from the visited locations. In our scheme, the indices computed by one public place cannot be decrypted by other places because the public places do not share a secret key; each place uses a unique key. Although the public places use different keys to compute the indices, the cloud server is still able to evaluate the machine learning model by computing the inner product of the indices and the trapdoors computed by the law enforcement agency. Moreover, a public place cannot use its secret key to compute the secret keys of the other public places because its key contains only $N_1^{-1}B'$ and $N_2^{-1}B''$, where $B' + B'' = B^{-1}$, and thus the public place cannot learn the master key components $N_1^{-1}$, $N_2^{-1}$, and $B^{-1}$. Nor can it learn the random matrices $B'$ and $B''$ that are used to compute the other public places' keys.
Proposition 5: The cloud server should not be able to match a large number of indices and trapdoors, to avoid leaking side information.
Proof: The cloud server should be able to match indices and trapdoors to find the locations visited by persons of interest without being able to identify the persons. However, if the cloud server has a large amount of data collected over a long period of time, it may use the data to infer statistical and side information, such as a large number of locations visited by an anonymous person of interest. To prevent the cloud server from collecting side information, the keys of the involved parties should change frequently, e.g., every month, so that the ciphertexts sent after updating the keys cannot be matched to the old ciphertexts because they are encrypted with different keys.

VI. EXPERIMENTAL RESULTS
In this section, the performance of the proposed scheme is evaluated using the following metrics: (1) computation and communication overhead, and (2) localization accuracy.

A. EVALUATIONS OF THE CRYPTOSYSTEM
Our scheme is implemented in Python on a machine with an Intel i7-8665U CPU (8 cores, 1.90 GHz) and 16 GB of RAM. This subsection discusses the communication and computation overhead of our scheme.

1) COMPUTATION OVERHEAD
The computation overhead is measured by the times needed by the law enforcement agency and the public places to encrypt a vector, by the cloud server to evaluate the model using encrypted data, and by the KDC to compute the keys. Table 2 gives the computation times of the main operations used by our scheme, where $T_B$, $T_E$, $T_M$, and $T_A$ stand for the times required for computing one bilinear pairing, exponentiation, multiplication, and addition, respectively. These operations are used in our scheme to compute keys, encrypt vectors, and measure the similarity of two vectors using their trapdoors and indices.
The last layer of the neural network evaluated by the public places and the law enforcement agency, whose output is encrypted, is composed of 16 elements. To compute the key of the law enforcement agency for a vector size of 16 elements, 4096 multiplication operations and 3840 addition operations are needed, i.e., $4096 \cdot T_M + 3840 \cdot T_A$, which takes around 20.48 + 8.1 = 28.58 ms using the measurements given in Table 2. The key of each public place requires 4096 multiplication operations and 4096 addition operations, i.e., $4096 \cdot T_M + 4096 \cdot T_A$, which takes 20.48 + 8.6 = 29.08 ms. Also, as shown in Figure 5, our scheme reduces the number of keys that need to be computed in the system from 2n in the cryptosystem [22] to n + 1, where n is the number of public places. This is because the cryptosystem [22] is designed for a setting with a single public place and a single law enforcement agency, where the law enforcement agency needs to share a unique key with each public place, while our scheme is designed for a setting with multiple public places and a single law enforcement agency, where each public place and the law enforcement agency use only one key.
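The operation counts quoted above match what a naive product of two 16 × 16 matrices would cost, optionally followed by one 16 × 16 matrix addition. The text states only the counts, so reading them as matrix operations is an assumption of this sketch, and the function names are hypothetical. The key-count comparison restates the 2n versus n + 1 figures.

```python
def matmul_op_counts(d):
    """Scalar operations in a naive product of two d x d matrices."""
    multiplications = d ** 3        # d multiplications per output entry
    additions = d * d * (d - 1)     # d - 1 additions per output entry
    return multiplications, additions


def keys_in_system(n, modified=True):
    """Keys needed for n public places: n + 1 in our scheme, 2n in [22]."""
    return n + 1 if modified else 2 * n


d = 16
mults, adds = matmul_op_counts(d)    # counts quoted for the agency's key
adds_with_matrix_sum = adds + d * d  # one extra d x d matrix addition
```

Under this reading, the 256 additional additions in the public place's key would correspond to one extra matrix sum, and the saving in keys grows linearly with the number of public places.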
To encrypt the vector of a person of interest by the law enforcement agency or the vector of a visitor by a public place, 33 exponentiation operations, 480 addition operations, and 544 multiplication operations are needed, i.e., $33 \cdot T_E + 480 \cdot T_A + 544 \cdot T_M$, which takes 39.47 + 1 + 2.72 ≈ 43.2 ms. To compute the inner product of two vectors using indices and trapdoors, the cloud server needs 33 bilinear pairings and 16 multiplication operations, i.e., $33 \cdot T_B + 16 \cdot T_M$, which takes 221.04 + 0.08 = 221.12 ms. Also, as shown in Figure 6, our scheme reduces the number of encryption operations that are needed for each vector of a person of interest from n in the cryptosystem [22] to only one. This is because in [22] the law enforcement agency needs to encrypt each vector n times with the n keys shared with the public places, while in our scheme the law enforcement agency has only one key and performs only one encryption operation.
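The component times quoted above can be tallied and extrapolated with a short sketch. The millisecond values are copied from the text. The all-pairs matching model in `server_time_seconds` (every index compared with every trapdoor, sequentially) is an assumption made for illustration, not a workload described in the paper, and the function names are hypothetical.

```python
# Component times in milliseconds, copied from the text.
ENCRYPT_MS = 39.47 + 1.0 + 2.72    # 33 exponentiations + 480 additions + 544 multiplications
INNER_PRODUCT_MS = 221.04 + 0.08   # 33 pairings + 16 multiplications


def server_time_seconds(num_indices, num_trapdoors):
    """Sequential time to compare every index with every trapdoor.

    Assumes all-pairs matching with no batching or parallelism.
    """
    return num_indices * num_trapdoors * INNER_PRODUCT_MS / 1000.0


def lea_encryptions(num_poi_vectors, n_places, modified=True):
    """Encryptions performed by the law enforcement agency.

    One per vector in our scheme; n per vector in the cryptosystem of [22].
    """
    return num_poi_vectors * (1 if modified else n_places)


example_seconds = server_time_seconds(100, 10)
```

Under these assumptions the pairing-based inner product dominates the server's cost, so the server's workload, rather than encryption at the public places, is where parallelism would help most.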
Based on the results given above, the computation times range from tens to a few hundred milliseconds. This indicates that our scheme is efficient, practical, and scalable. Scalability is important in our application because the public places may be visited by a large number of persons, so they need to perform many encryptions and the cloud server needs to perform many localization operations.

2) COMMUNICATION OVERHEAD
The communication overhead is measured by the size of the messages sent to the cloud server and by the number of keys that are distributed to the system's parties. The last layer of the neural network at the public places and the law enforcement agency is composed of 16 elements, as given in Section VI-B2. Using an asymmetric pairing curve (BN256) of size 256 bits, where the size of a group element is 32 bytes, the size of each encrypted vector (index or trapdoor) is 33 × 32 bytes (1.056 KB). Also, as shown in Figure 6, our scheme reduces the number of encryptions, and hence the number of ciphertexts, needed for each vector of a person of interest from n in the cryptosystem [22] to only one. Each key has two matrices with 16 × 16 elements in $Z_q$. The key size is 2 × 16 × 16 × 16 bytes = 8 KB, where the size of each element in $Z_q$ is 16 bytes. Figure 5 indicates that our scheme reduces the number of keys that need to be distributed in the system from 2n in the cryptosystem [22] to n + 1, where n is the number of public places.
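The message and key sizes follow directly from the element sizes given above, as the following sketch shows. The constants are taken from the text. `upload_bytes` is a hypothetical helper that counts ciphertext payload only and ignores protocol overhead.

```python
GROUP_ELEMENT_BYTES = 32  # BN256 group element size stated in the text
ZQ_ELEMENT_BYTES = 16     # size of an element of Z_q stated in the text


def ciphertext_bytes(num_group_elements=33):
    """Size of one encrypted vector (index or trapdoor)."""
    return num_group_elements * GROUP_ELEMENT_BYTES


def key_bytes(dim=16, num_matrices=2):
    """Size of one key made of `num_matrices` dim x dim matrices over Z_q."""
    return num_matrices * dim * dim * ZQ_ELEMENT_BYTES


def upload_bytes(num_vectors):
    """Payload a party uploads for `num_vectors` encrypted vectors."""
    return num_vectors * ciphertext_bytes()
```

For example, under these figures a public place that encrypts one thousand visitor vectors uploads roughly one megabyte of ciphertext.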
The results given above indicate that the communication overhead of our scheme is acceptable and that existing communication protocols can transmit the encrypted vectors in a short time. We can conclude that our scheme is efficient and practical.
Figure 5. The total number of keys in the system using the cryptosystem in [22] (single public place and single law enforcement agency setting) and our scheme (multiple public places and single law enforcement agency setting).
The datasets used in our experiments to evaluate our machine learning model include the IRIS Dataset [43], the Head Pose Image Dataset (HPID) [44], the Georgia Tech Face Dataset [44], the Yale Face Dataset [47], the FEI Face Database [46], and the Extended Yale Face Dataset B (EYaleB) [45]. Each dataset is processed and assumed to belong to one public place. The subjects in each dataset were divided into two groups. The first group in dataset i is selected randomly and denoted as $X^{i}_{POI}$, and it represents the images of the persons of interest. The second group, denoted as $X_{PP}$, represents the images of the visitors of the public places. The two groups have an equal number of images.

2) RESULTS AND DISCUSSION
The Python Dlib [33] face_recognition library is used to generate the embedding vectors of each dataset. Each embedding vector is normalized to unit norm using l2-normalization, which allows the Euclidean distance to be determined from the dot product of any two embedding vectors. As shown in Fig. 4, the input of our model is two embedding vectors, and a feed-forward architecture is used. We use 5-fold cross validation to find the optimal values for the model's hyper-parameters, such as the type of activation functions and the number of layers, and the top performing model is selected. We found that the top performing model has eight hidden layers with the dimensions [128, 128, 64, 64, 32, 32, 16, 16], the hyperbolic tangent activation function, and the Adam optimizer.
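The link between the Euclidean distance and the dot product after l2-normalization follows from the identity ||u − v||² = ||u||² + ||v||² − 2 u·v, which reduces to 2 − 2 u·v for unit vectors. The sketch below checks this numerically on toy two-dimensional vectors. The helper names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def squared_distance_from_dot(u, v):
    """For unit vectors, ||u - v||^2 = 2 - 2 * (u . v)."""
    return 2.0 - 2.0 * dot(u, v)


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
direct = sum((x - y) ** 2 for x, y in zip(a, b))  # distance computed directly
via_dot = squared_distance_from_dot(a, b)         # distance from the dot product
```

This is why a cryptosystem that reveals only an inner product is sufficient to reproduce a Euclidean-distance baseline on normalized embeddings.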
Our privacy-preserving federated learning model (i.e., with encrypted vectors, denoted as MD + Privacy) is compared to two baselines. The first baseline is a federated learning model without privacy (denoted as MD). For federated learning, each dataset is partitioned into training and testing sets with a ratio of 5:1. The architecture of all the models is the same. The second baseline uses the Euclidean distance metric for localization instead of a machine learning model, where each dataset is used to compute the threshold needed to locate the persons of interest. The results are given in Table 3.
The results indicate that our scheme performs better than the Euclidean distance approach because, instead of using a threshold to decide the similarity of an index and a trapdoor, we use a machine learning model that can learn the features of the embedding vectors of the same person and thus make accurate decisions. Note that, as discussed earlier, it may be difficult to find a threshold that gives good performance when images are taken from different sources. Moreover, the results indicate that the privacy-preserving model (MD + Privacy) performs almost the same as the plaintext model (MD) in terms of overall performance. This indicates that executing the model over encrypted data using our cryptosystem does not noticeably degrade the accuracy of the model.
Figure 7 shows the number of communication rounds needed by the federated averaging algorithm to converge. The figure shows that our model starts to achieve over 90% training accuracy after the first 20 communication rounds, and it takes around 80 rounds for the loss and the validation accuracy to converge. These results indicate that the developed neural network architecture can be trained efficiently using federated learning.
Figure 8 shows the Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) for MD, MD + Privacy, and the Euclidean distance approach. The black line indicates the performance of a random classifier. The results of MD and MD + Privacy indicate that the performance of our scheme with privacy preservation is comparable to that of the scheme without privacy preservation, with almost no performance loss. In addition, both MD and MD + Privacy outperform the Euclidean distance approach.
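For readers unfamiliar with the AUC, it can be computed as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The sketch below implements this pairwise definition on invented scores. It is not the evaluation code behind Figure 8, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores.

    A positive that outscores a negative counts 1, a tie counts 0.5.
    labels use 1 for the positive class and 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], labels)      # every positive ranks first
chance_like = roc_auc([0.5, 0.5, 0.5, 0.5], labels)  # scores carry no information
one_mistake = roc_auc([0.9, 0.3, 0.4, 0.1], labels)  # one positive ranks too low
```

An uninformative scorer lands at 0.5, which is the level represented by the black reference line in the ROC plot.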

VII. CONCLUSION
This paper proposes an accurate person localization scheme that enables a law enforcement agency to locate persons of interest with privacy preservation. Our scheme trains a machine learning model to decide whether two embedding vectors storing facial features belong to the same person. Using six publicly available datasets, our experimental results indicate that our approach is more accurate than the existing approaches that measure the Euclidean distance, because the optimal decision threshold of one dataset might not be optimal for the other datasets. Our machine learning model is designed so that executing it over encrypted data is efficient. Most of the model's layers are executed on plaintext data by the public places and the law enforcement agency, and only one layer is executed by the server over encrypted data using an inner product encryption scheme to preserve privacy. We have also modified an inner product encryption cryptosystem that is designed for a single public place to make it more efficient in our application, which has multiple public places. Our experiments indicate that this modification can significantly reduce the number of keys in the system and the number of ciphertexts that are computed by the law enforcement agency. To prevent leaking sensitive information by sharing the images of the visitors to train the model, we use a federated learning training approach. Our experiments indicate that our scheme has high localization accuracy and that the use of federated learning and the execution of a part of the model over encrypted data have only a slight impact on the accuracy. A formal proof and extensive analysis confirm that our scheme can preserve the privacy of the public places' visitors.