PrivFT: Private and Fast Text Classification With Homomorphic Encryption

We present an efficient and non-interactive method for Text Classification that preserves the privacy of the content using Fully Homomorphic Encryption (FHE). Our solution, named Private Fast Text (PrivFT), provides two services: 1) performing inference on encrypted user inputs using a plaintext model and 2) training an effective model using an encrypted dataset. For inference, we use a pre-trained plaintext model and outline a system for homomorphic inference on encrypted user inputs with zero loss to prediction accuracy compared to the non-encrypted version. In the second part, we show how to train a supervised model using fully encrypted data to generate an encrypted model. For improved performance, we provide a GPU implementation of the Cheon-Kim-Kim-Song (CKKS) FHE scheme that shows 1 to 2 orders of magnitude speedup over existing implementations. We build PrivFT on top of our FHE engine in GPUs to achieve a run time per inference of 0.17 seconds for various Natural Language Processing (NLP) public datasets. Training on a relatively large encrypted dataset is more computationally intensive, requiring 5.04 days.


Introduction
As Machine Learning (ML) has been widely adopted in critical electronic systems that may deal with private data, preserving the privacy of data has become one of the major issues facing the technology. This has directed researchers to look into privacy-preserving techniques to tackle this problem. One promising technique is Fully Homomorphic Encryption (FHE), which is a new class of encryption schemes that allow computing on encrypted data (Gentry, 2009). FHE has been shown to be useful in a wide range of privacy-preserving applications, especially in ML (Dowlin et al., 2016; Badawi et al., 2018; Chou et al., 2018), statistical analysis (Aslett et al., 2015; jie Lu et al., 2016), deep learning (Dowlin et al., 2016; Badawi et al., 2018) and Genome-Wide Association Studies (GWAS) (Wang et al., 2015; Lu et al., 2015). Despite its incredible capabilities, FHE suffers from two main issues: 1) high computational overhead and 2) a limited arithmetic set (only addition and multiplication on encrypted data). This means that one needs to build the desired function as a circuit so it can be evaluated with FHE.
In this paper, we choose a shallow neural network (fasttext) (Joulin et al., 2016) for the task of Text Classification. Using simple techniques, the model achieves results competitive with those of more complex architectures. More importantly, this choice allows us to adapt FHE and perform classification directly over encrypted text data. Using GPUs, the inference steps can achieve significant speedup for practical real-time applications. Finally, we show how an untrusted server can train this model using encrypted data to generate an encrypted model for the data owner.

Tasks Overview
The system proposed in this paper, which we call PrivFT, performs two main tasks as described below:
• Homomorphic inference on encrypted data: in this task, we focus on how to perform inference on encrypted input texts. We consider a secure Machine Learning as a Service (MLaaS) system where an NLP model, previously trained on non-encrypted data, is stored in the cloud. The client, or data owner, uses a homomorphic encryption scheme to transform her plaintext input into encrypted form (ciphertext) and sends it to the cloud model for inference. This evaluation generates an encrypted output that is sent back to the client, who can decrypt it and obtain the output in plaintext form.
• Homomorphic training on encrypted data: in this task, we focus on how to train a model from scratch using an encrypted dataset. Here, the data owner sends her encrypted data to the cloud, which performs a batched training algorithm using back-propagation to learn an encrypted model (i.e. training operations are done entirely in ciphertext space). The encrypted model is sent back to the client, who can decrypt it and use it for local inference.
We emphasize that in both tasks no decryption takes place at the cloud side, which makes our solution as secure as the encryption scheme itself. To the best of our knowledge, the homomorphic encryption scheme used here is still considered secure when the parameters are set appropriately.
To give a motivational use-case of text inference on encrypted data, the fully encrypted e-mail service shown in Figure 1 can be useful. In this system, Alice composes an e-mail and encrypts it using Bob's public key. The encrypted e-mail is sent to the mail server, which can still run a spam detection algorithm homomorphically with FHE. The encrypted result of spam detection and the e-mail are forwarded to Bob, who can decrypt them and decide whether to open the e-mail or discard it. This can also be applied to other similar use-cases such as targeted advertising, personalized marketing, recommendation systems, etc.
Our main contributions can be summarized as follows:
• We provide the first GPU implementation of a Residual Number System (RNS) variant of the Cheon-Kim-Kim-Song (CKKS) levelled FHE scheme.
• We demonstrate how homomorphic inference can be performed on encrypted data for text classification.
• We demonstrate how to train a model using an encrypted dataset with FHE.
• We provide benchmarking experiments to evaluate the performance of our solution.

Organization of the paper
The rest of the paper is organized as follows. Section 2 reviews the basic terminology and concepts the paper builds on. Section 3 describes how PrivFT performs both inference and training on encrypted data. Section 4 provides the implementation details of the CKKS scheme and PrivFT. We present our experimental results in Section 5. Finally, Section 6 concludes the work and provides directions for future work.

Notations
We use capital letters to refer to sets and small letters for elements of a set. The sets $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$, and $\mathbb{C}$ denote the integers, rationals, reals and complex numbers, respectively. We use capital bold letters for matrices and bold small letters for vectors. The symbols $\lceil \cdot \rceil$, $\lfloor \cdot \rfloor$, and $\lfloor \cdot \rceil$ denote the round up, round down and round to nearest integer, respectively. The notation $|a|_q$ denotes the remainder of $a$ divided by $q$ in the balanced set $\{-\frac{q}{2}, \ldots, \frac{q-1}{2}\}$. Finally, sampling $a$ from a set $S$ is denoted as $a \leftarrow S$.
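As a small illustration of the balanced remainder $|a|_q$ defined above, the following Python sketch maps an integer into the range centred on zero. The function name `centered_mod` is a hypothetical helper introduced only for this example, and the handling of even $q$ (the range runs from $-q/2$ to $q/2 - 1$) is an assumption about how the balanced set is rounded.

```python
def centered_mod(a: int, q: int) -> int:
    """Remainder of a modulo q, taken from the balanced range around zero."""
    r = a % q                # Python's % already returns a value in [0, q)
    if r > (q - 1) // 2:     # the upper half of [0, q) maps to negative representatives
        r -= q
    return r


if __name__ == "__main__":
    # For q = 7 the representatives are -3, ..., 3.
    print([centered_mod(a, 7) for a in range(7)])
```

Keeping representatives small in absolute value is what allows small negative numbers (such as noise terms) to be stored modulo $q$ and recovered afterwards.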

Fully Homomorphic Encryption
FHE schemes are cryptographic constructions that provide the ability to compute on encrypted data without decryption (Gentry, 2009). Unlike classic encryption schemes, FHE maps the input clear text data (or plaintexts $P$) to encrypted data (or ciphertexts $C$) such that the algebraic structure between $P$ and $C$ is preserved over addition and multiplication. Let $a, b \in P$ and let $h$ denote the encryption operation; then $h(a) \oplus h(b) = h(a + b)$ and $h(a) \otimes h(b) = h(a \cdot b)$, where $\oplus$ and $\otimes$ are homomorphic addition and multiplication, respectively, and where equality is achieved after decryption. This allows one to evaluate arbitrary computations (modeled as circuits) on encrypted data by only manipulating the ciphertexts. Modern FHE schemes conceal plaintext messages with noise that can be identified and removed with the secret key (Brakerski and Vaikuntanathan, 2011). As we compute on encrypted data, the noise magnitude increases at a certain rate (high rate for multiplication and low rate for addition). As long as the noise is below a certain threshold, which depends on the encryption parameters, decryption can filter out the noise and retrieve the plaintext message successfully. Although FHE schemes include a primitive (known as bootstrapping) to refresh the noise (Gentry, 2009), it is extremely computationally intensive. Instead, one can use a levelled FHE scheme (Brakerski et al., 2014) that allows evaluating circuits of multiplicative depth below a certain threshold, which can be controlled by the system parameters. The literature includes various FHE schemes that vary in the underlying mathematical structures used, capabilities and performance. This work uses the CKKS levelled FHE scheme (Cheon et al., 2017), which was proposed specifically to deal with floating-point numbers.
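Since a levelled scheme is parameterised by the multiplicative depth of the circuit, it can help to see how that depth is counted. The sketch below is only an illustration on plaintext expression trees: the classes `Var`, `Add`, `Mul` and the function `mult_depth` are hypothetical names introduced here, and the convention that additions cost no depth is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Var:
    name: str


@dataclass
class Add:
    left: "Expr"
    right: "Expr"


@dataclass
class Mul:
    left: "Expr"
    right: "Expr"


Expr = Union[Var, Add, Mul]


def mult_depth(e: Expr) -> int:
    """Largest number of multiplications on any path from an input to the output."""
    if isinstance(e, Var):
        return 0
    deepest_child = max(mult_depth(e.left), mult_depth(e.right))
    # Only multiplications consume a level; additions are treated as free here.
    return deepest_child + 1 if isinstance(e, Mul) else deepest_child


if __name__ == "__main__":
    x, y, z, w = Var("x"), Var("y"), Var("z"), Var("w")
    print(mult_depth(Add(Mul(x, y), z)))          # x*y + z
    print(mult_depth(Mul(Mul(Mul(x, y), z), w)))  # ((x*y)*z)*w
```

Rewriting a product chain as a balanced tree lowers the depth for the same number of multiplications, which is why circuit shape matters when choosing the number of levels.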
Although FHE is known for being notoriously slow (Naehrig et al., 2011), a number of major advancements have improved its performance dramatically, such as (1) packing methods (Smart and Vercauteren, 2014), which allow one to pack a vector of plaintext items in one ciphertext, enabling vectorized homomorphic operations without extra cost, (2) fast modular arithmetic that replaces slow multi-precision operations with embarrassingly parallel native operations (Bajard et al., 2016; Halevi et al., 2018; Cheon et al., 2018), and (3) hardware acceleration via GPUs (Al Badawi et al., 2018; Al Badawi et al., 2019), which can provide a speedup of 1 to 2 orders of magnitude over CPU implementations.

Text Classification
Text Classification is a task in Natural Language Processing (NLP) with numerous applications such as Sentiment Analysis, Spam Detection, Topic Classification and Document Classification. Much of the recent NLP research has focused on transfer learning techniques such as pre-training word embeddings (Pennington et al., 2014; Mikolov et al., 2013), or pre-training language models on larger datasets and fine-tuning them for task-specific learning (Radford et al., 2017; Peters et al., 2018; Devlin et al., 2018; Howard and Ruder, 2018). SentimentUnit (Radford et al., 2017) employs a single layer multiplicative Long Short Term Memory (LSTM), while ELMO (Peters et al., 2018) adopts a base architecture which contains multiple layers of Bidirectional LSTM. BERT (Devlin et al., 2018), on the other hand, uses the Transformer architecture (Vaswani et al., 2017), which consists of stacked attention layers. ULMFit uses the AWD-LSTM (Merity et al., 2017) for pre-training and, after fine-tuning, achieves impressive results across a variety of text classification tasks.
While these approaches achieved remarkable results across many NLP tasks, they are prohibitively expensive for FHE adaptation. As mentioned above, FHE schemes typically introduce a noise term in the ciphertext which grows with every addition or multiplication operation. Depending on the encryption parameters, the decryption of such a ciphertext only yields the correct result when this noise term is small enough. As a result, both the recurrence functions in LSTM, as used in SentimentUnit, ELMO and ULMFit, and the stacked Transformer attentions in BERT result in a level of multiplicative depth that would corrupt the result of the computations once decrypted.
Motivated by this limitation, we choose fasttext (Joulin et al., 2016) for our work as it is a shallow network consisting of only two layers: an embedding layer and an output fully connected layer. The input to the model is a vector $\mathbf{x} = [x_1, x_2, \ldots, x_w]$ where $x_i$ is a positive integer in $\{1, \ldots, m\}$. The embedding layer uses a hidden lookup matrix $\mathbf{H}$ of size $m \times n$, where $n$ is the dimension of the real vectors. The lookup vectors are averaged over all $w$ to produce a single vector $\mathbf{h} = \frac{1}{w} \sum_{i=1}^{w} \mathbf{H}(x_i) \in \mathbb{R}^n$. The output layer uses a matrix $\mathbf{O}$ of size $n \times c$ to map the result into the output space containing $c$ classes. For training, a softmax function is used to compute the loss function. At inference time, however, one only needs to compute the unnormalized scores $\mathbf{s} = \mathbf{O}^{T}\mathbf{h} \in \mathbb{R}^c$. Thus, the multiplicative depth in a single inference pass is only 3 (2 vector-matrix multiplications and the average computation), which is an attractive level for FHE implementation.
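The forward pass described above can be sketched on plaintext data as follows. This is a minimal illustration, not the fasttext library's API: the function name `fasttext_scores` is hypothetical, plain Python lists stand in for the matrices, and word indices are 0-based here (the text uses 1-based indices).

```python
from typing import List


def fasttext_scores(x: List[int],
                    H: List[List[float]],
                    O: List[List[float]]) -> List[float]:
    """Unnormalized class scores for one document.

    x: word indices (0-based in this sketch), length w
    H: m x n embedding (lookup) matrix
    O: n x c output matrix
    """
    n = len(H[0])
    c = len(O[0])
    w = len(x)
    # Embedding lookup followed by averaging: h = (1/w) * sum_i H[x_i]
    h = [sum(H[i][k] for i in x) / w for k in range(n)]
    # Output layer without softmax: s_j = sum_k h_k * O[k][j]
    return [sum(h[k] * O[k][j] for k in range(n)) for j in range(c)]


if __name__ == "__main__":
    H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 words, n = 2 dimensions
    O = [[1.0, 2.0], [3.0, 4.0]]               # n = 2, c = 2 classes
    scores = fasttext_scores([0, 2], H, O)
    print(scores, "predicted class:", scores.index(max(scores)))
```

Because the predicted class is the arg-max of the scores, the softmax normalization can be skipped at inference time without changing the prediction.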

The CKKS levelled FHE Scheme
In order to describe the CKKS scheme (Cheon et al., 2017), we first need to introduce some notation. Let $R$ be the cyclotomic polynomial ring $\mathbb{Z}[X]/\langle X^N + 1 \rangle$, where $N$ (the ring dimension) is a power of 2. Let $q_L > q_{L-1} > \cdots > q_1$ be $L$ positive integers such that $q_l = \prod_{i=1}^{l} p_i$, where the $p_i$'s are 30-bit prime numbers. In CKKS, $L$ is the maximum multiplicative depth supported, and at any level $l = 1, \ldots, L$, arithmetic operations are performed in $R_{q_l} = \mathbb{Z}_{q_l}[X]/\langle X^N + 1 \rangle$. Note that a freshly encrypted ciphertext is at level $L$ and moves to a lower level as further computation is performed. Denote DFT and IDFT as the Discrete Fourier Transform and its inverse, respectively. Given input data represented as real or complex numbers, we use a modified version of CKKS (Cheon et al., 2018) for encryption (Enc) and decryption (Dec) as follows:
• SETUP: given a desired security level $\lambda$ and maximum computation levels $L$, initialize CKKS by setting $N$, two uniform random distributions, $\chi_{key}$ over $R_2$ and $\chi_{q_L}$ over $R_{q_L}$, and a zero-centered discrete Gaussian distribution $\chi_{err}$ with standard deviation $\sigma$ over $R$. The secret key $s$ is sampled from $\chi_{key}$ and the public key is $pk = (a, b) \in R_{q_L}^2$, where $a \leftarrow \chi_{q_L}$ and $b = -as + e$ with $e \leftarrow \chi_{err}$.
• ENCODE($\mathbf{v}$, $\rho$): given a vector of complex numbers $\mathbf{v} \in \mathbb{C}^{N/2}$ and precision $\rho$, return a polynomial $\mu = \text{IDFT}(2^{\rho}\mathbf{v}) \in R$.
• ENC($\mu$): given a plaintext message $\mu$, sample $u \leftarrow \chi_{q_L}$ and $e_0, e_1 \leftarrow \chi_{err}$. Return the ciphertext $ct = (c_0, c_1) = (bu + \mu + e_0, au + e_1)$.
To enable computations in the ciphertext space, the following homomorphic operations are given:
• HADD($ct_0$, $ct_1$): homomorphic addition takes two ciphertexts $ct_0 = (c_{00}, c_{01})$ and $ct_1 = (c_{10}, c_{11})$ (at the same level $l$) and returns their component-wise sum $ct_{+} = (c_{00} + c_{10}, c_{01} + c_{11}) \in R_{q_l}^2$.
• HMUL($ct_0 = (c_{00}, c_{01})$, $ct_1 = (c_{10}, c_{11})$): homomorphic multiplication takes two ciphertexts (at the same level $l$) and returns $ct_{\times} = (c_{00}c_{10}, c_{00}c_{11} + c_{01}c_{10}, c_{01}c_{11}) \in R_{q_l}^3$. Note that a procedure known as relinearization (Cheon et al., 2017) can be used to reduce $ct_{\times}$ back to two elements in $R_{q_l}^2$.
• HADDPLAIN($ct$, $pt$): homomorphic addition of a ciphertext $ct = (c_0, c_1) \in R_{q_l}^2$ and a plaintext $pt \in R$ returns the ciphertext $ct' = (c_0 + pt, c_1)$.
CKKS mimics fixed-point arithmetic for approximate computing on encrypted numbers. Input real numbers are scaled with a fixed-precision factor and rounded to the nearest integer (quantization). For instance, the value 3.14159 can be represented as the integer 3142 together with a scale factor of 1/1000. To maintain a fixed precision of the intermediate values, CKKS offers an efficient rescaling procedure (RESCALE) to remove the least significant bits of intermediate results. For instance, after multiplying two messages $m_1$ and $m_2$, each scaled with factor $\rho$, RESCALE produces a rounded version of the product $\rho m_1 m_2$.
• RESCALE($ct$, $l'$): given a ciphertext at level $l$ and $l' = l - 1$, return $ct' = \lfloor \frac{q_{l'}}{q_l} \cdot ct \rceil \in R_{q_{l'}}^2$.
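The fixed-point behaviour that RESCALE maintains can be illustrated on plain integers, without any encryption. The sketch below is only an analogy under stated assumptions: `encode`, `rescale_int` and `decode` are hypothetical helper names, and a decimal scale of 1000 is used for readability where CKKS would use a power of two.

```python
def encode(x: float, scale: int) -> int:
    """Quantize a real number to an integer carrying the given scale."""
    return round(x * scale)


def rescale_int(v: int, scale: int) -> int:
    """Divide by the scale with rounding, dropping the least significant digits."""
    return (v + scale // 2) // scale


def decode(v: int, scale: int) -> float:
    return v / scale


if __name__ == "__main__":
    scale = 1000
    a = encode(3.14159, scale)           # integer 3142, standing for 3.142
    b = encode(2.5, scale)               # integer 2500, standing for 2.5
    product = a * b                      # now carries scale * scale
    product = rescale_int(product, scale)  # back to a single factor of scale
    print(a, b, product, decode(product, scale))
```

Without the rescaling step, the scale would square after every multiplication and the integers would quickly outgrow the modulus, which is why each multiplication in CKKS is usually followed by RESCALE and consumes one level.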
As mentioned in Section 2.2, one can drastically improve FHE performance via packing methods. In CKKS, a vector of up to $N/2$ complex numbers can be encoded in a single plaintext element. This allows one to perform Single-Instruction Multiple-Data (SIMD) homomorphic operations on packed ciphertexts for free. Packing can be viewed as if the ciphertext has independent slots, each concealing one data item. To facilitate later discussion on packed ciphertext arithmetic, we will need the following utility function:
• ROTATE($ct$, $r$): the packing slots inside a ciphertext can be rotated by computing $ct' = (c_0(X^r), c_1(X^r))$.
The reader is referred to the referenced papers for proofs on the correctness and security of the scheme.

PrivFT
In this section, we discuss the two main tasks in PrivFT: inference and training with FHE.

Inference with FHE
To adapt the initial lookup and average-pooling layers in fasttext to FHE, we first require the client to encode her text ($\mathbf{x}$) into a one-hot vector ($\mathbf{v}$) of length $m$, where $v_i = 1 \iff \mathbf{x}$ contains word $i$. Thus, these two layers can be substituted by a vector-matrix multiplication ($\mathbf{v}^{T}\mathbf{H}$) followed by a scalar division (for the average). This relieves the server of the need to perform an expensive homomorphic dictionary look-up operation. Since $m$ is usually large, we pack the hidden matrix $\mathbf{H}$ vertically to improve performance. Specifically, the server encodes the weight matrices of the hidden and output layers as shown in Figure 2. To encode $\mathbf{H}$, we use $n \times \lceil m/t \rceil$ plaintexts, where $t$ is the number of slots in a plaintext/ciphertext. For the output matrix $\mathbf{O}$, we assume that the number of classes $c$ is less than $t$ and pack the weights horizontally, requiring only $n$ plaintexts. If $c > t$, multiple plaintexts can be used.
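The client-side encoding described above can be sketched as follows. This is a plaintext illustration only: the function name `one_hot_chunks` is hypothetical, word indices are 0-based, and each returned chunk stands for the slot contents of one ciphertext before encryption.

```python
import math
from typing import List


def one_hot_chunks(word_ids: List[int], m: int, t: int) -> List[List[int]]:
    """Encode a document as a 0/1 vector of length m, split into chunks of t slots."""
    v = [0] * m
    for i in word_ids:
        v[i] = 1                          # v_i = 1 iff the text contains word i
    num_chunks = math.ceil(m / t)         # one ciphertext per chunk
    v += [0] * (num_chunks * t - m)       # zero-pad the last chunk to t slots
    return [v[k * t:(k + 1) * t] for k in range(num_chunks)]


if __name__ == "__main__":
    chunks = one_hot_chunks([1, 5, 9], m=10, t=4)
    print(len(chunks), chunks)
    # Chunk count for a 500,000-word vocabulary and 4096 slots:
    print(math.ceil(500_000 / 4096))
```

The number of chunks is what determines how many ciphertexts the client must upload per document, so the vocabulary size $m$ trades accuracy against communication cost.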
To compute ($\mathbf{v}^{T}\mathbf{H}$), the dot-product is computed via element-wise ciphertext-plaintext multiplication ($h_j = \text{HMULPLAIN}(v_i, \text{Ptxt}_{j,i})$), followed by $\lceil m/t \rceil$ additions HADD($h_j$, $h_{j+1}$) to generate a single ciphertext $ct$. The elements in the slots of $ct$ can be summed to generate $h_j$ using the TotalSum procedure (Halevi and Shoup, 2014) in $O(\log N)$ time complexity, as shown in Algorithm 1. A similar approach is used to multiply with the output matrix $\mathbf{O}$. The ciphertext holding the class scores is communicated back to the client, who can decrypt it and find the best class. Note that we suffer zero loss to inference accuracy with this setup.
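The packed dot product and the slot summation can be simulated on plaintext lists to see why a logarithmic number of rotations suffices. The sketch below makes several assumptions: Python lists stand in for ciphertext slots, `hmul_plain`, `hadd`, `rotate`, `total_sum` and `packed_dot` are hypothetical stand-ins for the homomorphic operations, rotation is taken to be a cyclic left shift, and the slot count is a power of two.

```python
from typing import List

Slots = List[float]


def hmul_plain(ct: Slots, pt: Slots) -> Slots:
    """Slot-wise product of a (simulated) ciphertext and a plaintext."""
    return [a * b for a, b in zip(ct, pt)]


def hadd(a: Slots, b: Slots) -> Slots:
    """Slot-wise sum of two (simulated) ciphertexts."""
    return [x + y for x, y in zip(a, b)]


def rotate(ct: Slots, r: int) -> Slots:
    """Cyclic left rotation of the slots by r positions."""
    r %= len(ct)
    return ct[r:] + ct[:r]


def total_sum(ct: Slots) -> Slots:
    """Sum all slots with log2(t) rotations; every slot ends up holding the total."""
    shift = 1
    while shift < len(ct):               # len(ct) is assumed to be a power of two
        ct = hadd(ct, rotate(ct, shift))
        shift *= 2
    return ct


def packed_dot(v_chunks: List[Slots], column_chunks: List[Slots]) -> Slots:
    """Dot product of a chunked input vector with one chunked column of H."""
    acc = hmul_plain(v_chunks[0], column_chunks[0])
    for v, p in zip(v_chunks[1:], column_chunks[1:]):
        acc = hadd(acc, hmul_plain(v, p))
    return total_sum(acc)


if __name__ == "__main__":
    v = [[0, 1, 0, 0], [1, 0, 0, 1]]
    col = [[0.5, 1.5, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
    print(packed_dot(v, col))
```

After the final step the result is replicated in every slot, which is convenient when the value has to be multiplied slot-wise with the next layer's packed weights.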

Training with FHE
In order to train the network with FHE, a number of challenges have to be addressed. First, the maximum multiplicative depth of the training circuit should be minimized in order to avoid bootstrapping. We use gradient descent with large mini-batch size to reduce the number of weight updates. Moreover, we use a small number of epochs to train the model. The second challenge is how to evaluate the loss function with FHE. In fasttext, the authors use Softmax as a loss function. Since FHE provides only addition and multiplication, evaluating Softmax as a circuit can be very expensive. Instead, we use a shallow approximation polynomial that can be evaluated cheaply with FHE. We use the polynomial $\frac{1}{8}X^2 + \frac{1}{2}X + \frac{1}{4}$, which showed no noticeable loss in accuracy.

Algorithm 1 TotalSum
Input: ciphertext ct encrypting vector v and number of slots t
Output: a ciphertext encrypting the total sum of the elements in v, duplicated in all slots
1: for i = 0 to log_2(t) − 1 do
2:   tmp = ROTATE(ct, 2^i)
3:   ct = HADD(ct, tmp)
4: end for
5: return ct
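The quadratic polynomial used in place of Softmax during training needs only additions and multiplications, which is what makes it FHE-friendly. The following plaintext sketch evaluates it in Horner form; the function name `approx_poly` is hypothetical, and how the polynomial is wired into the loss computation is not shown here.

```python
def approx_poly(x: float) -> float:
    """Evaluate x^2/8 + x/2 + 1/4 using additions and multiplications only."""
    # Horner form: ((1/8) * x + 1/2) * x + 1/4
    return (0.125 * x + 0.5) * x + 0.25


if __name__ == "__main__":
    for x in (-2.0, 0.0, 2.0):
        print(x, approx_poly(x))
```

A low-degree polynomial keeps the number of consecutive multiplications per evaluation small, which matters because every training step adds to the total multiplicative depth of the circuit.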

GPU Implementation of CKKS
We implement the RNS variant (Cheon et al., 2018) of the CKKS levelled FHE scheme (Cheon and Kim, 2018) using CUDA 10. The usage of RNS allows implementing the scheme using native 32- and 64-bit operations with high parallelism instead of slow, serial multi-precision operations. Core polynomial arithmetic, RNS and Discrete Galois Transform (DGT) tools were obtained from the A*HE GPU library (Al Badawi et al., 2018; Al Badawi et al., 2019) for high parallelism and improved performance. Below is a brief description of our implementation of the new tools required by CKKS.
We instantiate CKKS again on the cyclotomic polynomial ring $R$ and use ciphertext moduli $q_L > q_{L-1} > \cdots > q_1$. At any level $l$, arithmetic is performed in $R_{q_l} = \mathbb{Z}_{q_l}[X]/\langle X^N + 1 \rangle$, modulo both $q_l$ and $X^N + 1$. Note that a freshly encrypted ciphertext is at level $L$. As we compute further on the ciphertext, we switch down to a lower modulus via the RESCALE operation.
Our implementation of CKKS utilizes the core polynomial arithmetic provided by A*HE, which implements the Fan-Vercauteren (FV) FHE scheme (Fan and Vercauteren, 2012). The major changes, however, are the encoding, decoding and [...] The error generated from this approximation is added to the inherent noise included in FHE and was found to be negligible in our experiments. Our implementation of RESCALE can be used to divide by the last prime modulus in the prime chain, $p_l$. The prime chain is ordered such that $p_i < p_{i-1}$ to provide a simpler and more efficient scaling procedure, as shown in Algorithm 2. Note that our ordered prime chain removes the need for expensive RNS base extension (Omondi and Premkumar, 2007).
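To make the role of the ordered prime chain concrete, the following Python sketch shows a textbook way of dividing an RNS-represented integer by its last modulus. It is an assumption-laden illustration rather than the paper's Algorithm 2: the names `to_rns`, `from_rns` and `rns_rescale` are hypothetical, tiny primes replace the 30-bit ones, a single integer stands in for one polynomial coefficient, and the division rounds down instead of to the nearest integer.

```python
from math import prod
from typing import List


def to_rns(x: int, primes: List[int]) -> List[int]:
    """Residues of x modulo each prime of the chain."""
    return [x % p for p in primes]


def from_rns(residues: List[int], primes: List[int]) -> int:
    """Chinese Remainder Theorem reconstruction (used here only for checking)."""
    q = prod(primes)
    x = 0
    for r, p in zip(residues, primes):
        q_over_p = q // p
        x += r * q_over_p * pow(q_over_p, -1, p)
    return x % q


def rns_rescale(residues: List[int], primes: List[int]) -> List[int]:
    """Drop the last prime p_l and return the residues of floor(x / p_l)."""
    *kept, last = residues
    *kept_primes, p_last = primes
    # Because p_last is the smallest prime, `last` is already a valid residue
    # modulo every remaining prime, so no base extension is required.
    return [((r - last) * pow(p_last, -1, p)) % p
            for r, p in zip(kept, kept_primes)]


if __name__ == "__main__":
    primes = [13, 11, 7]                 # decreasing chain, last prime is dropped
    x = 500
    scaled = rns_rescale(to_rns(x, primes), primes)
    print(scaled, from_rns(scaled, primes[:-1]), x // 7)
```

Each residue is updated independently with native-width modular arithmetic, which is the property that makes the operation a good fit for one GPU thread per coefficient and modulus.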
Performance Evaluation

Datasets
We evaluate our method on various datasets covering three common text classification tasks: sentiment analysis, spam detection and topic classification. These datasets are the IMDB Movie Reviews, Yelp Reviews Binary Dataset, AG News Corpus and DBPedia Ontology Dataset. In the results listed in Table 1, we include both bi-gram and tri-gram terms in the input, use m = 500,000 to limit the vocabulary size, and set n = 50. Training (to generate the models for PrivFT inference) was done using stochastic gradient descent (i.e. mini-batch size = 1). Note that in contrast to ULMFit (Howard and Ruder, 2018), which pre-trains a language model on the large out-of-domain dataset (Wikitext-103), we only use the training dataset given in each task. On the other hand, gradient descent with large mini-batch size is used in PrivFT training. The generated models are used in the PrivFT inference task.

Algorithm 2 RNS RESCALE by a single RNS modulus.
Input: ciphertext ct = (c_0, c_1) at level l in RNS representation and l' = l − 1
Output: ct' = (c'_0, c'_1) = ⌊q_{l'}/q_l · ct⌉
1: for i = 1 to 2 do  ▷ For each polynomial in ct
2:   for ... do  ▷ For each coefficient
3:     for ... do  ▷ ... is used without RNS base extension.
4:       ...
5:     end for
6:   end for
7: end for
8: return ct'

CKKS Parameter Selection
The levelled CKKS includes a number of parameters that must be set appropriately to ensure both correctness and sufficient security level.First, the standard deviation of the discrete Gaussian distribution σ is set to 3.2 following the recommendations of the FHE standard (Albrecht et al., 2018).
There are three problem-dependent parameters: $L$, $\rho$ and $q_L$. In PrivFT, the inference task requires $L = 5$ to cater for three subsequent multiplications and vector rotations in the dot products. The training task, on the other hand, requires a larger depth of $L = 46$ levels to cater for the feed forward phase and computing the loss function, all multiplied by the number of epochs (2 in our case). The CKKS computation precision $\rho$ has a limited range in practice ($2^{20}, \ldots, 2^{60}$) and can be chosen experimentally. Finally, the size of $q_L$ in bits can be estimated heuristically as $|q_L| = L \times \rho$ bits. The final parameter that needs to be set is the ring dimension $N$, which affects both performance and the security level $\lambda$. $N$, which is a power of 2, has a very limited range $\{2^{10}, \ldots, 2^{17}\}$ in practical implementations. For each $N$, one can use the LWE estimator (Albrecht et al., 2015) to estimate the security level achieved. According to NIST recommendations (Barker et al., 2012), the minimum security level recommended for today's computing capacity is 80-bit.
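The heuristic $|q_L| = L \times \rho$ can be checked against the parameter sets reported for the two tasks. In the sketch below, the function name `modulus_bits` is hypothetical, and the precision values of 40 and 50 bits are the ones given with the inference and training experiments.

```python
def modulus_bits(levels: int, precision_bits: int) -> int:
    """Heuristic size in bits of the top-level modulus q_L: L * rho."""
    return levels * precision_bits


if __name__ == "__main__":
    # (L, rho in bits) for the inference and training tasks
    print("inference:", modulus_bits(5, 40), "bits")
    print("training: ", modulus_bits(46, 50), "bits")
```

The two results agree with the $\log_2 q_L$ values of 200 and 2300 used in the experiments, and the much larger training modulus is what forces the larger ring dimension there to keep the security level.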

CKKS Micro-Benchmarks and Comparison
In this section, we compare the performance of our GPU implementation of CKKS with Microsoft SEAL (SEAL) version 3.2, which implements an RNS variant of the CKKS scheme as well.We show the performance of basic CKKS primitives for various parameter settings.Experiments were performed on a server with an Intel Xeon Platinum 8170 CPU @ 2.10 GHz with 26 cores, and 188 GB RAM.For the GPU-accelerated implementation, we used 1 TESLA V100 (@ 1.380 GHz with 5,120 cores) and 3 P100 (each @ 1.189 GHz with 3,584 cores) NVIDIA cards each with 16 GB RAM.
Table 2 shows the latency of a number of CKKS arithmetic primitives in both SEAL and our GPU-accelerated implementation. It can be clearly seen that GPU-CKKS outperforms SEAL-CKKS in all primitives with various parameter settings. Speedup factors ranging from 1 to 2 orders of magnitude have been obtained. Of particular interest is the primitive HMULPLAIN, which is heavily used in the inference task to multiply the encrypted inputs with plaintext weights. In particular, HMULPLAIN showed 33.78× to 244.62× improvement over SEAL.

PrivFT Inference task
We implemented the inference task of PrivFT in both SEAL-CKKS and our GPU-accelerated CKKS. The CKKS parameters used in this experiment are $(N, \log_2 q_L, \rho) = (8192, 200, 40)$, providing a sufficient security level $\lambda > 80$ bit. Table 3 shows the latency (in seconds) of evaluating the inference task for one example from each dataset in both SEAL and our GPU implementation (here we also include the Youtube Spam Collection and Enron Email Dataset for benchmarking purposes). It can be clearly seen that the GPU implementation provides a quite reasonable run time (< 0.66 seconds) for practical applications. In conformity with the results from the preceding section, the GPU provides more than 12× speedup compared to the CPU solution. We emphasize that for this task, the prediction accuracy of PrivFT on encrypted data is the same as the fasttext accuracy on unencrypted data, which is shown in Table 1.

Message Size
The client is required to communicate $\lceil m/t \rceil$ ciphertexts to the server, i.e. $\lceil 500000/4096 \rceil = 123$ ciphertexts. The ciphertext size in bits can be estimated as $2 \cdot N \cdot \log_2 q_L$. Therefore, the total message size from the client to the server is 384.375 Mbit, i.e. about 48 MB (taking 1 Mbit = $2^{20}$ bits). After homomorphic inference, the server sends back to the client $c$ ciphertexts.
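The message-size estimate can be reproduced with a few lines of arithmetic. The sketch below assumes binary prefixes (1 Mbit = $2^{20}$ bits) and uses the inference parameters stated above; the function names `ciphertext_bits` and `upload_bits` are hypothetical.

```python
import math


def ciphertext_bits(N: int, log2_q: int) -> int:
    """Two polynomials of N coefficients, each coefficient log2_q bits wide."""
    return 2 * N * log2_q


def upload_bits(m: int, t: int, N: int, log2_q: int) -> int:
    """Total bits the client sends: one ciphertext per chunk of t vocabulary entries."""
    return math.ceil(m / t) * ciphertext_bits(N, log2_q)


if __name__ == "__main__":
    N, log2_q = 8192, 200
    t = N // 2                       # number of slots per ciphertext
    total = upload_bits(500_000, t, N, log2_q)
    print(total / 2**20, "Mbit")     # megabits, binary prefix
    print(total / 8 / 2**20, "MB")   # megabytes, binary prefix
```

The same two functions apply to the training parameters, where the larger ring dimension and modulus make each ciphertext far bigger.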

PrivFT Training task
The training task is much more computationally intensive than the inference task. To attain sufficient multiplicative depth and security level, we had to set the CKKS parameters to $(N, \log_2 q_L, \rho) = (65536, 2300, 50)$. The learning rate and batch size were set to 45 and 1007500 tokens, respectively. This large batch size limits the number of model aggregates, resulting in a training circuit with a shallow multiplicative depth. Under these parameters, we could not perform homomorphic training on GPU since the system memory (GPU and CPU) was not sufficient (the PrivFT training task requires more than 200 GB RAM). However, we managed to run the training task for the YouTube Spam Collection dataset with Microsoft SEAL on another system. Training took 266.4 hours (or 11.1 days) on a server equipped with an Intel Xeon CPU E5-2630 @ 2.20 GHz with 482 GB RAM. The accuracy of the generated model was the same as that generated by fasttext using gradient descent with large mini-batch, that is 86.3%.

Message Size
The client needs to encrypt each record in the dataset using $\lceil m/t \rceil$ ciphertexts. The ciphertext size in this task is 287.5 Mbit (about 35.9 MB, taking 1 Mbit = $2^{20}$ bits), i.e., the total message size will be $r \cdot \lceil m/t \rceil \cdot 287.5$ Mbit, where $r$ is the number of records in the training dataset. After training is done, the server sends back to the client $n(\lceil m/t \rceil + c)$ ciphertexts.

Conclusions
In this paper, we proposed PrivFT: a private and fast text classification solution on encrypted data using FHE. The main task of PrivFT was to perform inference on encrypted data using a pre-learned plaintext model. PrivFT can be used to implement several text classification applications such as sentiment analysis, spam detection, topic classification and document classification without compromising the privacy of the input data. We provided an efficient GPU implementation of a new FHE scheme (known as CKKS), compared its performance with an existing CPU implementation, and showed that 1 to 2 orders of magnitude speedup can be achieved. We implemented PrivFT in both CPU and GPU libraries and showed that PrivFT requires less than 0.66 seconds (on GPU) per inference on various datasets. We also showed how training a model on encrypted data can be done in PrivFT. As future work, we will try to improve the run time of the PrivFT training task by fitting it to GPUs. We expect this to provide about 45× speedup in light of the benchmarks in Table 2. This requires an efficient design to decompose the problem into smaller parts and ensure that GPUs are fully utilized.

Figure 1: Fully encrypted e-mail service with enabled spam detection.

Figure 2: PrivFT prediction task. Ptxt and Ctxt refer to plaintext and ciphertext, respectively. t denotes the number of slots in a plaintext or ciphertext. Shaded boxes are ciphertexts while clear ones are plaintexts.

Table 1: Accuracy (%) of our unencrypted fasttext model against the current state of the art (ULMFit (Howard and Ruder, 2018)).

Table 2: Latency (in milliseconds) of core CKKS homomorphic operations in Microsoft SEAL and our GPU-accelerated implementation.

Table 3: Latency (in seconds) of evaluating the PrivFT inference task with Microsoft SEAL 3.2 and our GPU-accelerated implementation for different datasets.