CloudDLP: Transparent and Scalable Data Sanitization for Browser-Based Cloud Storage

Browser-based cloud storage services are broadly used in enterprises for online sharing and collaboration. However, sensitive information in images or documents may easily be leaked outside trusted enterprise premises through such cloud services. Existing solutions for preventing data leakage in cloud storage services either limit many functionalities of cloud applications or are difficult to scale to various cloud applications. In this paper, we propose CloudDLP, a transparent and scalable approach for enterprises to automatically sanitize sensitive data in images and documents across various browser-based cloud applications. CloudDLP is deployed as an internet gateway within the premises of an enterprise and uses JavaScript injection techniques and deep learning methods to sanitize sensitive on-premises data. It neither compromises the user experience nor significantly affects application functionalities in browser-based cloud storage services. We have evaluated CloudDLP with a number of real-world cloud applications. Our experimental results show that it achieves automatic data sanitization with cloud storage services while preserving most functionalities of cloud applications.


I. INTRODUCTION
Browser-based cloud storage services, such as Dropbox [1], Box [2], and Salesforce [3], are still widely used in enterprises. They provide great convenience for online sharing and collaboration, save data storage costs, and maintain high data reliability, which brings many benefits to enterprises. However, sensitive information stored in these services may be leaked or shared outside trusted premises due to vulnerabilities or misuse in insecure clouds. For example, Google discovered a flaw in Google Drive that may grant unauthorized parties access to a subset of documents under certain circumstances [4]. Dropbox and Box were also affected by similar security issues, which can allow third parties to discover private file transfer links [5]. Moreover, researchers have identified several defects and misuses in Amazon Simple Storage Service (S3) that can leak military secrets [6] and private medical data [7]. Therefore, it is critical to protect sensitive information from leaking out of enterprise premises to insecure clouds.
Recently, various approaches have been proposed to prevent data leakage in browser-based cloud storage. Many studies [8]- [12] apply Data Loss Prevention (DLP) technology to detect and filter out sensitive information using massive sets of regular expressions, custom keywords, or domain-specific entity corpora. They mainly support textual data rather than images and complex documents. MessageGuard [13] creates a file upload overlay using HTML iFrames to protect data, at the cost of significantly reducing the functionalities of cloud applications; it is therefore not transparent to the diverse functionalities of cloud applications. Moreover, Cloud Access Security Broker (CASB) solutions [10], [14], [15] can protect sensitive information, but they are difficult to scale to various applications due to the need to reverse-engineer complex application-specific protocols.
In this paper, we present CloudDLP, a transparent and scalable solution to protect sensitive data for browser-based cloud applications. CloudDLP allows enterprises and users to upload encrypted or sanitized sensitive data while maintaining most of the original functionalities of the cloud applications. It is deployed as an internet gateway within the premises of an enterprise and sits between cloud applications and users. When users upload sensitive data files to a browser-based cloud application, CloudDLP transparently captures the data files in the requests; sensitive information in the data files is then detected and sanitized.
CloudDLP is designed to prevent data leakage to malicious or compromised cloud applications. A key challenge in developing CloudDLP is providing scalability across various cloud applications. Adapting CloudDLP to cloud applications one by one would be time-consuming and error-prone. To solve this problem, CloudDLP leverages a JavaScript injection technique that instruments JavaScript snippets into web application pages. Consequently, file requests containing sensitive data can be detected for various cloud applications.
To detect a wide range of sensitive information in images and documents that may be shared unintentionally, CloudDLP extracts textual data from images and documents with deep learning methods. CloudDLP classifies images and documents into different sensitivity levels and assigns appropriate protection policies. To preserve application functionality as much as possible, CloudDLP employs an automatic sanitization method based on deep learning. Compared to traditional DLP approaches, our automatic detection and sanitization method is transparent to cloud applications and shows superior performance. With CloudDLP, enterprises can effectively prevent data leakage without compromising many functionalities of cloud applications.
To validate the feasibility of CloudDLP, we have implemented and evaluated it on ten real-world browser-based cloud storage services, including email, storage, and office applications. The experiments on these applications demonstrate the effectiveness and scalability of the proposed approach. Moreover, the results show that the accuracy of image sanitization reaches 93.4%, which is acceptable for sensitive information extraction, and the accuracy of document sanitization reaches 97.9%. Therefore, CloudDLP is able to protect sensitive data across various browser-based cloud applications. This paper, which extends our previous paper [16], makes the following contributions:
• We propose CloudDLP, a system that provides transparent and scalable detection and sanitization of sensitive data for various browser-based cloud storage services.
• We propose deep learning methods, together with the JavaScript injection technique, to detect and sanitize sensitive data in images and textual documents, protecting sensitive data while maintaining the functionalities of cloud applications as much as possible.
• We have implemented and tested CloudDLP with various popular cloud applications. Our experimental results demonstrate the effectiveness of CloudDLP with these applications.
The remainder of the paper is organized as follows: we review related work in Section II, present CloudDLP in Section III, present the evaluation results and case studies in Section IV, and conclude the paper in Section V.

II. RELATED WORK
Many solutions to protect sensitive data in cloud storage services have been proposed. They fall into three categories: Data Loss Prevention, Cloud Access Security Broker, and other work.

A. DATA LOSS PREVENTION
Data Loss Prevention is a technology for detecting and preventing data breaches. Several tools such as SPIRION [8] and CUSpider [9] adopt pre-defined keywords and regular expressions to detect sensitive information. However, these methods not only require a large number of rules but also have low detection precision due to many false positives. Costante et al. [17] presented a data loss prevention framework combining signature-based and anomaly-based methods; the framework uses a machine learning algorithm to check user behaviors and builds a library of malicious behavior signatures. Gómez-Hidalgo et al. [18] proposed achieving data loss prevention with a Named Entity Recognition method; unfortunately, images cannot be protected by this approach. Ong et al. [19] introduced a system that detects sensitive information in documents via a deep learning model; the detection model predicts sensitive information based on semantic context analysis. These systems are unable to detect sensitive information in images or other non-document data, and their detection performance makes them difficult to apply in practice.

B. CLOUD ACCESS SECURITY BROKER
CASB [14] is advocated by Gartner as a leading technology in cloud security. It is a proxy-based approach, offered by cybersecurity companies such as CipherCloud [10] and Skyhigh Networks [15], to protect crucial data. A CASB is deployed as a proxy that sits between cloud applications and users. The proxy intercepts, detects, and encrypts sensitive data through protocol analysis. However, CASBs have to adapt to various cloud applications by reverse-engineering their network protocols, which is time-consuming and labor-intensive. Therefore, this approach is difficult to scale to various applications.

C. OTHER WORK
He et al. developed ShadowCrypt [11], which runs as a browser extension to provide textual data protection for existing cloud applications. However, ShadowCrypt cannot support data files such as documents and images. Mimesis Aegis [12] is applied to mobile applications and provides sensitive data isolation through a conceptual layer; it also cannot protect data files. CryptDB [20] is deployed as a proxy between the database server and the application server to encrypt confidential user data. It can effectively prevent a curious database administrator from learning private data, but it applies only to databases. Mylar [21] protects application data based on the Meteor JavaScript framework for web developers. However, Mylar lacks compatibility and cannot support data analysis. Virtru [22] is an email encryption system for avoiding webmail leakage; it is suitable for email services but difficult to scale to new applications.

III. SYSTEM ARCHITECTURE
A. DESIGN GOALS
In this paper, we aim to develop methods that protect sensitive information in images and documents from leaking out of enterprise premises to insecure cloud storage via browser-based interfaces. Developing such an approach for browser-based cloud storage services is not an easy task, due to the following requirements:
1. Security. The solution needs to ensure that sensitive data is protected before leaving enterprise premises for the cloud.
2. Usability. The method needs to preserve user experiences and application functionalities, such as document previewing and editing, as much as possible.
3. Scalability. The solution must be easy to maintain and highly scalable to various applications.

B. THREAT MODEL
CloudDLP should be deployed at the edge of an enterprise network, where mainstream security policies can keep most attackers out. We assume that the internal enterprise network is secure and trustworthy, but that cloud storage service providers may be malicious: they may compromise the security of valuable user data because of commercial interests or legal requirements, or because they themselves are compromised. Meanwhile, the client-side application code and middleware on the network channel may also be exploited to exfiltrate sensitive information. Additionally, we assume the operating systems, browsers, and network devices inside the enterprise firewall are trusted, which is a practical trusted zone in most organizations. CloudDLP does not protect against side-channel attacks.

C. CloudDLP ARCHITECTURE
In this section, we present the architecture of CloudDLP and briefly describe the functionality of each component. The architecture of CloudDLP is shown in Fig. 1. CloudDLP consists of five components: (a) Interceptor, (b) Parser, (c) Classifier, (d) Sanitizer, and (e) Packer. 1) Interceptor. It intercepts HTTP/HTTPS traffic from users on the premises to the cloud. The Interceptor can also act as a ''man-in-the-middle'' to intercept TLS connections, which is authorized by the enterprise. By intercepting the traffic between users and the cloud, the Interceptor can effectively prevent sensitive information from leaking outside the enterprise. Moreover, it is also responsible for injecting the JavaScript snippets into the web pages of cloud applications. The snippets, which override native JavaScript APIs, can intercept all XMLHttpRequest calls and identify file uploading requests. 2) Parser. The Parser parses application protocols, analyzes the semantic content of each protocol, and extracts file content. It obtains the data buffered for a user session and analyzes the request content format (including key-value, multi-part, etc.) to extract document fields.
3) Classifier. The Classifier is responsible for sensitive data categorization: it categorizes stored data by sensitivity and business impact and applies protection policies at different levels. The Classifier adopts deep learning models to classify confidential data. 4) Sanitizer. It is responsible for detecting and sanitizing sensitive information in images and documents. This module minimizes the risk of data leakage while keeping documents and images usable, preserving functionalities such as document editing, document preview, and thumbnail preview. We develop methods that identify sensitive data in images and textual documents via deep learning models: a scene text reading approach based on convolutional neural networks (CNN) and recurrent neural networks (RNN) detects sensitive data in images, and the discovery of sensitive data in text is cast as Named Entity Recognition in natural language processing (NLP). The Sanitizer thus ensures the detection and redaction of private information while preserving the original functionality. 5) Packer. The Packer assembles the redacted images and documents into HTTP/HTTPS requests. CloudDLP then sends the requests, which contain protected data, to the cloud server.
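As a rough illustration of the Parser component, the sketch below extracts an uploaded file from a multipart/form-data request body using Python's standard-library MIME parser. This is a simplified assumption about one of the request formats the Parser handles (real requests may also use key-value or other encodings, and CloudDLP's actual parser is written differently):

```python
from email.parser import BytesParser
from email.policy import default

def extract_upload(content_type, body):
    """Parse a multipart/form-data request body and return
    (filename, file_bytes) for the first file field found.
    A toy sketch of the Parser step, not the real implementation."""
    # The stdlib MIME parser understands multipart bodies once we
    # prepend the Content-Type header line.
    msg = BytesParser(policy=default).parsebytes(
        b"Content-Type: " + content_type.encode() + b"\r\n\r\n" + body
    )
    for part in msg.iter_parts():
        filename = part.get_filename()
        if filename:
            return filename, part.get_payload(decode=True)
    return None

# Hypothetical upload request body for demonstration.
body = (
    "--xYzBoundary\r\n"
    'Content-Disposition: form-data; name="file"; filename="note.txt"\r\n'
    "Content-Type: text/plain\r\n\r\n"
    "patient: John Doe\r\n"
    "--xYzBoundary--\r\n"
).encode()
name, data = extract_upload('multipart/form-data; boundary="xYzBoundary"', body)
```

The extracted `data` would then be handed to the Classifier and, if flagged, to the Sanitizer.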
The CloudDLP workflow is shown in Algorithm 1. When an enterprise user accesses a cloud service through CloudDLP, CloudDLP receives the requests. The Interceptor component injects the JavaScript snippets into the web pages; the snippets are instrumented at the head of the pages. Native JavaScript APIs such as XMLHttpRequest and FileReader are overridden by the snippets. Then, the snippets monitor the connections and file operations invoked by JavaScript. File requests are identified by the associated API functions and object methods. Next, the Parser parses the request format and extracts the file contents. If the Classifier marks a data file as sensitive, it is sanitized by automatic image sanitization or automatic document sanitization. Finally, the Packer reassembles the sanitized files into the associated requests and forwards them to the cloud.
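The workflow above (Algorithm 1) can be sketched as the following self-contained Python fragment. The classifier and sanitizer here are deliberately trivial placeholders (a regex over ID-like numbers); the real system uses the deep-learning components described later, and all names are illustrative:

```python
import re
from dataclasses import dataclass

@dataclass
class Connection:
    """A user connection passing through the gateway (simplified)."""
    request_headers: dict
    request_body: str
    is_html_response: bool = False
    response: str = ""

JS_SNIPPET = "<script>/* override XMLHttpRequest / FileReader here */</script>"

def inject_js_snippet(html):
    # Interceptor: instrument the head of the page with the snippet.
    return html.replace("<head>", "<head>" + JS_SNIPPET, 1)

def is_file_upload(c):
    # Parser-side check: treat multipart requests as file uploads.
    return "multipart/form-data" in c.request_headers.get("Content-Type", "")

def classify(text):
    # Placeholder Classifier: flag anything containing an ID-like number.
    return re.search(r"\b\d{6,}\b", text) is not None

def sanitize(text):
    # Placeholder Sanitizer: redact ID-like numbers in place.
    return re.sub(r"\b\d{6,}\b", "[REDACTED]", text)

def handle(c):
    """Algorithm 1: inject snippets into HTML pages; parse, classify,
    sanitize, and repack file uploads; then forward to the cloud."""
    if c.is_html_response:
        c.response = inject_js_snippet(c.response)
    if is_file_upload(c) and classify(c.request_body):
        c.request_body = sanitize(c.request_body)  # Sanitizer + Packer
    return c  # forwarded to the cloud by the gateway
```

A non-sensitive upload passes through unchanged, which mirrors how the real Classifier avoids unnecessary sanitization work.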

D. AUTOMATIC IDENTIFICATION OF FILES
To fulfill the specified design goals, CloudDLP automatically supports both new and existing applications by recognizing file requests to the cloud. In this section, we first discuss two unsatisfactory methods and then present automatic identification of files via dynamic analysis of JavaScript.
Strawman 1: The traditional method leverages regular expression matching to recognize file upload requests. Facing a large number of cloud applications, CloudDLP would need to analyze application protocols and maintain a large number of matching rules. In addition, once an application protocol changes, the corresponding rules need to be updated in time. Therefore this method is laborious and fragile in practice.
Strawman 2: We could adopt inter-procedural string analysis [23] to extract the corresponding file upload APIs from the JavaScript code of cloud applications. In practice, it is difficult for this method to extract accurate file APIs, since JavaScript code is usually compressed and obfuscated by the cloud providers. When the inter-procedural string analysis cannot extract the file operation APIs, the proposed system fails.
In reality, file transmission in cloud applications is almost always accompanied by invoking an XMLHttpRequest object. All contemporary browsers have a built-in XMLHttpRequest object, which defines a programming interface for data file transfer. XMLHttpRequest provides several ways to access a data file through file objects such as Blob [24], File, FormData [25], or ArrayBuffer [26]. A file to be uploaded is stored in one of these JavaScript objects: Blob and File represent immutable file data, FormData can be composed of Blob/File objects, and ArrayBuffer represents the raw binary data read from a file. CloudDLP hooks and overrides the XMLHttpRequest JavaScript API by injecting JavaScript snippets. The snippets check the argument types of the API to identify file requests. It is then easy to monitor, intercept, and modify requests issued through XMLHttpRequest.
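To make the hooking idea concrete, the sketch below shows how a gateway might instrument a page: the injected JavaScript (held here as a string, since CloudDLP's actual snippet is not published) wraps `XMLHttpRequest.prototype.send` and checks whether the body is a Blob, FormData, or ArrayBuffer, so file uploads can be recognized. The header name and placement logic are illustrative assumptions:

```python
# Hypothetical hook: a File is also a Blob, so `instanceof Blob`
# covers both. The snippet tags file uploads for the gateway.
XHR_HOOK = """
<script>
(function () {
  var nativeSend = XMLHttpRequest.prototype.send;
  XMLHttpRequest.prototype.send = function (body) {
    var isFile = body instanceof Blob || body instanceof FormData ||
                 body instanceof ArrayBuffer;
    if (isFile) {
      // Mark the request so the gateway routes it through sanitization.
      this.setRequestHeader('X-CloudDLP-File', '1');
    }
    return nativeSend.call(this, body);
  };
})();
</script>
"""

def instrument(html):
    """Insert the hook at the head of the page, before any
    application script has a chance to capture the native API."""
    i = html.find("<head>")
    if i == -1:
        return XHR_HOOK + html
    i += len("<head>")
    return html[:i] + XHR_HOOK + html[i:]
```

Placing the snippet before any application code is what makes the override reliable even when the application's own JavaScript is obfuscated.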

E. AUTOMATIC IMAGE SANITIZATION
To meet our security and usability goals, CloudDLP needs to detect sensitive information hidden in images. To this end, CloudDLP adopts a scene-text reading solution for automatically detecting sensitive information. The solution is divided into two phases: text detection and text recognition. The text detection phase locates the areas of sensitive text in the images. The text recognition phase transforms the sensitive text areas detected in the previous phase into editable text which can be used for text processing. The model structure is presented in Fig. 2.
Text Detection Process: Most text detection models are designed to detect extremely long texts [27]- [29] or multi-oriented text [30]- [32]. Since the Connectionist Text Proposal Network (CTPN) model [33]- [35] performs well on sensitive texts in images, which are usually horizontal and low-resolution, CloudDLP applies this model to text detection. The process of text detection is as follows: (1) An input image, which contains sensitive information, is divided into small blocks of 16 × 16 pixels. (2) VGG16, consisting of five convolutional stages, extracts features from the image; the size of the feature map is W × H × C. (3) An additional convolution layer converts the features into sliding windows. Each sequential window is then connected to a Bi-directional LSTM (BiLSTM) for analyzing the feature sequences; BiLSTM is designed for processing sequence data and combines all the features. (4) The BiLSTM layer is connected to a 512-dimension fully-connected layer. The output layer with 60 hidden units predicts the center point coordinate and fixed width of each proposal, together with scores indicating whether the proposal is text or non-text.
Text Recognition Process: The output of some text recognition models is often corrected by a word dictionary called a 'lexicon'.
Although a lexicon can improve the recognition accuracy to some extent, there is no preset sensitive-data dictionary in the cloud data protection scenario, and sensitive information is context-dependent. Therefore, this phase combines CNNs and RNNs into a single network, referred to as a Convolutional Recurrent Neural Network (CRNN) [36]. The model is dedicated to recognizing horizontal texts and performs well without a lexicon. The CRNN consists of four parts: (1) All of the text regions are scaled to a height of 32 pixels by bilinear sampling.
(2) At the top of the architecture, the scaled text regions are fed to convolutional layers that extract features and transform them into sequential vector representations. The max-pooling windows of the CNN are 1 × 2, in order to derive feature sequences directly from the text regions. (3) The features of the last convolutional layer are fed to a bidirectional LSTM layer. (4) The final recurrent layer outputs are read by a single feedforward layer, which estimates a label for each vector and outputs a label sequence. The label sequence is decoded by Connectionist Temporal Classification (CTC) [37] to produce the final result. The final result, which is the recognized text, is then handled by the Named Entity Recognition model described in Section III-G when it belongs to a sensitive entity.
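The CTC decoding step in (4) can be illustrated with the simplest (greedy) variant: pick the highest-scoring label at each time step, collapse consecutive repeats, and drop the blank symbol. This is a sketch of the standard CTC best-path decoding, not necessarily the exact decoder used in CloudDLP:

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC decoding: best label per time step, collapse
    repeats, drop blanks. frame_probs is a T x L list of per-frame
    label scores; alphabet[blank] is the CTC blank symbol."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], blank
    for label in best:
        # Emit only on a label change, and never emit the blank.
        if label != blank and label != prev:
            out.append(alphabet[label])
        prev = label
    return "".join(out)
```

Note how the blank makes repeated characters recoverable: the frame sequence `a, blank, a` decodes to `"aa"`, while `a, a` collapses to a single `"a"`.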

F. TEXTUAL DOCUMENT CLASSIFICATION
Textual document classification provides a convenient way for CloudDLP to determine and assign a proper protection policy to the data it processes. Content-sensitive and important documents need to be encrypted or sanitized, while insensitive and less-sensitive documents can be released without any processing. This greatly reduces the number of documents CloudDLP processes and improves its performance. To tackle this problem, we adopt TextCNN [38] for text topic classification, and then grade the sensitivity of a document based on its topic. The TextCNN model architecture is shown in Fig. 3. A document consists of multiple sentences, and the i-th word in a sentence is represented by a k-dimensional word vector x_i ∈ R^k. It is worth briefly mentioning that the word vectors are learned from [39]. A sentence of length n is represented as x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n, where ⊕ is the concatenation operator. A convolution operation involving a filter is applied to each window of h words x_{i:i+h−1} to produce a new feature c_i; applying the filter to every window in a sentence produces a feature map. A max-over-time pooling operation over the feature map takes the maximum value. These features are passed to a final fully-connected softmax layer, which outputs the probability distribution over labels.
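The convolution-plus-max-over-time step can be written out for a single filter in plain Python. This toy sketch (real TextCNN uses many filters of several window widths, trained end to end) shows exactly what c_i and the pooled feature are:

```python
def textcnn_feature(sentence_vecs, filt, h):
    """One TextCNN filter: slide a window of h word vectors over the
    sentence, compute c_i = ReLU(w . x_{i:i+h-1} + b), then take the
    max over time. `filt` is (w, b) with w flattened to length h*k."""
    w, b = filt
    feats = []
    for i in range(len(sentence_vecs) - h + 1):
        # Concatenate the h word vectors in this window (x_{i:i+h-1}).
        window = [v for vec in sentence_vecs[i:i + h] for v in vec]
        c = sum(wi * xi for wi, xi in zip(w, window)) + b
        feats.append(max(0.0, c))          # ReLU nonlinearity
    return max(feats)                      # max-over-time pooling
```

One pooled scalar per filter is what gets concatenated and fed to the softmax layer.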

G. AUTOMATIC TEXTUAL DOCUMENT SANITIZATION
One promising way to protect files uploaded to the cloud is encryption. However, file encryption causes cloud application functionality such as online document editing and preview to fail. Document sanitization, by contrast, can redact sensitive data while preserving the integrity of the document structure. Automatically sanitizing sensitive information, however, is a significant challenge.
CloudDLP adopts BERT (Bidirectional Encoder Representations from Transformers) [39], published by researchers in the Google AI Language group, to identify sensitive entities in categories such as person names, organizations, and places. BERT is pre-trained on a large corpus of unlabelled text, including the entire Wikipedia (about 2,500 million words) and a book corpus (800 million words). We apply BERT to automatic textual document sanitization via transfer learning on our datasets, employing a pre-trained BERT with a Conditional Random Field (CRF) architecture for sensitive entity recognition. The network architecture is illustrated in Fig. 4. The model is composed of BERT and a CRF. A document is broken down into sentences and split into tokens. The input to BERT is a sequence constructed by summing the corresponding token, segment, and position embeddings. BERT is a multi-layer bidirectional Transformer encoder, consisting of 12 Transformer blocks with 12 self-attention heads. The Transformer processes each word in relation to all the other words in a sentence, rather than one by one in order; it can thus learn richer semantic information and performs well on the sensitive entity recognition task. BERT is followed by a linear-chain conditional random field (CRF) [40], which searches for the labeling sequence with the highest probability; the most likely sequence is obtained by Viterbi decoding of the CRF.
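The Viterbi decoding step of a linear-chain CRF can be sketched in a few lines. Here `emissions[t][y]` stands for the per-token label score produced by BERT and `transitions[p][y]` for the learned score of label p followed by label y; both are assumed to be in the log domain. This is the standard algorithm, not CloudDLP's specific implementation:

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence for a linear-chain
    CRF, given T x L emission scores and L x L transition scores."""
    n_labels = len(emissions[0])
    score = list(emissions[0])        # best score ending in each label
    back = []                         # backpointers per time step
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            ptr.append(best_prev)
        score = new_score
        back.append(ptr)
    # Backtrack from the best final label.
    y = max(range(n_labels), key=score.__getitem__)
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))
```

The transition scores are what distinguish the CRF from per-token argmax: a strongly negative transition can veto a label sequence even when the emission score favors it.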

H. CloudDLP IMPLEMENTATION
We have implemented CloudDLP as an enterprise gateway. CloudDLP has been deployed in several companies to prevent data leakage in practice. The CloudDLP proxy is implemented in C++, based on a popular open-source proxy, Squid [41]. Squid is a caching proxy that supports HTTP/HTTPS and other protocols. The JavaScript snippets monitor the JavaScript XMLHttpRequest API to identify file requests; they are implemented in JavaScript and override native JavaScript APIs. Components such as the Parser, Sanitizer, Classifier, and Packer are implemented as microservices, which are easy to scale to support highly concurrent tasks. The models for automatic image sanitization, textual document classification, and automatic document sanitization are implemented using the TensorFlow [42] and PyTorch [43] toolkits. In the future, all code of the system will be open source and publicly available.

IV. EXPERIMENTAL EVALUATION
In this section, we first evaluate the performance of the image sanitization, textual document classification, and textual document sanitization processes. We focus on evaluating the accuracy and the time cost of these models. Then we discuss the effectiveness of CloudDLP on a wide variety of popular cloud applications.

A. IMAGE SANITIZATION PERFORMANCE
In this experiment, the text detection phase and text recognition phase are trained separately. The datasets for training and evaluating the text detection model are RCTW-17 (ICDAR2017 Competition on Reading Chinese Text in the Wild) [44] and ICPR Text Detection [45]. The text recognition model is trained with synthetic datasets built for actual data protection scenarios. The synthetic datasets are generated from suitable fonts, backgrounds, and corpora, which makes them more realistic. In these models, the convolutional neural networks are trained by Stochastic Gradient Descent (SGD) and the recurrent neural networks are trained by Back-Propagation Through Time (BPTT). ADADELTA [46] is applied to control the learning rate and batch normalization is used to speed up the training process. The details of the network parameters are presented in Fig. 5, where s is the step size, p is the padding size, c is the number of channels, and k is the kernel size of each convolution layer.
The experiments evaluate the performance of the text detection model via the DetEval algorithm [47] and the accuracy of the text recognition model by computing edit distance. These experiments are evaluated on three mutually-exclusive image datasets collected from three different widely used cloud applications. The results are presented in Table 1. The accuracy of the automatic image sanitization model is about 93.4% in the data protection scenario. The average detection latency per time step is 4.686 s and the average recognition latency per time step is 0.415 s, which is acceptable in practice. The actual effect of the automatic image sanitization component is shown in Fig. 6: it can detect and sanitize sensitive information such as person names, ages, and ID numbers in medical images.
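The edit-distance metric used above is the standard Levenshtein distance between recognized and ground-truth strings; the sketch below also derives a character-level accuracy from it (the normalization by the longer string is our illustrative choice, not necessarily the exact formula used in the experiments):

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program:
    the minimum number of insertions, deletions, and substitutions
    turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def recognition_accuracy(pred, truth):
    """Character-level accuracy derived from edit distance."""
    if not truth:
        return 1.0 if not pred else 0.0
    return 1.0 - edit_distance(pred, truth) / max(len(pred), len(truth))
```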

B. TEXTUAL DOCUMENT CLASSIFICATION PERFORMANCE
We scraped public document data from websites and labeled the documents into seven categories: ''Sports'', ''Technology'', ''Stocks'', ''Finance'', ''Government Affairs'', ''Law'', and ''Entertainment''. Each category has 80,000 training samples, 10,000 validation samples, and 10,000 test samples. Next, we deleted the special characters in the datasets, segmented the words with the Jieba word segmentation tool, and removed the stop words. In addition, we initialized word vectors with publicly available word vectors trained for 40 epochs over a 3.3-billion-word corpus from Google, which is a popular method to improve performance. The experiment computes precision, recall, and F1 as evaluation metrics. We use N_1 to denote the number of samples correctly identified as category c, N_2 the number of samples identified as category c, N_3 the total number of samples over all categories, and N_4 the total number of correctly identified samples over all categories. The metrics can be calculated as Precision = N_1/N_2 for a single category, Micro_Precision = N_4/N_3 over all categories, and F1 = 2 · Precision · Recall/(Precision + Recall). Results of our model are listed in Table 2. Our model achieves 0.9762 Micro_Precision on textual document classification.
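Micro-averaging over categories can be sketched as follows, using per-category counts that mirror the N_1/N_2-style quantities in the text (the exact count definitions here are our assumption; for single-label classification where every sample gets exactly one prediction, micro-precision equals micro-recall):

```python
def micro_metrics(confusion):
    """Micro-averaged precision/recall/F1 from per-category counts.
    `confusion` maps category -> (n_correct, n_predicted, n_actual)."""
    correct = sum(c for c, _, _ in confusion.values())
    predicted = sum(p for _, p, _ in confusion.values())
    actual = sum(a for _, _, a in confusion.values())
    precision = correct / predicted
    recall = correct / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for two of the seven categories.
p, r, f = micro_metrics({"Sports": (90, 100, 95), "Law": (80, 90, 95)})
```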

C. TEXTUAL DOCUMENT SANITIZATION PERFORMANCE
The 2014 i2b2 de-identification challenge dataset [48] is used to train and evaluate the sensitive entity recognition model. CloudDLP needs to detect and redact protected health information (PHI) in medical records. The i2b2 PHI categories are presented in Table 3. The dataset has 1,304 patient progress notes for 296 diabetic patients, and consists of 56,348 sentences with 984,723 separate tokens. 41,355 of the tokens are PHI tokens, representing 28,867 separate PHI instances. We use the standard metrics of precision, recall, and F1, as defined in Equations 7-9, to measure the performance of the model. We use N_1 to denote the number of correctly identified PHI tokens, N_2 the number of tokens identified as PHI, and N_3 the total number of PHI tokens; then Precision = N_1/N_2, Recall = N_1/N_3, and F1 = 2 · Precision · Recall/(Precision + Recall).
The BERT-CRF model has 12 Transformer-block layers, a hidden size of 768, 12 self-attention heads, and 110 million parameters. The experiment is evaluated on the i2b2 PHI categories using token-level labels. Table 4 summarizes the performance of our model on the i2b2 PHI categories. The model achieves a precision of 0.9796 and a recall of 0.9836. Moreover, Fig. 7 shows the actual effect of the automatic textual document sanitization model, which detects and sanitizes sensitive information.

D. CASE STUDY
We tested CloudDLP on a wide variety of typical cloud applications that store data files, which is the focus of CloudDLP. The contribution of CloudDLP is to provide transparent and scalable detection and sanitization, putting enterprises back in control of their data. In this experiment, we tested CloudDLP and observed the potentially affected functionalities, including search, online document editing, document previews, and thumbnail previews. We are able to use CloudDLP to protect sensitive critical data while retaining the prominent functionality of these applications. We tested CloudDLP with typical browser-based cloud storage applications such as Gmail, Dropbox, Box, OneDrive, Google Drive, and Mega.nz. Sensitive information in files uploaded to the cloud can be successfully detected and sanitized. Sharing of documents, document previewing, and editing can still be used normally, since the sensitive information is sanitized without destroying the usability of the documents or images, and the cloud applications can still parse and handle the protected files. When using CloudDLP with office applications, namely Salesforce, Google Docs, and Slack, the uploaded files can be sanitized and protected by CloudDLP. Since numeric data is sanitized, importing the protected files into Salesforce may affect statistical analysis in trading reports.

V. DISCUSSION
We note several limitations of the technique. CloudDLP works for browser-based cloud storage applications, and is ineffective if client-based cloud applications are used. In addition, the sanitization may be reversed under certain circumstances. We will consider more secure sanitization methods in future work.

VI. CONCLUSION
In this paper, we present CloudDLP, a transparent and scalable approach to protect sensitive data for the browser-based cloud applications. It can help enterprises effectively prevent critical data leakage. In addition, our experimental results show that it can achieve automatic data sanitization with cloud storage services while preserving most functionalities of cloud applications. Moreover, CloudDLP can support practical use in many real-world applications.
ZEKUN CAO was born in Anqing, Anhui, China, in 1995. He received the B.E. degree in computer science and technology from Anhui University, Anhui, in 2017. He is currently pursuing the M.S. degree in information security with the University of Chinese Academy of Sciences, Beijing. His research interests include machine learning and data security.
BINXING FANG was born in Wanning, Jiangxi, China, in 1960. He received the M.S. degree in computer science and technology from Tsinghua University, Beijing, in 1984, and the Ph.D. degree in computer science and technology from the Harbin Institute of Technology, Harbin, in 1989. He is currently a member of the Chinese Academy of Engineering. His current research interests include computer networks, information and network security, and content security.