Review: Privacy-preservation in the context of Natural Language Processing

Data privacy has become one of the most widely discussed issues in recent years, as data breaches and privacy scandals occur frequently. This raises many concerns about how data is acquired and about potential information leaks. The privacy of data can be violated in several ways when it is used in Artificial Intelligence (AI) models. A considerable portion of user-contributed data is in natural language, and in the past few years, many researchers have proposed NLP-based methods to address these data privacy challenges. To the best of our knowledge, this is the first interdisciplinary review discussing privacy preservation in the context of NLP. In this paper, we present a comprehensive review of previous research on the techniques and challenges of building and testing privacy-preserving systems in the context of Natural Language Processing (NLP). We group the different works under four categories: 1) Data privacy in the medical domain, 2) Privacy preservation in the technology domain, 3) Analysis of privacy policies, and 4) Privacy leak detection in the text representation. This review compares the contributions and pitfalls of the various privacy violation detection and prevention works that use NLP techniques, to help guide a path ahead.


I. INTRODUCTION
Data privacy is a highly discussed issue, and we encounter data breaches and privacy scandals in our day-to-day life. This is mainly due to the exponentially increasing collection of data and the use of that data in various applications and research. This raises many concerns about the ways data is acquired and about potential information leaks. Risks of private or sensitive information leaks arise in many settings. The introduction of Machine Learning (ML) models has sharply increased the use of vast amounts of data for training Artificial Intelligence (AI) models [1]. There are many ways in which the privacy of the data can be violated when it is used in AI models; for example, an adversary could eavesdrop on the latent representation of the input in an ML model and obtain sensitive information. Therefore, interest in privacy-preserving data mining and privacy-preserving data analysis techniques, which protect individual information, has increased in recent years. Preserving the privacy of training data for ML models is essential to guarantee data security and to maintain the user trust needed for continued access to the data that improves the performance of the models [1].
The sensitivity of the data can be categorized as 1) implicit information and 2) explicit information. When the information is directly derived from a user's query (e.g., web search), it is called implicit information (e.g., age, gender). In contrast, when the information is derived using pattern matching, it is called explicit information (e.g., Personal Identification Number (PIN), Social Security Number (SSN)) [1]. Traditional privacy protection methods are unable to handle this growing need to protect data. Unlike AI models, they are very time- and resource-consuming. Therefore, it is necessary to build systems that not only provide such privacy assurances but also do so with increased automation and reliability [2]. The medical field carries a high risk of exposing private details, since the records hold each patient's entire history. Medical records are at risk of exposure while stored in online databases or shared between institutions. Other areas highly susceptible to privacy leakage are social media networks, applications, and software. In the past decade, we have seen enormous growth in people's interest in using social media networks, and often they do not realize the threat social media pose. Most privacy policies used by software and apps are long and verbose, and some developers exploit this situation to collect and misuse the personal information of the users [3].
Natural Language Processing (NLP) is a field that combines linguistics and computer science to analyze and understand meaning from human language. NLP is used in many applications we see in our day-to-day life, such as chatbots, voice assistants, and search engines. Researchers have proposed many techniques for solving privacy-related issues and preserving privacy in the past few years, including quantum cryptography, adversarial ML, and access control techniques. A considerable portion of user-contributed data comes from natural language (e.g., text and voice recordings), including private user data. Recently, many researchers have started to apply NLP-based methods to address data privacy challenges, resulting in an intersection of NLP and privacy [1]. This makes privacy a well-motivated application domain for NLP researchers. Also, to the best of our knowledge, there is currently no interdisciplinary review discussing the intersection of privacy preservation and NLP.
This paper provides an overview of past works where NLP was used to identify privacy leaks, help build a system for privacy preservation, and identify techniques and challenges of building and testing privacy-preserving systems. The motivation for our review is to gain an understanding of the utilization of NLP in the privacy field. We divide the different applications into four categories: 1) Data privacy in the medical domain, 2) Privacy preservation in the technology domain, 3) Analysis of privacy policies, and 4) Privacy leak detection in the text representation. The remainder of this review is structured as follows. First, we discuss the different approaches under the four categories mentioned above. Then we present a table summarizing all the works related to privacy in NLP and the future directions we propose. Finally, we close with a summary of the review.

II. DATA PRIVACY IN MEDICAL DOMAIN
Protected Health Information (PHI) is the information in medical records or information systems that can be used to identify patients. Some examples of PHI are patient name, phone number, physician name, and medication history. As the medical field advances, there is a growing need to share medical records between institutions. Sharing data can improve clinical decision support systems, big data medical research, and treatment quality assurance [4]. However, one of the biggest challenges is sharing and disseminating medical records while maintaining a commitment to patient confidentiality [5]. There is an ethical and legal responsibility to respect individuals' privacy, which has led to the introduction of specific laws that address this issue, such as the European Union's General Data Protection Regulation (GDPR) and the United States' Health Insurance Portability and Accountability Act (HIPAA) [6].
To secure patients' privacy, PHI must be anonymized before it is sent to another institution. Many efforts have been devoted to this endeavor, including both manual and automatic approaches [7]. Due to the recent exponential growth in the literature, the cost of manually anonymizing large data is exceptionally high. Therefore, there is an increased interest in automating the anonymization procedure through the use of NLP techniques. Anonymization is considered a complex task due to the unstructured nature of clinical notes.
Here we divide the proposed systems into three categories: Rule-based, ML-based, and Deep Learning (DL)-based systems. Each has both advantages and disadvantages. Rule-based systems utilize rules and patterns to represent knowledge. They include regular expressions and pattern matching and are easy to build and maintain. However, they require tedious manual labor by domain experts to generate and update the rules [8]. ML-based systems use machine learning algorithms and statistical analysis for knowledge representation. Machine learning approaches include Hidden Markov Models (HMM), Conditional Random Fields (CRF), Maximum Entropy Models (MaxEnt), Support Vector Machines (SVMs), Naïve Bayes (NB), and Random Forests (RFs) [9]. These have an advantage over rule-based systems in that they do not require manual rules or expert knowledge, but they require labeled data for training and typically require manual feature engineering. Recently, DL-based systems have obtained very high performance across many NLP tasks and do not require manual feature engineering. Two common techniques are Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). CNNs can capture continuous local features of sequences through the convolution operation, whereas RNNs capture long-term dependencies through the recurrent process. Long Short-Term Memory (LSTM) is an RNN that has brought more flexibility in controlling the outputs. Bi-directional Long Short-Term Memory (Bi-LSTM) is an extension of the LSTM that consists of two LSTMs and controls the flow from both directions. In this section, we describe previous works within each of these categories.

A. RULE-BASED SYSTEMS
Earlier systems used rule-based or template-based approaches to match patterns and detect PHI in clinical notes. For example, Sweeney, et al. [5], Berman, et al. [10], and Beckwith, et al. [11] proposed scrub systems or tools for anonymization. Sweeney, et al. [5] proposed the Scrub system for anonymization, which was evaluated with two approaches to identifying PHI: a computer-based approach, which used detection algorithms competing in parallel to label the identifiers, and a human-based approach, in which five individuals with no medical experience or experience with the information contained in the database used a template and a set of rules to identify PHI. Berman, et al. [10] used a concept-based scrub algorithm for a similar problem. The algorithm works as follows: when it encounters a nomenclature term, it replaces the term with the nomenclature code and a synonym of the original term; when it encounters any other type of word, it replaces the word with asterisks. This method was considered safe because its output contains only medical terms. Beckwith, et al. [11] designed an open-source software tool to de-identify patient information in electronic medical records, including pathology reports, using a three-step process: look for identifiers associated with the patient, predict patterns likely to represent identifying data, and compare with a database of proper names and geographic locations. Recently, Iwendi, et al. [12] proposed a semantic privacy framework named N-Sanitization that sanitizes the sensitive and semantically related terms in healthcare documents. First, they used dictionaries, regular expressions, and the Stanford NER Tagger to detect as many PHIs and sensitive terms as possible. Then they used a medical ontology (knowledge base) named SNOMED-CT to sanitize the previously detected sensitive terms by substituting them with their generalized terms. They removed the negative sentences (assertions) from documents before the sanitization process.
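As a minimal sketch of the rule-based idea described above (regular expressions plus a dictionary lookup), the following Python fragment replaces a few identifier types with category tags. The patterns, the `KNOWN_NAMES` dictionary, and the `scrub` function are illustrative assumptions and do not reproduce any of the cited systems, which rely on far richer rule sets and knowledge bases.

```python
import re

# Hypothetical patterns for a few explicit identifier types (assumed formats).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

# Hypothetical dictionary of proper names; real tools use large name databases.
KNOWN_NAMES = {"john", "smith"}


def scrub(text):
    """Replace pattern matches and dictionary names with category tags."""
    # Step 1: pattern matching for identifiers with a fixed surface form.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)

    # Step 2: dictionary lookup for names, one alphabetic word at a time.
    def replace_name(match):
        word = match.group(0)
        return "[NAME]" if word.lower() in KNOWN_NAMES else word

    return re.sub(r"[A-Za-z]+", replace_name, text)


print(scrub("John Smith, SSN 123-45-6789, seen 03/14/2019, call 555-123-4567."))
```

The sketch also shows the main weakness noted for rule-based systems: any identifier whose format or spelling is not anticipated by a rule or dictionary entry passes through unchanged.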

B. MACHINE LEARNING-BASED SYSTEMS
Named Entity Recognition (NER), also known as entity extraction, automatically identifies and classifies terms from unstructured text into pre-defined categories or classes. For example, categories in the privacy domain include names, addresses, gender, age, country, profession, or any other personal details [6]. Many past works mapped the text de-identification problem to an NER problem. The entities in the text that contain the patients' personal information (entities to be de-identified) are treated as the entities that need to be extracted. The anonymization task is similar to NER, but it is more complex because it both deletes personal information and attempts to classify the personal information in the text into one of the HIPAA-defined categories [13].
Over the years, many researchers have proposed ML-based approaches to anonymization, such as Medlock, et al. [4], Szarvas, et al. [14], and Lopez, et al. [6]. Medlock, et al. [4] proposed an NLP-based text anonymization technique to preserve patients' privacy. They utilized three different strategies to achieve anonymization: a) replacing the sensitive reference with a blank placeholder, b) replacing the reference with the name of its category, and c) replacing the reference with a pseudo reference of the same category. The following features were used to train an ML model and classify whether a cluster contains sensitive information: Part-of-Speech (POS), inner left constituent label, second inner left constituent label, outer left constituent label, outer left constituent token, and orthography. Szarvas, et al. [14] used a decision-tree-based, iterative NER approach to de-identify semi-structured documents such as discharge summary records. Here, the iterative learning method utilizes the information given in the structured parts of the texts to improve PHI recognition accuracy in free text. Recently, Lopez, et al. [6] proposed HITZALMED 1, a web-based tool that assists with the anonymization of clinical free text in Spanish. Similar to Medlock, et al. [4], it supports identification, classification, masking, and replacement of sensitive information. Once sensitive information is detected, different anonymization techniques, configurable by the user, are applied. They utilized a hybrid approach that combines ML techniques to detect PHI with a rule-based system for anonymization.
In 2014, the i2b2/UTHealth NLP shared task featured a de-identification track that focused on identifying PHIs in clinical narratives [15]. It introduced a newly de-identified corpus of longitudinal medical records drawn from the Research Patient Data Repository of Partners Healthcare. Popular submissions to the shared task included CRF-based systems. He, et al. [16] trained a CRF system with lexical, orthographic, and syntactic features. They pre-processed their data with OpenNLP's tokenizer. Grouin, et al. [17], Liu, et al. [18], and Yang, et al. [19] utilized both CRF and rule-based approaches in their systems. The CRF-based approach of Grouin, et al. [17] included surface features (the token itself, token length, typographic case, presence of punctuation or digits), morpho-syntactic features such as POS, and distributional analysis features such as the frequency in the corpus, document section, and cluster ID based on context. They also utilized regular expressions in their rule-based approach to correct CRF outputs. The CRF-based approach of Yang, et al. [19] utilized word-token features (lemma, POS, chunk), context features (lemma, POS, chunk of nearby tokens), orthographic features (capitalization, punctuation, regex patterns for dates and usernames), and sentence-level features (position of the token in a sentence, section headers). They used dictionaries and regular expressions to identify PHI categories with few sample instances. The CRF-based approach of Liu, et al. [18] included bag-of-words, POS, orthography features, section information, and word representation features, and their rule-based approach used regular expressions to identify standardized PHI.
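The three replacement strategies attributed to Medlock, et al. [4] can be illustrated with a short Python sketch. It assumes that sensitive spans have already been detected and labeled; the `anonymize` function, the span format, and the `PSEUDONYMS` table are hypothetical names introduced here for illustration only.

```python
# Hypothetical pools of pseudo references per category (illustrative values).
PSEUDONYMS = {
    "PERSON": ["Alex Doe", "Sam Roe"],
    "LOCATION": ["Springfield", "Riverton"],
}


def anonymize(tokens, spans, strategy="category"):
    """Rewrite detected spans in a token list.

    tokens: list of str
    spans: non-overlapping (start, end, category) tuples, end exclusive
    strategy: "blank", "category", or "pseudo"
    """
    used = {}  # how many pseudo references were drawn per category
    out = []
    cursor = 0
    for start, end, category in sorted(spans):
        out.extend(tokens[cursor:start])
        if strategy == "blank":
            out.append("___")  # a) blank placeholder
        elif strategy == "category":
            out.append(f"<{category}>")  # b) name of the category
        elif strategy == "pseudo":
            pool = PSEUDONYMS[category]  # c) same-category pseudo reference
            index = used.get(category, 0)
            out.append(pool[index % len(pool)])
            used[category] = index + 1
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        cursor = end
    out.extend(tokens[cursor:])
    return out


tokens = "Dr. Jane Miller moved to Boston".split()
spans = [(1, 3, "PERSON"), (5, 6, "LOCATION")]
for strategy in ("blank", "category", "pseudo"):
    print(strategy, " ".join(anonymize(tokens, spans, strategy)))
```

The strategies trade readability against information loss: blanks remove the most context, category tags keep the type of the removed entity, and pseudo references keep the text natural for downstream NLP tools.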
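To make the kind of hand-engineered features used by these CRF-based systems concrete, the sketch below computes a few surface, orthographic, and context features per token. It covers feature extraction only and does not train a CRF; the feature names and the `token_features` and `sentence_features` helpers are assumptions chosen for illustration, not the feature sets of the cited systems.

```python
import string


def token_features(tokens, i):
    """Surface, orthographic, and context features for the token at index i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),                                   # token itself
        "length": len(tok),                                     # token length
        "is_title": tok.istitle(),                              # typographic case
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),             # presence of digits
        "has_punct": any(c in string.punctuation for c in tok), # punctuation
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


for feats in sentence_features(["Admitted", "on", "03/14/2019", "."]):
    print(feats)
```

A sequence labeler would then be trained on such per-token feature dictionaries paired with PHI labels; POS tags, section information, and dictionary features would be added in the same way.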

C. DEEP LEARNING-BASED SYSTEMS
DL-based NLP approaches have improved data extraction performance and require no handcrafted features or rules. Recent works have utilized DL techniques for detecting PHIs. Dernoncourt, et al. [20], Jiang, et al. [21], and Catelli, et al. [22] developed systems based on CRFs and Bi-LSTMs for patient de-identification. Jiang, et al. [21] developed a CRF-based system and a Bi-LSTM network-based system that focus on de-identifying psychiatric evaluation records. They manually extracted rich features to train the CRF model and applied a character-level Bi-LSTM network to represent tokens and classify tags. Dernoncourt, et al. [20] used a combination of n-gram, morphological, orthographic, and gazetteer features for the CRF model. For the Bi-LSTM model, they mapped each token to a vector representation using a character-enhanced embedding. Dernoncourt, et al. [23] presented NeuroNER 2, an easy-to-use NER tool based on Artificial Neural Networks (ANNs). They apply the tool to patient de-identification and use an LSTM-based RNN for non-overlapping label prediction. Furthermore, Dobbins, et al. [24] utilized the same tool used by Dernoncourt, et al. [23] to compare the performance differences across two datasets for patient de-identification. They also created a dataset specifically for this study, the SIRM 3 COVID-19 de-identification corpus, from medical records provided by NeuroNER [23]. Recently, Catelli, et al. [22], [25] focused on how different word embeddings affect the input representation. Catelli, et al. [22] built a network combining a Bi-LSTM and a CRF to predict the target PHI entities. Here, they utilized Flair [26], a contextualized language model working at the character level, to capture the polysemy of words and manage the morpho-syntactic variations typical of handwritten notes. They argued that the stacked word representations capture latent syntactic and semantic similarities better. Catelli, et al. [25] further investigated the effectiveness of cross-lingual transfer learning to de-identify medical records written in a low-resource language such as Italian, using a high-resource language such as English, while maintaining the features necessary to perform the NER task for de-identification correctly. Here, they utilized a stacked embedding consisting of MultiBPEmb [27] and Flair embeddings [26], together with the Multilingual Bidirectional Encoder Representations from Transformers (mBERT)-cased 4 model. mBERT provides sentence representations for 104 languages, which are useful for many multilingual tasks.
Most of the proposed Bi-LSTM based models utilized only the global context to detect clinical entities and PHIs, not the local context. Therefore, Moqurrab, et al. [28] proposed a combination of CNN, Bi-LSTM, and CRF with non-complex embeddings to utilize both local and global context. Here, CNN was used to capture local context, while Bi-LSTM was used to capture global context. First, six independent CNN models are applied to extract the local context with various window sizes, then the combined local context is concatenated with the input representation and passed to the three-layered sequential Bi-LSTM architecture. Finally, the combined local and global context is passed to the CRF layer.
Li, et al. [29] and Sadat, et al. [7] tried an alternative approach named frequency-filtering to remove text that might contain sensitive terms related to personal information. Li, et al. [29] investigated the use of a frequency-filtering approach in which they filter out rare sentences (frequency < 3) and sentences containing bigrams under a certain frequency threshold (frequency < 256). Their approach is based on the assumption that sentences that appear frequently tend to contain no PHI, which originates from observations collected over many records. This approach is applicable to data anonymization from a single source. Improving on the work of Li, et al. [29], Sadat, et al. [7] extended the model to be applicable to distributed sources. Sadat, et al. [7] used frequency-based filtering to improve privacy protection on distributed sources of medical data. This framework first identified uncommon and low-frequency bigrams, which were then used to remove sentences containing PHI from clinical notes. This work also demonstrated the usefulness of homomorphic encryption for secure multiparty data analysis on medical records. Table 1 shows an overview of the works related to data privacy in the medical domain. For each work, it shows the year the work was published, the dataset used, and the type of approach used.
3 https://sirm.org/category/senza-categoria/covid-19/
4 https://github.com/google-research/bert/blob/master/multilingual.md
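The frequency-filtering idea attributed to Li, et al. [29] in this section can be sketched in a few lines of Python. The thresholds default to the values quoted in the text (3 for sentences, 256 for bigrams); the whitespace tokenization, the lowercasing, and the function names are simplifying assumptions, not details confirmed by the cited work.

```python
from collections import Counter


def bigrams(sentence):
    """Adjacent lowercase token pairs (whitespace tokenization, an assumption)."""
    toks = sentence.lower().split()
    return list(zip(toks, toks[1:]))


def frequency_filter(sentences, min_sentence_freq=3, min_bigram_freq=256):
    """Keep only sentences that are frequent and contain no rare bigram."""
    sentence_counts = Counter(s.lower() for s in sentences)
    bigram_counts = Counter(bg for s in sentences for bg in bigrams(s))
    kept = []
    for s in sentences:
        if sentence_counts[s.lower()] < min_sentence_freq:
            continue  # rare sentence: may describe one specific patient
        if any(bigram_counts[bg] < min_bigram_freq for bg in bigrams(s)):
            continue  # contains a rare bigram such as a name
        kept.append(s)
    return kept


corpus = ["no acute distress"] * 4 + ["patient john smith seen today"]
# Small thresholds so that the toy corpus is not filtered out entirely.
print(frequency_filter(corpus, min_sentence_freq=3, min_bigram_freq=3))
```

On realistic corpora the thresholds matter a great deal: with the default values, the toy corpus above would be removed completely, which reflects that the approach only makes sense when counts are collected over many records.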

III. PRIVACY PRESERVATION IN TECHNOLOGY DOMAIN
We have seen enormous growth in people's interest in using social media networks, apps, and software in the past decade. Although these social media platforms allow people to freely interact and simplify their day-to-day activities, we often do not realize how much private and sensitive information is leaked [40]. This is primarily due to users' lack of knowledge about privacy risks. Previous studies demonstrated that privacy preservation is hindered for the following reasons [41]: 1) Individuals believe that they are less exposed to risks than others. 2) Individuals consider themselves to have higher skills than those they exhibit. 3) Individuals cannot evaluate the relevant risk factors, as they are unaware of most privacy risks.
For the above reasons, educating individuals about potential privacy risks and building privacy preservation systems are essential. Many works, such as Cappellari, et al. [42] and Canfora, et al. [41], utilized NLP-based solutions along with ML models to detect and prevent privacy violations. Cappellari, et al. [42] proposed a method to detect messages that carry sensitive information, and they built a privacy protection framework in which a client-side privacy awareness mechanism can alert users to potential private information leakages in their communications. They employ ML methods to build a privacy decision-making tool. They utilized NLP techniques during pre-processing: stop words are removed, each word is replaced with a common synonym via the WordNet lexical database [43], and each word is stemmed to reduce the dictionary of terms to words in their root form. Canfora, et al. [41] proposed a method, and an accompanying tool, to automatically intercept the sensitive information delivered in a social network post. They recognized specific recurrent patterns used in natural language by the user to express specific privacy leakage classes using the syntactic structures, and classified the classes automatically. They used the following pre-processing steps: tokenization, lowercase conversion, stop-word removal, and stemming. They report that sentence classification performance does not change with the selection of features or training set and that it outperforms state-of-the-art ML techniques. They also developed a browser
TABLE 1 (rows recovered at this point of the source). Year | Work | Dataset | Approach
(missing) | Sweeney, et al. [5] | Scrubbed subset of a pediatric medical record system | Detection algorithms using templates and knowledge base
2003 | Berman, et al. [10] | Pathology free text | Pattern matching
2006 | Beckwith, et al. [11] | Pathology reports [30] | Pattern matching
2006 | Medlock, et al. [4] | Informal Text Anonymization Corpus (ITAC) 5 | (missing)
In Europe, organizations are legally bound to release contractual information containing specific personal information of individuals. Therefore, for privacy assurance, several systems have been built to automatically monitor Personally Identifiable Information (PII). PII denotes any representation of information that can expose the identity of an individual, much like PHI. Therefore, from here on, we use both terms interchangeably. Silva, et al. [2] proposed a system in which they used NER to identify, monitor, and validate PII. The experiments used three of the most well-known NLP tools to analyze their characteristics and capabilities: Natural Language Toolkit (NLTK 8), Stanford CoreNLP [44], and spaCy 9. NLTK is open-source Python software for manipulating different corpora, analyzing linguistic structure, and categorizing text. Stanford CoreNLP is open-source Java software containing higher-level NLP components, including sentiment analysis, dependency parsing, and NER. Finally, spaCy is an open-source software library for NLP written in Python and Cython and is considered one of the fastest NLP libraries. First, they assessed the tools' effectiveness with a generic dataset, then applied them to datasets that contained publicly available PII such as names, addresses, contact numbers, or other related types. Further, they showed that their method could act as a Privacy Enhancing Technology (PET) and discussed the potential risks and associated impacts.
Nan, et al. [45] addressed the challenge of analyzing information leaks within mobile apps by automatically detecting code that operates on user-sensitive data. Mobile apps usually contain semantically meaningful program elements that document what the code does. Leveraging this information, the authors designed an NLP-driven solution that locates the program elements (variables, methods) and performs an ML-based program structure analysis to detect the program elements of apps carrying sensitive content. The following NLP techniques were used in their approach: (i) stemming, (ii) POS tagging, and (iii) dependency relation parsing.
Other sources of privacy leaks in the technology domain are malicious hyperlinks pointing to various types of viruses and phishing texts that lure individuals into providing sensitive data such as personal information, banking and credit card details, and passwords [46]. Fattahi, et al. [46] put forward a new tool, called SpaML, for spam detection using a set of supervised and unsupervised classifiers and two NLP techniques: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). SpaML operates in two modes (BoW, TF-IDF) and utilizes seven supervised and unsupervised detectors: Multinomial Naive Bayes (MNB), Logistic Regression (LR), Support Vector Machine (SVM), Nearest Centroid (NCC), Extreme Gradient Boosting (XGBoost), K-Nearest Neighbors (KNN), and perceptron. In addition, it utilizes a majority-vote strategy to make the final decision based on the predictions of its base learners.
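The two text representations and the voting step mentioned above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not SpaML itself: it uses one common TF-IDF variant (relative term frequency times log(N/df)), whitespace tokenization, and a hypothetical list of base-learner predictions in place of trained classifiers.

```python
import math
from collections import Counter


def bag_of_words(doc):
    """Term counts for one document (whitespace tokenization, an assumption)."""
    return Counter(doc.lower().split())


def tf_idf(docs):
    """TF-IDF vectors as dicts, using tf = count/len and idf = log(N/df)."""
    bows = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: iterating a Counter yields each term once per doc.
    df = Counter(term for bow in bows for term in bow)
    vectors = []
    for bow in bows:
        total = sum(bow.values())
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in bow.items()}
        )
    return vectors


def majority_vote(predictions):
    """Most frequent label among the base learners' predictions."""
    return Counter(predictions).most_common(1)[0][0]


docs = ["win cash now", "meeting at noon", "win a prize now"]
print(tf_idf(docs)[0])
# Hypothetical outputs of seven base classifiers for one message.
print(majority_vote(["spam", "ham", "spam", "spam", "ham", "spam", "ham"]))
```

With an odd number of voters and two classes, as in the seven-detector setup described above, the vote can never tie.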
Graph convolutional networks (GCNs) are a robust architecture for graph-based data representations such as citation and social networks. Nevertheless, they are prone to privacy leaks due to their training specifics. Igamberdiev, et al. [47] proposed a method to apply differentially private stochastic gradient descent and its variants to GCNs, making it possible to maintain strict privacy guarantees while preserving performance. They also proposed a differentially private version of the Adam optimizer. They conducted experiments on five datasets in two languages (English and Slovak), covering a variety of NLP tasks, such as research article classification in citation networks, Reddit post classification, and user interest classification in social networks. Table 2 shows an overview of the works related to privacy preservation in the technology domain. For each work, it shows the year the work was published, the dataset used, the domain of the dataset, and the models used in the work.
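For readers unfamiliar with differentially private stochastic gradient descent, the sketch below shows the generic update step as it is commonly described: clip each per-example gradient to a maximum L2 norm, sum, add Gaussian noise scaled to that norm, and average. This is not the GCN-specific method of Igamberdiev, et al. [47]; the function names and default hyperparameters are assumptions, and no privacy accounting is included.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr=0.1, max_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One noisy gradient step over a batch of per-example gradients."""
    rng = rng or random.Random()
    clipped = [clip(g, max_norm) for g in per_example_grads]
    batch_size = len(clipped)
    new_params = []
    for j, p in enumerate(params):
        summed = sum(g[j] for g in clipped)
        # Gaussian noise with standard deviation noise_multiplier * max_norm.
        noisy = summed + rng.gauss(0.0, noise_multiplier * max_norm)
        new_params.append(p - lr * noisy / batch_size)
    return new_params


grads = [[3.0, 4.0], [0.0, 0.5]]
print(dp_sgd_step([1.0, 1.0], grads, rng=random.Random(0)))
```

Clipping bounds the influence any single training example can have on the update, and the noise masks what remains; the resulting privacy guarantee depends on the noise multiplier, the sampling rate, and the number of steps, which a full implementation would track.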

IV. ANALYSIS OF PRIVACY POLICIES
A privacy policy is a statement that explains how the organization behind an app or software collects, uses, retains, and discloses personal information. This is often called a "privacy notice," "privacy statement," or "privacy terms." Privacy policies mainly describe the data-use practices of an app or software. Information privacy is built on the basic principle of notice and choice, meaning users should be able to make informed decisions about what information is collected and how it should be used [54]. In other words, the policies allow users to read them and decide to use a product or service only if they find the conditions acceptable. However, most privacy policies are lengthy, verbose, and challenging to understand. This imposes reading fatigue on users, which plays an active role in their decisions about which app or software to use [3]. Furthermore, studies show that even if users do read the policies, they often are still not able to answer basic questions about what these policies say [55]. Recently, the growing number of online services and apps with privacy policies has made the situation more complicated. In addition, some app developers and owners exploit this situation to collect and misuse the personal information of the users [3]. Many techniques and proposals have been designed to make the policies user-friendly and increase user awareness, but the semantic complexity of the privacy terms, the length of the text, and the application-dependent variables still make this challenging. As a result, these techniques are still insufficient to form a coherent picture of an app's or software's data gathering practices.
To address this, Alohaly, et al. [3] proposed an approach to quantify the amount of data collected by an app by analyzing its privacy policy text using NLP techniques. Their proposed design not only allows users to understand the policy easily but also allows them to compare the app with other applications in the market based on their data gathering practices. They used NLP techniques to analyze the privacy policy, extract potentially collected "information types" or "data items," which are noun phrases associated with collection practices, and then compare them against all possible information types. Then they normalized the resulting subset and applied a four-step quantification process:
1) locate the text segments that are relevant to collection practices
2) extract noun phrases that are potentially collected items
3) compare the extracted noun phrases with the information types in the lexicon, using similarity measures
4) count the number of collected items
Studies on user preference modeling suggest that a few essential features in privacy policies largely determine the user's comfort level [56]. Researchers such as Ammar, et al. [57], Sadeh, et al. [56], and Sathyendra, et al. [54] focused on using NLP to identify and extract essential fragments of a privacy policy to make it easier for the user to understand. Ammar, et al. [57] conducted a pilot experiment to estimate the extractability of salient features from website privacy policies. They combined NLP techniques and ML algorithms to extract the salient features. They utilized logistic regression, a classic high-performance probabilistic model, to map privacy policy documents to categorical labels. Both works of Sadeh, et al. [55], [56] focus on developing an NLP framework to automate the extraction of vital information from privacy policies to give users more control over their privacy. They combine privacy preference modeling, crowdsourcing, formal methods, and privacy interface design. Their objectives are to extract key privacy policy features semi-automatically and present them to users in an easy-to-digest format that enables them to make more informed privacy decisions. They used NLP techniques in pre-processing for crowdsourcing, which reduces manual labor, filters out unnecessary text fragments, and focuses on the relevant segments in a privacy policy. They also proposed augmenting crowdsourcing results with ML algorithms and NLP techniques to develop the tools needed to automatically extract answers to questions about privacy terms. Xiao, et al. [58] adapted NLP techniques designed around a model to extract instances from software documents and produce formal specifications automatically. The linguistic-analysis component of their approach adapts the following NLP techniques, which parse the software documents and annotate the words and phrases in the document sentences with semantic meaning: shallow parsing, use of a domain dictionary, anaphora resolution, negative-expression identification, and syntactic and semantic pattern matching. Sathyendra, et al. [54] focused on automatically identifying and extracting choice instances, i.e., statements in a policy that give users discretion over aspects of their privacy. They focused on a two-stage ML procedure and treated the identification of choice instances as a binary classification problem, in which they label each sentence in the text according to whether it contains a choice instance. They further annotated another dataset 11 and developed a hybrid model architecture to identify and label different types automatically. They used the following features: stemmed unigrams, stemmed bigrams, relative location in the document, topic model features, modal verbs, opt-out specific phrases, and syntactic parse tree features. They then used a two-stage architecture of ML models for classification.
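The four-step quantification process attributed to Alohaly, et al. [3] earlier in this section can be made concrete with a deliberately crude sketch. Everything specific here is an assumption: the verb pattern, the tiny `LEXICON`, the comma and "and" splitting that stands in for real noun-phrase extraction, and the use of `difflib.SequenceMatcher` with a 0.8 threshold as the similarity measure.

```python
import re
from difflib import SequenceMatcher

# Hypothetical collection verbs and information-type lexicon.
COLLECT_RE = re.compile(r"\b(?:collect|gather|obtain)s?\b\s+(.*)", re.IGNORECASE)
LEXICON = ["email address", "location", "phone number", "contacts"]


def collection_segments(policy):
    """Step 1: sentences that mention a collection practice."""
    sentences = re.split(r"(?<=[.!?])\s+", policy.strip())
    return [s for s in sentences if COLLECT_RE.search(s)]


def candidate_items(segment):
    """Step 2: crude stand-in for noun-phrase extraction after the verb."""
    match = COLLECT_RE.search(segment.rstrip(".!?"))
    if match is None:
        return []
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(1).lower())
    return [re.sub(r"^(?:your|the|an?)\s+", "", p).strip() for p in parts if p.strip()]


def match_lexicon(item, threshold=0.8):
    """Step 3: closest lexicon entry, or None if below the threshold."""
    best = max(LEXICON, key=lambda e: SequenceMatcher(None, item, e).ratio())
    score = SequenceMatcher(None, item, best).ratio()
    return best if score >= threshold else None


def collected_items(policy, threshold=0.8):
    """Step 4: distinct matched information types; their count is the score."""
    found = set()
    for segment in collection_segments(policy):
        for item in candidate_items(segment):
            entry = match_lexicon(item, threshold)
            if entry is not None:
                found.add(entry)
    return sorted(found)


policy = ("We value your privacy. We collect your email address, location "
          "and phone numbers. You may contact us anytime.")
items = collected_items(policy)
print(len(items), items)
```

The count gives a single number per policy, which is what makes policies comparable across apps; a faithful implementation would replace the splitting heuristic with a syntactic parser and the character-level similarity with a semantic one.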
A few researchers developed a corpus or lexicon (the vocabulary of a language or a branch of knowledge) to support and improve the analysis of privacy policies. For example, Bhatia, et al. [59] conducted a study and developed an information type lexicon based on privacy policy annotations obtained from crowdsourcing and an entity extractor based on POS tagging. Using the lexicon, they suggested performing a richer analysis of policies or measuring the degree [...]
Furthermore, some app developers collect data about their users and share it with advertising companies to raise revenue; the data is then used to serve targeted ads to end-users [62]. Given the size of the app marketplaces, verifying the third-party data recipients in each policy is a tedious task. Therefore, Hosseini, et al. [62] developed an automated approach to extract and categorize third-party data recipients (i.e., entities) declared in privacy policies. They framed the detection and classification of third-party entities as a NER problem and utilized Stanford CoreNLP for tokenization. Further, they used POS tags to label each token, utilized Bag-of-Words (BoW) and Word2Vec [63] for vectorization, and passed the resulting vectors into a Bi-LSTM-CRF model for classification. Word2Vec is a technique that learns distributed representations of words from word associations.
In Europe, privacy policies are subject to compliance with GDPR. Since manual completeness checking is both time-consuming and error-prone, Amaral, et al. [64] recently proposed an AI-based automation system for the completeness checking of privacy policies. First, they built two artifacts to characterize the privacy-related provisions of GDPR; then, they developed an automated solution on top of these artifacts with a combination of NLP and supervised ML. Their NLP pipeline combines six consecutive NLP modules divided into three categories:
1) Parsing the policy text: tokenization, sentence splitting
2) Extracting information from the text: NER, regular expressions
3) Normalizing text: lemmatization, stop-word removal
Finally, they utilized SVM for multi-class and multi-label classification. Table 3 shows an overview of the works described here that are related to the analysis of privacy policies. For each work, the table includes the year the work is published, the dataset used, and the domain of the policies the dataset came from.
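The three pipeline categories described for Amaral, et al. [64] (parsing, information extraction, normalization) can be sketched with the Python standard library alone. This is a simplified illustration, not the authors' pipeline: NER and lemmatization are omitted, lowercasing stands in for normalization beyond stop-word removal, the stop-word list is a small hypothetical one, and the only extraction rule is an e-mail regular expression.

```python
import re

# Small illustrative stop-word list (an assumption, not the list used in [64]).
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "or", "at", "is", "are",
              "we", "you", "your", "our", "can", "for"}
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Tokens are either whole e-mail addresses or runs of word characters.
TOKEN_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\w+")


def split_sentences(text):
    """Parsing: split on whitespace that follows sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Parsing: split a sentence into word and e-mail tokens."""
    return TOKEN_PATTERN.findall(sentence)


def process_policy(text):
    """Run parsing, regex-based extraction, and normalization on each sentence."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        emails = EMAIL_PATTERN.findall(sentence)  # extraction via regular expression
        normalized = [
            t.lower() for t in tokens
            if t.lower() not in STOP_WORDS and t not in emails
        ]
        results.append({"sentence": sentence, "emails": emails, "normalized": normalized})
    return results


sample = ("You can contact our data protection officer at dpo@example.com. "
          "We retain data for two years.")
processed = process_policy(sample)
for entry in processed:
    print(entry["emails"], entry["normalized"])
```

The normalized tokens and extracted items would then serve as input features for a supervised classifier such as an SVM.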

V. PRIVACY LEAKS DETECTION IN TEXT REPRESENTATION
Writing styles vary from person to person. This variation is mainly due to the authors' background and personal attributes such as gender, age, education, and nationality [40]. Therefore, a written text often leaves enough clues that can lead to the identification of the author. This situation can lead to problems when these texts are used to train NLP models [40]:
1) Variations in the text eventually lead to significant variation in inferences across different types of corpora. Moreover, models that fit these datasets would be biased.
2) The texts in the data compromise the authors' privacy, especially data collected from emails, SMS messages, social media posts, and medical records.
3) The latent representations generated from these data can still contain sensitive information, which can fall into the hands of an adversary who can reverse engineer the representations and recover the information.
Figure 1 illustrates a possible attack where an adversary could listen to the latent representation in the middle and obtain the sensitive information. For example, the classifier predicts class y from text x, and an adversary tries to recover the private information z in x through the classifier's latent representation. The naive solution to these attacks is to remove the protected attributes, which is insufficient because other features may be highly correlated with the protected attributes [67]. Several past works deal with adversarial attacks on NLP-based systems to prevent sensitive information leaks through representations.
FIGURE 1. An illustration of a possible attack situation [66]
| Year | Authors | Dataset | Domain |
|------|---------|---------|--------|
| | Sadeh, et al. [56], [55] | | Mobile apps |
| 2014 | Liu, et al. [60] | Websites ranked by Alexa | Websites |
| 2015 | Bhatia, et al. [59] | Information type lexicon | Websites |
| 2016 | Alohaly, et al. [3] | Crowd-sourced websites [57] | Websites |
| 2017 | Sathyendra, et al. [54] | OPP-115 Corpus [65] | Websites |
| 2019 | Ravichander, et al. [61] | PRIVACYQA [61] | Mobile apps |
| 2020 | Hosseini, et al. [62] | App policies from Google Play Store | Mobile apps |
| 2021 | Amaral, et al. [64] | Web service or App policies | Web service |
Alawad, et al. [68] used a DL-based approach to automatically extract cancer characteristics from the high volume of unstructured pathology text reports of cancer registries. They used a multitask CNN method, and the privacy-preserving model outperformed the single registry model in preserving privacy. Li, et al. [40] proposed an approach for privacy-preserving learning of unbiased representations that explicitly obscures individuals' private information. They employed adversarial learning models inspired by the domain adaptation work of Ganin, et al. [69], which suggests that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. They jointly learn a discriminator model along with a supervised model and aim for a good prediction of the target and a poor representation of the sensitive information.
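The joint training of a supervised model and a discriminator can be written as a saddle-point problem in the style of the domain-adversarial formulation of Ganin, et al. [69]. The notation below is assumed here for illustration and is not necessarily the notation of [40]: $\theta_f$ are the parameters of the shared encoder, $\theta_y$ those of the task classifier, $\theta_d$ those of the discriminator that predicts the sensitive attribute, $\mathcal{L}_y$ and $\mathcal{L}_d$ are the corresponding losses, and $\lambda > 0$ weights the privacy term.

```latex
E(\theta_f, \theta_y, \theta_d)
  = \mathcal{L}_y(\theta_f, \theta_y) - \lambda \, \mathcal{L}_d(\theta_f, \theta_d)

(\hat{\theta}_f, \hat{\theta}_y)
  = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d),
\qquad
\hat{\theta}_d
  = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
```

Under this reading, the discriminator minimizes its own loss $\mathcal{L}_d$, while the encoder is pushed to increase it, so that the learned representation predicts the target well but carries little information about the sensitive attribute.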
Coavoux, et al. [66] proposed a metric to measure the privacy of the neural representation of the input for NLP tasks such as sentiment analysis and topic classification. The metric is based on an attacker's ability (the performance of the attacker's classifier) to recover information about the input from the latent representation. They presented three defense mechanisms against this type of attack, which minimize some measure of information and make prediction hard for the adversary. The three training methods are multidetasking, adversarial generation, and de-clustering.
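A metric based on the attacker's performance can be sketched as follows. This is an illustration under assumptions, not the evaluation protocol of [66]: the attacker is a simple nearest-centroid classifier rather than a neural network, the latent vectors are toy values, and privacy is reported as one minus the attacker's accuracy.

```python
def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_attacker(latents, private_labels):
    """Fit a nearest-centroid attacker: one centroid per private label."""
    groups = {}
    for vector, label in zip(latents, private_labels):
        groups.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in groups.items()}


def attacker_accuracy(centroids, latents, private_labels):
    """Fraction of held-out latent vectors whose private label the attacker recovers."""
    correct = 0
    for vector, label in zip(latents, private_labels):
        guess = min(centroids, key=lambda c: squared_distance(vector, centroids[c]))
        correct += int(guess == label)
    return correct / len(latents)


def privacy_score(centroids, latents, private_labels):
    """Higher is better for the defender: 1 - attacker accuracy."""
    return 1.0 - attacker_accuracy(centroids, latents, private_labels)


# Toy latent representations that leak a binary private attribute z.
train_latents = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
train_z = [0, 0, 1, 1]
attacker = train_attacker(train_latents, train_z)

leaky = privacy_score(attacker, [[0.05, 0.05], [0.95, 0.95]], [0, 1])
uninformative = privacy_score(attacker, [[0.5, 0.5], [0.5, 0.5]], [0, 1])
print(leaky, uninformative)
```

In the leaky case the attacker recovers every private label, so the score is 0.0; when the two held-out representations are identical, the attacker can be right about only one of them, so the score is 0.5.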
Both of the above works provide only empirical improvements in privacy without any formal guarantees. Therefore, researchers moved toward building systems based on Differential Privacy (DP), which provides a formal privacy guarantee for the representation extracted from user-authored text [70]. Lately, DP has become a de facto standard for privacy analysis, where researchers introduce noise into the data to make data related to specific people more difficult to trace. DP algorithms guarantee that the algorithm's behavior hardly changes when a single individual joins or leaves the dataset. Lyu, et al. [70] proposed a novel approach called Differentially Private Neural Representation (DPNR), which utilizes DP to provide a formal privacy guarantee. They introduced a DP noise layer to preserve the privacy of the extracted text representation without degrading the main classification task. Through this layer, they controlled how much noise to add for the robust algorithm. Fernandes, et al. [71] combined ideas from "generalized DP" and ML techniques to model privacy for text processing. They demonstrated how to use ideas from differential privacy to provide strong a priori privacy guarantees in document disclosures. They used BoW to represent text documents, as it contains sufficient information for the representation, and they used $d_{\mathcal{X}}$-privacy [72], a metric-based extension of differential privacy, to implement an automated privacy mechanism. The mechanism takes the BoW as input and produces noisy BoW outputs.
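The idea of adding calibrated noise to a representation can be sketched with the standard Laplace mechanism. This is a generic illustration, not the DPNR noise layer of [70] or the mechanism of [71]: the clipping range [0, 1], the use of the vector dimension as the L1 sensitivity, and the function names are assumptions of this sketch.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the scaled difference of two unit exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def clip(vector, low=0.0, high=1.0):
    """Bound every coordinate so that the sensitivity of the release is finite."""
    return [min(max(value, low), high) for value in vector]


def privatize(representation, epsilon, rng):
    """Release a noisy copy of the representation under the Laplace mechanism.

    Any two clipped vectors in [0, 1]^n differ by at most n in L1 norm, so the
    noise scale is n / epsilon. Smaller epsilon means more noise.
    """
    clipped = clip(representation)
    sensitivity = float(len(clipped))
    scale = sensitivity / epsilon
    return [value + laplace_noise(scale, rng) for value in clipped]


rng = random.Random(0)
noisy = privatize([0.2, 1.5, -0.3], epsilon=1.0, rng=rng)
print(noisy)
```

The privacy budget epsilon controls the trade-off: a small epsilon gives stronger privacy but a noisier representation for the downstream classifier.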
Pre-trained contextualized language models have been shown to increase the performance of several NLP tasks, but existing text sanitization mechanisms still provide low utility because they are cursed by the high-dimensional text representation [73]. Yue, et al. [73] built a privacy-preserving NLP (PPNLP) pipeline that addresses privacy from the root by producing sanitized text documents directly. Here, they sanitize the public data before using them for training because they believe this prepares the model to work with sanitized queries, increasing accuracy. They proposed two token-wise sanitization methods, SanText and SanText+, which are built atop a variant of the exponential mechanism (EM) [74] to avoid the "cursed dimensions" of token embeddings. Finally, they passed the output tokens into BERT for classification.
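Token-wise sanitization with an exponential-style mechanism can be sketched as follows. This is a toy illustration under assumptions, not the implementation of [73]: the three-word vocabulary and its 2-D embeddings are hypothetical, and each token is replaced by a vocabulary token sampled with probability proportional to exp(-epsilon * d / 2), where d is the Euclidean distance between embeddings.

```python
import math
import random

# Toy 2-D token embeddings (hypothetical values for illustration only).
EMBEDDINGS = {
    "london": (0.0, 0.0),
    "paris": (0.1, 0.0),
    "banana": (5.0, 5.0),
}


def replacement_probabilities(token, epsilon):
    """Probability of replacing `token` with each vocabulary token."""
    weights = {}
    for candidate, vector in EMBEDDINGS.items():
        distance = math.dist(EMBEDDINGS[token], vector)
        weights[candidate] = math.exp(-0.5 * epsilon * distance)
    total = sum(weights.values())
    return {candidate: weight / total for candidate, weight in weights.items()}


def sanitize(tokens, epsilon, rng):
    """Replace every token independently by sampling from its distribution."""
    sanitized = []
    for token in tokens:
        probabilities = replacement_probabilities(token, epsilon)
        candidates = list(probabilities)
        weights = [probabilities[c] for c in candidates]
        sanitized.append(rng.choices(candidates, weights=weights, k=1)[0])
    return sanitized


probs = replacement_probabilities("london", epsilon=1.0)
print(probs)
print(sanitize(["london", "banana"], epsilon=1.0, rng=random.Random(0)))
```

Nearby tokens are more likely replacements than distant ones, and a larger epsilon makes the mechanism keep the original token more often, trading privacy for utility.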
Recently, general-purpose language models such as ELMo [75], BERT [76], and Generative Pre-trained Transformer-2 (GPT-2) [77] have been used successfully in NLP to convert text to vectors. Nevertheless, the embeddings from general-purpose language models also capture much sensitive information from the plain text and are a potential risk. Pan, et al. [78] were the first to present a systematic study of the privacy risks of eight state-of-the-art language models by constructing two novel attack scenarios: pattern reconstruction attacks and keyword inference attacks. A pattern reconstruction attack aims to recover a specific segment of the plain text with a fixed format, such as date of birth or gender, and a keyword inference attack aims to infer sensitive information using the existing words in the text. Through the study, they confirmed the existence of privacy risks. They also proposed four defense mechanisms that obscure the unprotected embeddings to mitigate the risks:
1) Rounding: apply floating-point rounding on each coordinate of the sentence embeddings for obfuscation.
2) Laplace Mechanism: perturb the embedding coordinate-wise with samples from a Laplace distribution whose parameters are determined by the sensitivity of the language model.
3) Privacy-Preserving Mapping: apply adversarial training as mentioned by Li, et al. [40].
4) Subspace Projection: remove the unwanted subspace that encodes the keyword's occurrence from the universal sentence embedding space.
Table 4 provides a summary of the works done to prevent privacy violations in the learned text representations. The table shows the year, authors, state-of-the-art datasets used for experiments, and the NLP tasks the datasets were evaluated for each paper.
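Two of the four defenses listed above, rounding and subspace projection, are simple enough to sketch directly. The code below is an illustration under assumptions rather than the implementation of [78]: the embeddings are toy vectors, and the unwanted subspace is simplified to a single direction, which is assumed to be known.

```python
def round_embedding(embedding, decimals=1):
    """Rounding defense: keep only `decimals` decimal places per coordinate."""
    return [round(value, decimals) for value in embedding]


def remove_direction(embedding, direction):
    """Subspace projection (1-D case): subtract the component along `direction`."""
    norm_squared = sum(d * d for d in direction)
    coefficient = sum(e * d for e, d in zip(embedding, direction)) / norm_squared
    return [e - coefficient * d for e, d in zip(embedding, direction)]


rounded = round_embedding([0.1234, -0.9876, 0.5012], decimals=2)
projected = remove_direction([3.0, 4.0], [1.0, 0.0])
print(rounded, projected)
```

After the projection, the embedding has no component left along the removed direction, so a linear attacker that relies on that direction alone learns nothing from it.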

VI. DISCUSSION
In this section, we provide a summary of all the works discussed in the above sections and our novel insights into the future directions we can take to tackle privacy issues with NLP-based techniques.

A. SUMMARY
Table 5 provides an overview of the works done in the privacy domain using NLP techniques in chronological order. For each paper reference, the table shows the year, authors, and the paper's main objectives. For ease of understanding, we grouped the papers into the four categories discussed in the above sections:
• A - Data privacy in the medical domain
• B - Privacy preservation in the technology domain
• C - Analysis of privacy policies
• D - Privacy leaks detection in the text representation

B. FUTURE DIRECTIONS
So far, we have discussed previous works that used NLP-based techniques to address data privacy challenges. Here, we discuss some privacy-related issues and the future directions we propose for utilizing NLP techniques in privacy preservation. This review focuses on the de-identification or anonymization of personal identifiers in the medical and technological domains. However, there are other domains where documents or artifacts that contain personal identification details are shared between institutions, such as financial documents, Curricula Vitae (CVs), and resumes. Therefore, techniques similar to those discussed here can be enhanced and applied to data from other domains. Also, we focus here on personal identifiers only, but researchers could apply these techniques to identify quasi-identifiers. Quasi-identifiers are not unique identifiers by themselves, but in combination they can create a unique identifier that correlates with specific entities.
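A small example shows why quasi-identifiers matter. In the hypothetical records below (the field names and values are invented for illustration), no single attribute singles out most individuals, but the combination of all three makes every record unique.

```python
from collections import Counter

# Hypothetical records; none of these values comes from a real dataset.
RECORDS = [
    {"zip": "12345", "birth_year": 1980, "gender": "F"},
    {"zip": "12345", "birth_year": 1980, "gender": "M"},
    {"zip": "12345", "birth_year": 1975, "gender": "F"},
    {"zip": "67890", "birth_year": 1980, "gender": "F"},
]


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs exactly once."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


print(unique_fraction(RECORDS, ["zip"]))
print(unique_fraction(RECORDS, ["zip", "birth_year", "gender"]))
```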
During this review, we found that few studies explored different word embeddings, such as Flair embeddings and MultiBPEmb embeddings, to capture different aspects of the input representation. We should further explore utilizing different word embeddings, especially deep contextualized word embeddings such as ELMo [75] and BERT [76]. Since most of the datasets belong to the clinical or healthcare domain, we can specifically use BioBERT [91] or Clinical BERT [92]. Word embeddings pre-trained on such large-scale data help to represent tokens more effectively.
In the future, we can explore the possibility of utilizing transfer learning for domains where little data is available. For example, most of the clinical or healthcare datasets used to study ways to secure patients' privacy are smaller than those of other domains. Catelli, et al. [25] investigated the effectiveness of transfer learning across languages. It would be interesting to explore transfer learning that trains on datasets with more instances and tests on datasets with fewer instances. Also, we can investigate new optimizations that reduce the resource requirements and the training data needed to analyze domains where little data is available [90].
In the course of our review, we noticed that there has not been any research using NLP-based approaches for privacy preservation in Twitter data. Twitter is one of the most popular social networking sites, and Twitter data is used for research purposes in multiple domains such as political campaigns, movie reviews, and industry-related reviews. These data can carry sensitive information about the users that can be exploited. NLP-based techniques can be used to remove or anonymize the personalized information in the tweets.
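As a starting point, surface-level identifiers in tweets can be masked with regular expressions. The sketch below is only an illustration: the patterns and the placeholder tags are assumptions, and it covers user mentions, URLs, and e-mail addresses only. Names, locations, and other free-text identifiers would require NER or similar techniques.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")


def anonymize_tweet(text):
    """Replace e-mail addresses, URLs, and user mentions with placeholder tags."""
    # E-mails are masked first so that their '@domain' part is not read as a mention.
    text = EMAIL.sub("<EMAIL>", text)
    text = URL.sub("<URL>", text)
    text = MENTION.sub("<USER>", text)
    return text


tweet = "Thanks @jane_doe! Details at https://example.com/x or mail me: jane@example.org"
print(anonymize_tweet(tweet))
```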
Another area of user data privacy we would like to focus on is location privacy. Many apps and social media networks track the location details of their users. An adversary can use this data to link records of the same individual, study and predict the movement patterns of an individual, and identify points of interest, which can endanger a targeted individual [93]. In the future, more research should focus on preserving the privacy of such data. NLP techniques can be applied to identify, extract, and normalize users' location information so that it does not fall into the wrong hands.
Furthermore, we discussed developing user-friendly privacy policies. In the future, we can focus on improving the usability of privacy policies by extracting relevant data practices and making them more accessible to users. We can use information extraction techniques utilized in NLP-based [...]
In the recent past, there was an urgency to manage and find cures for the COVID-19 pandemic. It was necessary to share large volumes of data between national and international organizations for these studies [94]. We should look into efficient organizational and technical measures to remove or replace PIs in Big Data applications in the era of COVID-19. We can conduct an inter-domain study to investigate ways to combine these measures with NLP to increase efficiency.

VII. CONCLUSION
This interdisciplinary review grouped state-of-the-art research in the privacy domain that utilized NLP-based techniques into four categories. We investigated methods to protect patients' health information in the medical domain through PHI anonymization and de-identification techniques. We analyzed techniques to educate individuals about potential privacy risks and to build systems for privacy preservation in social media networks, software, and apps. We further looked into designs that make privacy policies user-friendly, increase user awareness, and quantify the sensitive information in the policies. Finally, we studied methods that prevent an adversary from listening to the latent representation in the middle and obtaining sensitive information.

| Category | Year | Authors | Objective |
|----------|------|---------|-----------|
| A | 1996 | Sweeney, et al. [5] | utilized detection algorithms for PHI anonymization |
| A | 2003 | Berman, et al. [10] | proposed a concept-based scrubs pattern matching for PHI anonymization |
| A | 2006 | Beckwith, et al. [11] | designed a pattern matching tool for PHI anonymization |
| A | 2006 | Medlock, et al. [4] | proposed a feature extraction technique for PHI anonymization |
| A | 2007 | Szarvas, et al. [14] | proposed a decision tree-based pattern matching approach for PHI anonymization |
| C | 2012 | Ammar, et al. | conducted an experiment to estimate the extractability of salient features from privacy policies |
| C | 2012 | Xiao, et al. [58] | proposed an approach that adapts NLP techniques to auto-extract instances from software documents |
| C | 2013 | Sadeh, et al. [55] | proposed an algorithm to answer privacy questions of users semi-automatically |
| C | 2014 | Sadeh, et al. [56] | developed an NLP framework to auto-extract vital information from privacy policies |
| C | 2014 | Breaux, et al. [90] | mapped privacy requirements to a formal language description |
| C | 2014 | Liu, et al. [60] | contributed to an improved annotated dataset for pairwise evaluation of automatic methods |
| A | 2014 | He, et al. [16] | proposed a CRF-based system for patient anonymization in clinical narratives |
| A | 2014 | Liu, et al. [18] | proposed a CRF-based pattern matching system for patient anonymization in clinical narratives |
| A | 2014 | Yang, et al. [19] | proposed CRF-based pattern matching for patient anonymization in clinical narratives |
| A | 2014 | Grouin, et al. [17] | proposed a CRF-based pattern matching system for patient anonymization in clinical narratives |
| A | 2015 | Li, et al. [29] | proposed a frequency-filtering approach for patient anonymization |
| C | 2015 | Bhatia, et al. [59] | developed an information type lexicon based on privacy policy annotations |
| C | 2016 | Alohaly, et al. [3] | proposed an algorithm to quantify the amount of data collection of an application |
| C | 2017 | Sathyendra, et al. [54] | built a 2-stage classifier using feature selection |
| A | 2017 | Jiang, et al. [21] | proposed a CRF and LSTM-based system for patient anonymization in psychiatric evaluation records |
| A | 2017 | Dernoncourt, et al. [20] | proposed CRF and LSTM-based systems for patient anonymization in psychiatric evaluation records |
| A | 2017 | Dernoncourt, et al. [23] | designed a CRF and LSTM-based tool for patient de-identification |
| A | 2017 | Dobbins, et al. [23] | utilized a CRF and LSTM-based tool [23] to compare performance of datasets |
| A | 2017 | Dernoncourt, et al. [20] | proposed a CRF and LSTM-based system for PHI anonymization |
| B | 2017 | Cappellari, et al. [42] | built a privacy protection framework with ML algorithms |
| B | 2018 | Canfora, et al. [41] | designed a tool with ML algorithms to intercept private information in social media posts |
| B | 2018 | Nan, et al. [45] | proposed a pattern matching-based solution to auto-detect the code operating on private data in mobile apps |
| D | 2018 | Li, et al. [40] | proposed an approach to train a model for adversarial training in parallel |
| D | 2018 | Coavoux, et al. [66] | proposed a metric to measure the privacy of the neural representation of input text |
| C | 2019 | Ravichander, et al. [61] | built a corpus for QA methods in the privacy domain |
| A | 2019 | Sadat, et al. [7] | proposed homomorphic encryption for secure multi-party data analysis |
| D | 2019 | Fernandes, et al. [71] | proposed a combined approach of generalised DP and ML to model privacy for text documents |
| D | 2020 | Lyu, et al. [70] | proposed a representation to formally quantify DP |
| D | 2020 | Pan, et al. [78] | presented a systematic study on the privacy risks |
| C | 2020 | Hosseini, et al. [62] | developed an automated approach to extract and categorize third-party data recipients |
| B | 2020 | Silva, et al. [2] | proposed NER using NLP-based tools to identify, monitor and validate PII |
| D | 2020 | Alawad, et al. [68] | designed a privacy-preserving model using CNN |
| A | 2020 | Lopez, et al. [6] | designed a pattern matching, dictionaries, and ML-powered web tool for auto-detection of PHI |
| A | 2020 | Iwendi, et al. [12] | proposed a semantic privacy framework to effectively sanitize sensitive terms in healthcare documents |
| D | 2021 | Yue, et al. [73] | proposed two token-wise sanitization methods for text sanitization |
| B | 2021 | Fattahi, et al. [46] | proposed a tool for spam detection |
| B | 2021 | Igamberdiev, et al. [47] | applied differentially private stochastic gradient descent to GCNs to maintain strict privacy guarantees |
| C | 2021 | Amaral, et al. [64] | proposed an AI-based automation system for the completeness checking of privacy policies |
| A | 2021 | Catelli, et al. [22] | combined contextualized word representation and sub-document level analysis for clinical de-identification |
| A | 2021 | Catelli, et al. [25] | used cross-lingual transfer learning to de-identify medical records |
| A | 2021 | Moqurrab, et al. [28] | proposed a model that uses local and global context to extract clinical entities |