Introduction
The EU’s General Data Protection Regulation (GDPR) came into effect in 2018 and contains 99 articles and 173 recitals that apply to any company that processes or stores the personal data of EU citizens, even if the application is not EU-based [1]. In the most serious cases, the penalties for breaking GDPR laws can be as high as €20 million or 4% of annual turnover. In lesser scenarios, penalties and fines can still lead to reprimands and restrictions on obtaining and processing personal data, which can be detrimental for a company or organization that needs to store personal information [2]. To protect access to sensitive information and actions, Android utilises app permissions to support user privacy [3]. While there are different base permission types in the Android ecosystem, they are characterized by a protection level that describes the risk implied in the permission. Dangerous permissions (also known as runtime permissions) are among the select range of permission types that the user has to accept and acknowledge. In the official description in the Google developer documentation [4], a dangerous permission is “a higher-risk permission that would give a requesting application access to private user data or control over the device that can negatively impact the user”. Dangerous permissions carry the risk of revealing personal information and the identity of the user, and their use requires a privacy policy by law [5]. The decision to access sensitive areas of a device to obtain personal information is taken by the application developer and must be defined in the application manifest file.1

Developers are susceptible to errors when writing privacy policies that declare the collection, usage, processing and transfer of personal information in a meaningful and transparent way [6]. Such mistakes could lead to the developer inadvertently breaking GDPR laws and receiving a heavy fine, jeopardizing the company or organization they work for and undermining consumer transparency concerning the handling of personal data. Several studies [6], [7], [8], [9], [10], [11] have shown that developers struggle to embed privacy into software systems. These studies suggest that software developers who design systems that collect and process sensitive user data have difficulties incorporating privacy requirements and protocols from regulatory authorities into software applications. The lack of decision support tools for applying data protection principles, privacy reasoning, and user privacy verification in software design is cited by developers as the main deterrent to incorporating GDPR principles into software development practices [7], [8], [12].
In evaluating Android permission completeness, a large-scale evaluation of 164,156 Android apps was conducted in [13] and [14] to investigate whether an app’s privacy policy matches its dangerous permission requests. The investigations have shown that app privacy policies and sensitive (or “dangerous”) permission requests are not always transparent. Prior studies [15], [16], [17], [18], [19], [20], [21], [22] have demonstrated the discrepancies that exist in the Android and iOS ecosystems by evaluating sensitive data access through dangerous permissions, app code, third-party libraries, data dissemination practices, Android API usage, app privacy policies, library inclusion and other relevant metadata. The common denominator among these works is the investigation of the trustworthiness of app privacy policies from a privacy and regulatory point of view. The conclusion of the privacy compliance analyses of mobile apps investigated in the literature is the prevalence of questionable privacy policies, inconsistencies, lack of transparency and non-compliance with regulatory requirements. A challenge developers face is that they must comply with privacy laws, yet no established methodology exists to assist in the development of a privacy policy; developers are therefore trying to comply with regulations without the necessary knowledge of the language and explicit terms needed to implement dangerous Android permission-policy declarations (DAPD) [23]. This has resulted in many mobile application developers seeking guidance on Stack Overflow for the creation of compliant privacy policies [24], [25], [26]. The challenge of creating GDPR-compliant privacy policies becomes more evident as developers, whether through confusion, ease of development, misuse or disregard, request multiple permissions for the same information [27].
One way to mitigate the challenges developers face is by creating automated tools to assist small to medium-sized teams in the generation of permission-policy snippets that are compliant with privacy laws. To create developer-centric solutions, this study investigates GDPR compliance of the dangerous Android permission-policy declarations used for each permission group in 270 mobile applications during runtime. The three-pronged approach investigates (i) the completeness of dangerous Android permissions in fulfilling GDPR obligations, (ii) the feasibility of generating GDPR-compliant policies for sensitive permission requirements extracted from UML diagrams at design time, and (iii) whether the GDPR is fit for purpose in describing Android permission categories, the sensitive data requested, sensitive APIs, the actions permissions represent and their semantic meaning. Since the GDPR contains articles and recitals that describe the data protection obligations an individual or organisation must comply with, while Android permission-policy declarations are a developer’s attempt to transparently convey information about apps accessing dangerous permissions to collect sensitive data, it is necessary to investigate whether such permission-policy snippets are coherent, explicit, accurate, concise and transparent, using the GDPR as the benchmark. As a result, dangerous Android permission-policy statements provide a verifiable basis for completeness checking of privacy policies and applications. The contributions of this research are highlighted below:
Completeness Checking of Sensitive Android Permissions and GDPR: To the best of our knowledge, this is the first work that evaluates completeness checking between articles of the GDPR and sentences declaring the request and usage of sensitive Android permissions. Most works in the literature [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39] evaluate completeness checking of applications and privacy policies against GDPR requirements. We investigated how well the permission policies and categories adhere to GDPR. This was further backed by a thorough examination of the GDPR’s suitability for verifying the accuracy of Android permissions.
Empirical Analysis of the Suitability of Diverse Natural Language Processing Techniques for Text Similarity: We evaluate six NLP algorithms, at multiple textual dimensions, to measure GDPR compliance in the language and declarations used in different dangerous Android permission declaration methodologies. The algorithms investigated are Universal Sentence Encoder (USE) [40], Sentence-BERT (SBERT) [41], Bidirectional Encoder Representations from Transformers (BERT) word embeddings [43], N-grams embedded with GloVe [42] and BERT, Vector Space Modelling (VSM) [44], [45] and Fuzzy String Matching (FSM).
Requirements Engineering: While other techniques have operationalized requirements from texts using statistical NLP [46], semantic frames [47], semantic parsing [48], domain-specific languages [49], graphical modelling languages [50], privacy-enhanced business process model and notation [51], and information-flow labels [21], we used statistical NLP and UML mapping to identify permission-related requirements.
Privacy Policy Generation at Design Time Using UML Diagrams: Using modelling languages for visualising a system at design time, we implement a solution that helps developers generate compliant sensitive permission declarations from UML diagrams (class diagrams, activity diagrams, etc.). The solution scans the UML diagrams, determines which permissions are required based on the classes, attributes, operations and relationships between objects, and generates a privacy-policy declaration for each specific sensitive permission based on a specified threshold.
In this paper, the term DAPD is used frequently. By DAPD, we mean statements in the app privacy policy explicitly or implicitly describing access to dangerous or sensitive Android permissions declared in the app’s manifest file. These statements are required to provide information about the sensitive data the application is collecting through dangerous permissions and how it will be processed. If the application is accessing multiple sensitive areas of a user’s device, then it is expected to find multiple DAPDs in the app policy, since the app requires permissions for each area. By DAPD methodologies, we mean the different methods application developers use to provide these permission-policy snippets in their app privacy policy. We use completeness and compliance interchangeably. We are also aware of the debate around the terms privacy policy and privacy notice, which refer to two distinct documents. The argument has been that a privacy policy is internal, while a privacy notice is external and customer-facing. Under this view, privacy notices are statements that explain to visitors (users) how their data will be used and their privacy rights, while privacy policies are the company’s guidelines for how employees should protect customer data.2 For the purposes of this study, an external customer-facing statement prepared by app developers that outlines how the app collects, uses, and shares user data is referred to as the app privacy policy.
The rest of the paper is structured as follows. Section II reviews the literature on eliciting privacy and security requirements from GDPR for system compliance, completeness checking of privacy policies and applications, and natural language processing techniques for textual similarity in GDPR. Section III presents the methodology, including the key components of the proposed framework for runtime and design time GDPR compliance checking using Android app permissions, the datasets used and the pre-processing steps, the textual similarity algorithms implemented and the similarity metric used. Section IV demonstrates the results obtained from experiments designed to evaluate the proposed compliance-checking methodology, and discusses the practical implications of the results from a developer and platform perspective. Section V discusses the limitations of the proposed approach and future directions, while Section VI concludes the work with a summary of the key findings.
Literature Review
While there are works in literature [50], [52], [53], [54], [55], [56], [57] that have focused on extracting privacy-related and software requirements from GDPR, our work is focused on assisting developers with the compliance requirements associated with Android permissions declarations and UML design based on articles from the GDPR law. We provide a literature review of two key areas that relate to our work: (i) completeness checking of privacy policies, and (ii) completeness checking of software (applications) against data protection regulations.
A. Completeness Checking of Privacy Policies
Completeness checking of privacy policies against GDPR was examined in [28] and [29] using a two-pronged approach that aligns privacy-related requirements in GDPR with privacy policies through a conceptual model of metadata traceable to GDPR articles. Abualhaija et al. [30] proposed an automated question-answering approach for discovering legal text passages related to compliance requirements, helping requirements engineers embed privacy in the design of software systems. Lippi et al. [31] proposed CLAUDETTE, a web server that automates the detection of potentially unfair clauses in online contracts using machine learning and natural language processing on a corpus of 50 contracts, to accomplish AI-enabled consumer protection. Tesfay et al. [38] proposed PrivacyGuide, an end-user support tool for reading and understanding privacy policies using GDPR as the guide. Sanchez et al. [32] investigated the automation of privacy policy compliance as a multilabel text classification task using SVM, in which each statement in a given policy is assessed and classified against each data protection goal listed in GDPR.
Using a dataset of 115 privacy policies, Mousavi et al. [39] used word embeddings, CNN and BERT for the multilabel classification of privacy policy paragraphs into predefined categories to produce a standard benchmark for privacy policy classification. Through the representation of data practice descriptions in privacy statements as semantic frames, Bhatia et al. [33] proposed an approach for identifying incompleteness in data action instances such as collection, retention, usage and transfer. By modelling data-intensive applications (DIAs) as dataflows, Guerriero et al. [34] proposed a framework for defining, enforcing and checking privacy policies in large-scale DIAs. Elwany et al. [58] produced an Optical Character Recognition (OCR) mechanism to analyze legal documentation by leveraging a fine-tuned BERT model to understand and extract text from legal corpora. Elluri et al. [59] measured the semantic similarity of different GDPR laws with cloud privacy policies. Hegel et al. [60] used NLP and OCR on legal documents to extract visual features such as layout, style and text placement, capturing important pieces of information through enhanced contextual understanding. Other approaches have used crowdsourcing to investigate whether data practices and privacy goals can be reliably extracted from privacy policies for completeness checking [35], [36], [37].
The major advantage of approaches in this area is that they provide an automated way of verifying whether the content of a privacy policy is complete according to the provisions of relevant data protection regulations such as GDPR. By designing completeness criteria based on data privacy goals or privacy-related provisions in the GDPR, these approaches can detect violations in privacy policies. These approaches have some limitations. First, they do not investigate the problem at the level of personal data or sensitive user actions in the privacy policy; solutions are developed by extracting metadata from the GDPR for completeness checking. A violation has taken place, for instance, if a controller is not named in a privacy policy, but such information says nothing about which sensitive or personally identifiable information was compromised. Second, the criteria are constructed from a subjective interpretation and comprehension of GDPR articles. Third, the GDPR distinguishes personal data and special categories of data in its definitions, which call for different processing requirements, yet present completeness-checking methodologies only examine the problem broadly. Finally, the approaches are not generalizable, as multi-domain evaluation of the metadata identification and completeness approaches has not been verified. To replicate metadata-based completeness checking for other data protection regulations such as the California Consumer Privacy Act (CCPA), a new conceptual model of privacy-policy metadata would have to be developed through systematic qualitative analysis and completeness-checking criteria for CCPA privacy policies, feeding into an automated solution. This required effort hinders the replication of the proposed methodologies.
B. Completeness Checking of Applications
Users are concerned about the privacy of applications they use, especially if sensitive user data is involved, as evidenced by user reviews of COVID-19 contact tracing apps [61]. Fan et al. [62] investigated GDPR compliance violations at the app privacy and code level in mobile health applications by verifying the completeness of the privacy policy, the consistency of data collection and the security of data transmission. In an exploratory study, Kununka et al. [63] examined the data handling practices and privacy policy compliance of Android and iOS apps for discrepancies. Hatamian et al. [64] studied the extent to which COVID-19 contact tracing Android apps comply with the legal requirements of GDPR. Rahman et al. [13] proposed an automated machine learning solution for completeness checking of dangerous permissions in Android applications against privacy policies and highlighted the non-transparent state of permission-policy declarations of dangerous Android permissions. Shezan et al. [48] developed an NLP-driven approach, NLP2GDPR, to automatically extract text from Android applications and generate GDPR-compliant features. Slavin et al. [19] created an approach that identifies privacy promises in mobile application privacy policies and checks them against the code using information flow analysis to see if data is extracted outside of an application, thus infringing on privacy policy declarations.
The approaches in this domain have made significant progress in compliance checking of applications. This is done by investigating the compliance level of different kinds of mobile applications with legal requirements in GDPR and investigating discrepancies in applications for violations. These approaches also go beyond the app privacy policies by checking for violations in the app code and permissions. One of the major limitations of these approaches is that they have not considered the three-pronged approach of completeness checking using the permissions, the app privacy policy and GDPR for a robust view. Apps require privacy policies, and those policies must be GDPR-compliant and disclose sensitive data access that requires dangerous permissions. It is this limitation that motivated the proposed research. By examining related works on completeness checking of privacy policies against GDPR, completeness checking of applications, identification of privacy-related requirements and NLP for semantic similarity in legal documents, it was observed that little or no empirical analysis has been conducted to measure the compliance of permission-policy statements for dangerous Android permissions with GDPR and the generation of GDPR-compliant permission-policy statements at design time using UML diagrams.
Methodology
The methodology investigated in this study is an NLP-based automated compliance checking of Android permissions-policy declaration against GDPR. We discuss the proposed methodology by describing the framework, dataset collection and pre-processing, language understanding algorithms for similarity matching and the similarity metric for measuring the distance between the vector representation of the permission policy and the GDPR corpus.
A. Overview of Framework
The proposed approach for checking runtime and design time GDPR compliance using Android app permissions spans four tasks. In the first task, we extract and process the text in GDPR using natural language processing algorithms. In the second task, we process the text from the annotated corpus that matches each dangerous Android permission to declarations used by over 270 mobile applications. In the third task, we perform the completeness checking of the Android permission declarations against GDPR articles and recitals. In the final task, we extract permission requirements from UML diagrams for GDPR-compliant permission policy generation at design time. In general, our approach enables an implicit compliance checking of the software using the dangerous Android permission declarations and the class diagram against the articles and recitals in the GDPR. Our work concentrates on providing automation for all the tasks. Figure 1 shows the framework for measuring the completeness of dangerous Android permission declarations in privacy policies against GDPR laws.
Figure 1. Completeness checking of dangerous Android permission declarations in app privacy policies against GDPR.
We propose a novel framework that leverages the MPP-270 annotated policy corpus, which maps permissions to privacy-policy snippets for all 10 dangerous permission categories; these snippets are compared against every GDPR article using six NLP algorithms at five textual dimensions to calculate cosine similarity (CS) results, as shown in Figure 1. The five textual dimensions are: the sentence level (SBERT and USE), the word-embedding level (BERT), the pure-string level (FSM), the vectorization level (VSM), and the N-gram level (BERT and GloVe vectorizations). The DAPD with the highest cosine similarity result is extracted for each GDPR article on every algorithm for all the permission categories. As a benchmark dataset for permission completeness, the open-source MPP-270 annotated corpus was developed in [13]. We used the annotated corpus to investigate the GDPR compliance of permission-policy statements. Details regarding the dataset’s development and the human annotation process are available on the project website [14].
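To make the selection step concrete, a minimal sketch of the per-article loop follows: every DAPD is scored against a GDPR article with cosine similarity, and the highest-scoring declaration is retained. The embed() function is a toy bag-of-words stand-in for any of the six algorithms, and the corpus literals are illustrative rather than the study’s data.

```python
import numpy as np

VOCAB = "consent withdraw data location contacts access collect".split()

def embed(text):
    # Toy bag-of-words embedding; SBERT/USE/BERT/GloVe/TFIDF fill this role in the study.
    tokens = text.lower().split()
    return np.array([tokens.count(w) for w in VOCAB], dtype=float)

def cos(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

gdpr = {"Art. 7": "the data subject may withdraw consent to data processing"}
dapds = ["we ask for your consent before we access your contacts",
         "we collect location data to provide navigation"]

for art_id, law in gdpr.items():
    scores = [cos(embed(law), embed(d)) for d in dapds]
    best = int(np.argmax(scores))  # most compliant DAPD for this article
    print(art_id, "->", dapds[best], round(scores[best], 2))
```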
B. Dataset
The framework requires two input datasets: the GDPR [1] and the MPP-270 [14] corpus. To measure DAPD compliance with GDPR laws, a GDPR corpus with suitable recitals was designed that contained every GDPR article.
C. Dataset Preprocessing
The GDPR and the MPP-270 corpus were pre-processed for the N-gram, VSM, FSM and BERT word-embedding implementations. The preprocessing steps include removing all stop words and punctuation and applying lemmatization. Lemmatization was chosen over stemming because lemmatization preserves more semantic context; context is important when applying semantic similarity, and stemming could cause the reduced words to become ambiguous or incorrect. Numbers were not removed, as some articles include references to laws and directives, which is considered an important aspect. For example, if a DAPD directly references a law or directive, then the compliance score should increase. For the SBERT and USE algorithms, stop-word removal and lemmatization were not applied, to maximize effectiveness and improve accuracy; SBERT takes into consideration the words to the left and right of each scanned word in a sentence to understand the sentence context. Additional measures were also implemented for the MPP-270 dataset to extract accurate information. For example, any value encountered in a column that was 0 or did not exist was not extracted for analysis and was handled accordingly.
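As an illustration, a minimal preprocessing sketch consistent with the steps above is shown below, using NLTK (an assumption; the study’s exact tooling is not specified). Punctuation and stop words are removed, tokens are lemmatized, and numeric tokens are kept so that references to laws and directives survive.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    # Strip punctuation and lowercase, then tokenize on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = text.split()
    # Drop stop words and lemmatize; numeric tokens are retained.
    return [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOP]

print(preprocess("Processing of personal data is restricted under Article 9."))
# ['processing', 'personal', 'data', 'restricted', 'article', '9']
```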
D. Semantic Similarity Algorithms
The goal of the semantic similarity algorithms is to extract textual entities at different textual dimensions from the GDPR and MPP-270 annotated policy corpus. The output takes one of the following forms, depending on the encoding method of the algorithm: sentences, word embeddings, strings, vectors, or N-grams. We describe the choice and methodology of the six algorithms implemented in this research below.
1) Sentence Embedding
SBERT was implemented by encoding the meaning of the specified sentence with the rest of the index for both DAPD and GDPR laws. The SBERT algorithm implemented the pre-trained model
The sentence embedding techniques SBERT and USE enable the embedding encoding methodology. Their outputs differ: SBERT produces a vector embedding representation, while USE outputs the result as a tensor object for each sentence. Universal Sentence Encoders have been used in [66] for encoding the texts of GDPR articles and privacy-by-design principles for automated text similarity tasks. Sentence embedding models have been utilised to detect dangerous Android permissions in app privacy policies in [18]. The use of USE and SBERT has yielded highly precise annotation in [67] for semantic matching between text associated with privacy controls and user queries.
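For illustration, a minimal sketch of the sentence-level comparison with the sentence-transformers library is shown below. The checkpoint name is an illustrative assumption; the study’s exact pre-trained SBERT model is not reproduced here.

```python
from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is an illustrative checkpoint, not necessarily the study's model.
model = SentenceTransformer("all-MiniLM-L6-v2")

gdpr_sentence = "The data subject shall have the right to withdraw his or her consent at any time."
dapd = "You can withdraw your consent for location access whenever you choose."

embeddings = model.encode([gdpr_sentence, dapd], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine similarity
print(f"SBERT cosine similarity: {score:.2f}")
```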
2) Word Embedding
The BERT word embedding algorithm implements the pre-trained model
The transformer architecture [68], which makes use of bidirectional self-attention, is the foundation of BERT. BERT’s attention mechanism operates on a collection of queries (Q), keys (K), and values (V) through a scaled dot product, where the dimension of Q and K is $d_{k}$:
\begin{equation*} \text{Attention}(Q, K, V)=\mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{1}\end{equation*}
Multi-head attention concatenates $h$ attention heads, each of which applies the attention function to learned linear projections of Q, K, and V:
\begin{align*} \mathrm{MultiHead}(Q, K, V)&=\mathrm{Concat}\left(\mathrm{Head}_{1}, \ldots, \mathrm{Head}_{h}\right) W^{O} \\ \text{where } \mathrm{Head}_{i}&=\mathrm{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \tag{2}\end{align*}
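For illustration, a small NumPy sketch of the scaled dot-product attention in Eq. (1) follows; it shows the mechanism BERT builds on rather than the study’s implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq_q, seq_k) similarity scores
    return softmax(scores) @ V                       # attention-weighted sum of values

# Toy check with 4 tokens and d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```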
The BERT base model (uncased) adopts Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) as training objectives to learn bidirectional representations. Given a sentence $s=(s_{1},\ldots,s_{n})$, MLM masks a subset of token positions $T$ and maximizes the likelihood of recovering the masked tokens $s_{T}$ from the unmasked context $s_{\backslash T}$, with model parameters $\theta$:
\begin{equation*} \log P\left(s_{T}\mid s_{\backslash T};\theta\right) \approx \sum_{t \in T} \log P\left(s_{t} \mid s_{\backslash T}; \theta\right). \tag{3}\end{equation*}
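A plausible sketch of the word-embedding dimension follows, extracting contextual vectors from the uncased BERT base model via Hugging Face transformers; the mean pooling and the library choice are assumptions rather than the study’s confirmed implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, 768) contextual vectors
    return hidden.mean(dim=1).squeeze(0)            # mean-pooled text vector (assumed pooling)

a = embed("We process your precise location to show nearby results.")
b = embed("Personal data means any information relating to a natural person.")
print(torch.cosine_similarity(a, b, dim=0).item())
```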
3) N-Grams
The N-gram algorithm analyses and extracts the most common bigrams (N-grams of two) in every GDPR law and DAPD. The algorithm then identifies the DAPD N-gram with the highest cosine similarity against each GDPR law. The most common and highest-similarity N-grams are then embedded using GloVe. The GloVe implementation is trained using the
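The bigram step can be sketched as below, collecting the most frequent bigrams of a GDPR law and a DAPD and comparing them with pre-trained GloVe vectors loaded through gensim. The glove-wiki-gigaword-100 download and the toy strings are illustrative; the study’s exact GloVe training corpus is not reproduced here.

```python
from collections import Counter
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # illustrative pre-trained vectors

def top_bigrams(tokens, k=5):
    """Most frequent adjacent word pairs."""
    return [bg for bg, _ in Counter(zip(tokens, tokens[1:])).most_common(k)]

def vec(bigram):
    """Average the GloVe vectors of the two words (skip out-of-vocabulary words)."""
    words = [w for w in bigram if w in glove]
    return np.mean([glove[w] for w in words], axis=0) if words else None

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

law = "personal data processing requires consent of the data subject".split()
dapd = "we process personal data only with your consent".split()

candidates = [(a, b, cos(vec(a), vec(b)))
              for a in top_bigrams(law) for b in top_bigrams(dapd)
              if vec(a) is not None and vec(b) is not None]
print(max(candidates, key=lambda t: t[2]))  # most similar bigram pair
```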
4) VSM TFIDF
The vector space statistical approach was utilised to describe the semantic similarity between the GDPR and sensitive Android permission-policy snippets using a TFIDF implementation of VSM. Shahmirzadi et al. [44] used Vector Space Modelling (VSM) to extract patent-to-patent similarity metrics and evaluate the performance of VSM on a variety of TFIDF variations and text similarity methodologies. Their findings show that the baseline Term Frequency-Inverse Document Frequency (TFIDF) implementation of VSM is an appropriate choice for determining text similarity, while other TFIDF versions were not beneficial. We implemented the algorithm since the findings from [44] demonstrated that TFIDF VSM is suitable for determining textual similarity at the vector level. For the cosine similarity result, the
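A minimal sketch of this vector-level comparison with scikit-learn follows; both corpora are vectorised in a shared TFIDF term space before computing cosine similarity. The strings are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

gdpr_laws = ["the data subject shall have the right to withdraw consent at any time"]
dapds = ["you may withdraw your consent for location access at any time",
         "we share diagnostic data with third parties"]

# Fit one vocabulary over both corpora so the vectors live in the same term space.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(gdpr_laws + dapds)

sims = cosine_similarity(matrix[: len(gdpr_laws)], matrix[len(gdpr_laws):])
print(sims)  # rows: GDPR laws, columns: DAPDs
```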
5) Fuzzy String Matching
The FSM algorithm was implemented using the
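The specific string-matching library is not named above, so the sketch below uses rapidfuzz as a stand-in to illustrate pure string-level scoring between a GDPR excerpt and a DAPD.

```python
from rapidfuzz import fuzz

gdpr = "the controller shall obtain consent before processing personal data"
dapd = "we obtain your consent before we process your personal data"

print(fuzz.ratio(gdpr, dapd))            # edit-distance based score, 0-100
print(fuzz.token_set_ratio(gdpr, dapd))  # order-insensitive token overlap, 0-100
```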
Other methods for representing textual dimensions in the domain of topic distribution and clustering algorithms were investigated for their suitability. Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA) and Hierarchical Dirichlet Process (HDP) were all experimented with, using Jensen-Shannon distance, Wasserstein Distance (WD) and Euclidean distance as metrics to find similarities between textual entities in the corpora. However, for these algorithms to produce findings that are dependable, stable and consistent, a large corpus is required. Since the GDPR corpus is extensive but the MPP-270 corpus is short, this strategy was quickly determined to be inadequate. Table 2 shows the NLP algorithms, their techniques, methodologies, encoding methods and output results.
E. Cosine Similarity
To measure the results of the USE, SBERT, BERT word embedding, N-gram and VSM algorithms, cosine similarity was adopted to interpret a measurement of similarity between indexes of the two corpora. The choice of the cosine similarity measure for computing the statistical similarity between two textual entities is consistent with its effective use in document similarity tasks [73], [74], [75], [76]. Furthermore, the effectiveness of the cosine similarity measure has been validated in completeness-checking tasks mapping privacy policies against GDPR [29], [62], [77]. One of the aims of this study is to find permission-policy snippets that maximize integrability with GDPR compliance through semantic similarity, and cosine similarity is the most suited for the task. Other metrics such as Wasserstein Distance (WD) were considered, but WD assumes the inputs are probability distributions, whereas the implemented algorithms represent text using embeddings and vectorizations.
The cosine similarity for comparing two vectors is defined as follows:
\begin{equation*} \cos({\mathbf{X}}, {\mathbf{Y}})= \frac{{\mathbf{X}} \cdot {\mathbf{Y}}}{\|{\mathbf{X}}\|\,\|{\mathbf{Y}}\|} = \frac{\sum_{i=1}^{n} {\mathbf{X}}_{i} {\mathbf{Y}}_{i}}{\sqrt{\sum_{i=1}^{n}({\mathbf{X}}_{i})^{2}}\;\sqrt{\sum_{i=1}^{n}({\mathbf{Y}}_{i})^{2}}} \tag{4}\end{equation*}
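Eq. (4) translates directly into code; the short NumPy sketch below computes the same quantity for two embedding vectors.

```python
import numpy as np

def cosine_similarity(x, y):
    # Eq. (4): dot product normalised by the vector magnitudes.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([0.2, 0.8, 0.1])
y = np.array([0.3, 0.7, 0.0])
print(round(cosine_similarity(x, y), 3))
```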
Evaluation
This study aims to answer three key questions and sub-questions that inform the experimental design.
RQ1: Does the declaration for sensitive Android permission in the App permission policy contain meaningful and relevant information in line with GDPR articles and recitals about collecting and processing sensitive data?
Is the range of sensitive or dangerous Android app permissions supported by the Android ecosystem adequate and sufficient to fulfil GDPR obligations?
What is the level of GDPR compliance of DAPD used by developers?
Does GDPR use meaningful and relevant language to enhance the completeness checking of Android app permissions?
Are the declarations used in DAPD detailed enough?
RQ2: How can we generate privacy policies for sensitive dangerous permissions requested by an app from the UML diagram in such a way as to clearly and specifically inform users about the sensitive data being requested, the actions the permissions represent and their semantic meaning, in line with data protection laws?
Is it feasible to assist mobile app developers in the automated generation of GDPR-compliant permission-policy snippets by extracting permission requirements from UML diagrams at design time?
RQ3: Can we adequately conduct compliance checking by matching GDPR laws with Android permission categories, APIs and permission-policy declarations?
To what extent is it possible to accurately classify dangerous Android permissions with GDPR?
To answer RQ1, experiments were conducted by mapping the permission-policy snippets of dangerous Android permissions for measuring completeness and compliance. To answer RQ2, UML diagrams in the form of XML data or raw PNG files are taken as input for requirements engineering and privacy policy generation for sensitive Android permissions for design time compliance. To answer RQ3, we analyse the results from the runtime analysis using permission-policy snippets in RQ1 and design time analysis using UML diagrams to measure the effectiveness of GDPR compliance at design and runtime using Android permissions.
In presenting the results, we use terms such as average declaration, cosine similarity average, and highest average identified. The average declaration is a metric calculated for each permission category: every GDPR article is matched with every DAPD methodology, an average is calculated from the resulting cosine similarities, and a cosine similarity average is then derived across all GDPR-DAPD comparisons. This can be described as an average of averages. An equivalent FSM score is calculated in Table 3. The highest average identified declaration metric, on the other hand, takes the cosine similarity result of the highest identified DAPD methodology for each GDPR article in each permission category.
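Assuming a matrix sims with one row per GDPR article and one column per DAPD methodology for a permission category, the two metrics reduce to the following sketch (the numbers are illustrative).

```python
import numpy as np

def average_declaration(sims):
    # Mean over DAPD methodologies per article, then mean over articles: an average of averages.
    return float(sims.mean(axis=1).mean())

def highest_average_identified(sims):
    # Best DAPD per article, averaged over articles.
    return float(sims.max(axis=1).mean())

sims = np.array([[0.21, 0.34, 0.12],
                 [0.40, 0.18, 0.55]])  # 2 GDPR articles x 3 DAPD methodologies
print(average_declaration(sims), highest_average_identified(sims))
```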
A. RQ1: Completeness Checking of Sensitive Android Permissions and GDPR
To answer RQ1, the permission-policy snippets of the dangerous Android permission with the highest cosine similarity are extracted for each GDPR law using all the textual similarity algorithms (cf. Section III-D) for all permission categories (cf. Table 1). The result of the experiments shows the most compliant dangerous Android permission-policy declarations to use for each GDPR law. In Table 3, the highest DAPD cosine similarity result for every GDPR law is compared to the average cosine similarity result for every GDPR law, to visualize the increase in GDPR compliance when the correct DAPD methodologies are used. Table 4 shows the FSM results for the average FSM DAPD compliance with GDPR compared to the highest FSM DAPD methodologies for every GDPR law. It is important to note the difference in scale and sensitivity of each algorithm’s textual similarity score when measuring compliance. For example, a cosine similarity of 0.50 for GloVe might be considered very high given the nature of its vectorization, while a BERT word-embedding cosine similarity of 0.60 would be described as low and 0.80 as high, given the contextual nature of BERT and its ability to find similarities between long-distance words.
Table 3 shows low compliance when the cosine similarity results are derived using VSM TFIDF. This shows that, at the vector level, the methodologies used to declare DAPD do not comply with GDPR laws. VSM TFIDF is the equivalent of searching for word-to-word similarity and uses term frequency to derive a result on how important certain words are. The compliance is low because certain words contained in GDPR laws are not used in the DAPD methodologies. While the level of compliance is expected to be low at this textual dimension considering how VSM TFIDF functions, in some cases compliance increases significantly; for the STORAGE category, compliance increased by 325% to the highest cosine similarity average of 0.34. The compliance level is still low, but this means that 34% of the words and their associated term frequencies match between the GDPR and the highest-scoring identified DAPD. Such results could indicate that a developer could raise DAPD compliance by using contextually similar words. Table 4 shows the results from both FSM algorithms. Using the highest DAPD increases GDPR compliance substantially for both, although more reliable results were derived from the pre-processed word-embedding technique. The USE results in Table 3 are poor, although this was expected considering the pre-trained model was trained on a question-and-answer dataset.
In Table 5, the highest identified cosine similarity is the highest similarity value identified between a GDPR law and the permission-policy declarations for each dangerous Android permission category. The average highest cosine similarity is derived from the highest cosine similarity result of every DAPD vs GDPR comparison. In contrast, the overall average cosine similarity is the average derived result of each DAPD vs the corresponding GDPR law. The main issue identified relates to the average result of the methodologies developers use to declare DAPD, which are not compliant with GDPR. The contribution of this research is the identification of the most compliant DAPD for a developer to use for each corresponding GDPR law in each dangerous permission category. Analyzing the average cosine similarity results from Table 5, it is found that the
With the SBERT results in Table 6, it is found that DAPDs comply differently with different sentences in GDPR laws. Another contribution of this research is the identification of sections of GDPR sentences that are not covered or that reduce DAPD compliance. Not only are the sections that reduce compliance identified, but the DAPD methodology that best complies with each sentence is identified as well. Though these results expose compliance issues, even the best methodologies do not adequately cover certain sentences in parts of GDPR laws. This could reveal that more in-depth methodologies may be needed to comply with all sentences of GDPR laws. Table 6 presents an example in which two similar sentences in the same GDPR law use the same highest identified DAPD. The first aspect to note is that identifying the highest-complying DAPD for each sentence significantly increases compliance with the GDPR law. The average cosine similarity compliance result for the first sentence in Table 6 is 0.29, while using the highest identified complying DAPD increases the compliance with GDPR to a cosine similarity value of 0.62. The second sentence in Table 6 is very similar to the first sentence but has a different context. The DAPD used is the same as for the first sentence with the higher level of compliance; this shows that more in-depth detail in declaring DAPD is needed for every sentence of a GDPR law to comply. The low compliance value for the second sentence shows that one specified declaration is not adequate to cover the entirety of GDPR laws. DAPD could, in theory, be mapped to each sentence of a GDPR law to derive the best level of GDPR compliance. The DAPD was not split into sentences, to enhance the investigation into the parts of GDPR that lack compliance when compared to the dangerous permission declaration methodology used. Some declaration methodologies are also too small to compare at the sentence level, for example,
Also, Table 6 shows two different sentences from the GDPR law
Table 7 shows the identification of the most frequent N-Grams found in the highest complying DAPD and corresponding GDPR law for each permission category. These N-grams are compared to derive a GloVe vectorization and BERT word embedding result to determine the contextual and global vectorization similarity between the N-grams. The dangerous permission categories
Based on the analysis using several algorithms, the textual similarity dimension with the highest similarity results was found to be the BERT word embedding implementation with the most accurate variant being the pre-processed implementation. Thus, a more in-depth analysis to compare GDPR and the associated text in the identified highest DAPD was conducted. The use of SBERT directly identifies where compliance is failing between each GDPR law and DAPD.
1) Developer Perspective
Do the sentences developers use to declare DAPD in Android map to relevant and meaningful information in the GDPR articles? Using the highest identified policy methodologies in Table 5, each DAPD and its corresponding GDPR law was analyzed to investigate whether the mappings are meaningful. In some dangerous permissions, the mappings between the permission-policy declaration and corresponding GDPR articles are meaningful. Permission categories with meaningful mappings are
There are other cases where, contextually, both pieces of data from the permission-policy description and the GDPR match, but they do not in any way represent the sensitive data requested by the permissions or the actions they represent. Permissions in this category are
Why do a large number of DAPD lack GDPR compliance? While the context of GDPR laws and the highest identified DAPD is similar, some GDPR laws have different aims. From a developer perspective, the difficulty in achieving better compliance across methodologies for declaring dangerous Android permission policies may relate to a lack of context and difficulty targeting certain GDPR laws, because there are no appropriate recitals and articles that accurately capture the permission category. As demonstrated in [13], another possible reason for the lack of compliance could be that these permissions, such as
2) Platform Perspective
Is the range of permission categories used in the Android ecosystem sufficient? It is difficult for a developer to comply with every GDPR law given the limited range of dangerous permission categories. For example, it might be difficult for a developer to comply with articles on ’Territorial Scope’ while declaring the usage of a dangerous permission category that is more focused on complying with articles such as ’Conditions for Consent’. Compliance with some GDPR laws is more crucial than with others. For instance, Article 4 GDPR - Definitions is very important as it defines the different kinds of personal sensitive data protected under GDPR. This suggests that the Android ecosystem can develop permissions around these categories of sensitive data that apply to mobile applications, and ensure that the permission-policy description aligns with the provisions in GDPR for collection, transparency and processing requirements. On the other hand, increasing the number of dangerous permission categories may complicate and confuse the process of declaring compliant DAPD for developers. However, the advantage of expanding the number of permission categories for compliance is that certain categories can be used to target crucial GDPR laws. As shown by the results in Table 5, carefully constructed DAPD can contextually comply with GDPR laws. One solution is for Google to create more dangerous permission categories based on selections of GDPR laws, thus allowing developers to target sections of GDPR. Since permissions on Android aim to support users’ privacy by also protecting access to restricted actions and not just restricted data [3], the definition of the restricted actions can be influenced by Chapter 2 (Art. 5-11) Processing and Chapter 3 (Art. 12-23) Rights of the Data Subject to create the required dangerous permissions. Determining which articles and recitals should be targeted could lead to other compliance issues and misinterpretation. Since they map neatly to important categories of personal data in the GDPR, as shown in Table 8 (where Y stands for Yes and N for No), we believe that the app permission categories supported by the Android ecosystem are sufficient. The metadata from
Is it that the language used for GDPR laws is not explicit enough? From Table 5, the larger articles that cover more scope tend to have higher compliance results. Article 4 GDPR - Definitions is one of the articles with the most depth and scope. The average cosine similarity compliance result across every permission category for the highest identified corresponding DAPD is 0.84. This may indicate that GDPR laws that are less explicit and have a larger scope may be easier for a developer to comply with.
Do the declarations used for DAPD need to be longer and more detailed? The results from Table 5, which derives the highest cosine similarity compliance results with the corresponding GDPR laws, and Table 7, which details the cosine similarity between N-grams, indicate that the best method for creating DAPD methodologies is to structure them with a similar syntactic and contextual structure to the targeted GDPR law rather than reusing the same words. As per the results from Table 6, the methodologies used to declare DAPD do not comply well with all sections of the associated GDPR law. This may indicate that, to comply to a high standard, the highest identified DAPD for each sentence may need to be used and conjoined into a longer, more detailed declaration. The contribution from the results of the BERT sentence embedding techniques enables this: each sentence of each GDPR law can be paired with a DAPD with the highest identified compliance. An argument can also be made about mismatches between the explicit declarations used in GDPR and the terms used in DAPD policies. Table 5 shows that although the DAPD and the associated GDPR law are contextually consistent, the actual aim of the GDPR law is often completely different. For example, the DAPD in the
B. RQ2: Permission-Policy Generation at Design Time With UML Diagrams
To answer RQ2, a proof of concept for a class-relationship UML design time compliance tool for automatic dangerous Android permission-policy generation was developed. This approach is focused on developers that use UML and is implemented using the results derived from the BERT word embedding, since it generated the best results for completeness checking (cf. Section IV-A). The sample class-relationship UML image used for the Tesseract OCR Engine5 text extraction component of the design time tool is sourced from a section of a large UML diagram that was used for an actual mobile application. For the XML data input, the entire XML data of the UML diagram used for the image snippet was used as input. The rationale for using UML diagrams relates to a design-time-oriented approach in which GDPR-compliant DAPD can be generated using information available during the development process rather than at the end of a development life cycle. Such a method creates a new approach to developing GDPR-compliant applications. This approach saves developer time, reduces errors from developers who are not knowledgeable about data protection laws, reduces the likelihood of non-compliant DAPD methodologies being created for the privacy policy, removes costly legal fees associated with privacy policy creation, equips developers with a tool to streamline DAPD, and increases transparency about the compliant methodologies a developer should use to comply with each GDPR law.
Figure 2 describes the framework for the tool, where a user (developer) can upload the UML diagram as either an image or an XML file and is prompted to select a permission category. If the input is an image, the Tesseract OCR engine is used to detect all the words in the image, while all the values associated with the text in the XML are extracted if the input is XML data. Regardless of the input type, developer naming conventions such as camel casing and underscores are removed from the extracted text, separating the original identifiers into their individual words. Preprocessing of both the input and every GDPR law then takes place, in which stop-word removal and lemmatization are applied and the highest identified DAPD for each GDPR law is loaded. BERT word embedding using the
Table 9 demonstrates how UML data extracted from an image is transformed into generated permission declarations when compared against laws that reach a user-defined contextual similarity threshold. All three laws identified generate the permission most contextually similar to the inputted UML data. The threshold value used in this example was 0.09, meaning each law needs a contextual similarity of approximately 9% for the DAPD to be generated automatically. Another use of the tool is that, once developers upload their UML, the tool automatically scans it against the dangerous Android permission categories, produces as output the dangerous Android permissions that the application requires based on the UML, and generates an optimal permission-policy description based on the MPP-270 corpus that complies with GDPR. Based on these results, developers can target specific articles of GDPR of interest for compliance, and specify thresholds based on the business needs of their requirements engineering. With the proof of concept in Figure 2 and the results in Table 9, it is possible to automate privacy policy generation for dangerous Android permissions from UML diagrams. Originally, a classification model was planned, but the NLP approach was more suitable given the small size of the MPP-270 corpus.
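One step of the pipeline in Figure 2, the removal of developer naming conventions from extracted UML text, can be sketched with a regular expression; the function name and examples are illustrative.

```python
import re

def split_identifier(name):
    # photo_upload -> "photo upload"; getUserGPSLocation -> "get User GPS Location"
    name = name.replace("_", " ")
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", name)
    return name.split()

print(split_identifier("getUserGPSLocation"))  # ['get', 'User', 'GPS', 'Location']
print(split_identifier("photo_upload"))        # ['photo', 'upload']
```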
C. RQ3: Suitability of GDPR for Completeness Checking of Android Application Permission Policies
We have demonstrated the possibility of inferring some dangerous Android permissions from GDPR articles such as
To further corroborate the results from Table 5, we investigated whether the permission categories, sensitive Android API usage, the sensitive data they request, the actions these permissions represent, or their semantic meaning are implicitly or explicitly declared in the matched GDPR. As shown in Table 10, we denoted the result as NM - Not Mentioned, IM - Implicitly mentioned and EM - Explicitly Mentioned, which shows that some of the permissions in the Android ecosystem can be inferred and categorised from GDPR articles and recitals.
We argue that the GDPR is adequate for sensitive Android permission declaration completeness, as its articles either explicitly cover Android permission-policy information or implicitly describe sensitive user data collection and processing. In some permission categories, the permission-policy snippets can be matched explicitly with GDPR articles and recitals, while in other scenarios the permission category is only implicitly covered in the GDPR. There are some reasons for the implicit matching in some permission categories. Firstly, some permission-relevant information from the MPP-270 could have explicitly matched GDPR articles and recitals if it contained relevant information about sensitive data collection. For example, Article 4 GDPR - Definitions provides permission-policy information for biometric data, which includes facial images and dactyloscopic data, which should have been directly mapped to the
Another reason for some of the Not Mentioned or Implicitly Mentioned cases in Table 10 is the language used by GDPR and Google in defining personal and sensitive user data. Table 11 shows the definitions of personal and sensitive user data by Google6 and GDPR (cf. Art. 4 GDPR - Definitions, Art. 9 GDPR - Processing of special categories of personal data). Voice is considered personal data under GDPR, because it is information relating to an identified or identifiable natural person, and in some cases voice recordings may constitute biometric information under GDPR. While Google uses clear and direct language for voice data, defining the microphone as sensitive user data, the GDPR lumps it under PII or biometric data, which is ambiguous and generic. As a result, developers might find it easier to write permission-policy snippets using language that complies with the Google Play Developer Programme Policies than with GDPR. Another case in point is the
To further argue that GDPR is suitable for completeness and compliance checking using Android permissions, we align the Google Privacy Policy and Terms of Service (ToS) with GDPR to investigate how the sections in Google Privacy & Terms match articles in the GDPR. Since the Android operating system that supports the app permissions investigated in this study is a platform owned by Google, investigating the completeness of Google Privacy & Terms against GDPR provides additional insights into the suitability of GDPR. There are 16 and 11 sections in the Google Privacy Policy and ToS, respectively. To achieve this goal, BERT embeddings were used to match the different sections of the Google ToS and privacy policies to GDPR. The closest matching GDPR articles are then identified using the articles with the highest cosine similarity. Table 12 shows the results of the analysis of the Google ToS and privacy policies against GDPR. An interesting insight is that, contextually, the different sections of the Google ToS and privacy policies all match with articles in the GDPR with an average cosine similarity value of 0.83, except the section on Updates in the ToS, which matches with Repeal of Directive 95/46/EC. This may also show that contextually structuring a permission-policy declaration yields higher results, as Google has contextually structured the majority of its ToS and privacy policies towards Article 4 GDPR - Definitions, a key section of the GDPR that discusses the general provisions of the regulation. We can therefore conclude that the GDPR is suitable for performing completeness checking of the Google Privacy Policy and Terms of Service, which can be cascaded down to Google platforms such as the Android operating system that supports app permissions.
Limitation and Future Work
One of the limitations of the research is the examined annotated policy corpus. The systematic mapping between app privacy policies and Android permissions was done by manually annotating 270 Android application privacy policies. The apps were selected based on popularity, measured by the number of downloads and user ratings. Firstly, there are currently over 2.65 million apps and games in the Google Play Store7 and 270 apps are not representative of the app distribution. Additionally, apps (including games) on the Google Play Store fall into 49 categories;8 however, the top 270 apps used in the corpus only covered 13 app categories. The annotated policy corpus for mapping between permissions and privacy considered 30 dangerous permission APIs; however, there are 42 dangerous permission APIs in the official Android API documentation.9 This means that the coverage of the permission-policy snippet analysis for compliance was not investigated for some permission groups that are not part of the 10 considered dangerous permission categories or for permissions added in newer API releases. Finally, paid apps were not part of the apps selected for policy annotation. The implication of these selection biases is that permission-policy behaviours might vary between apps and games, popular and non-popular apps, paid and free apps, and evaluated vs non-evaluated app categories. The transparency of the app privacy policies used in creating the gold-standard dataset could be biased towards the selection criteria, which are not a true representation of the app market. However, we argue that this limitation does not affect the findings in this research, as we focused on investigating the suitability of GDPR for completeness checking of permission-policy declarations. Since the corpus depends on human annotators to find permission-policy snippets in the app privacy policy for each permission declared in the manifest file, the corpus is highly subjective in its interpretation of privacy policies for permission transparency. This is due to the nature of privacy policies being ambiguous and subject to multiple interpretations, even among privacy and legal experts [78], [79], [80].
The semantic relationships of textual description bi-grams using GloVe, word2vec and fastText were investigated in [70], and the results revealed inaccuracies in the way each algorithm matches semantic and context-driven disambiguation between entities. Other findings suggested that word embedding techniques sometimes struggled to produce accurate results for words with similar meanings, which reveals the limitation of the technique in understanding the context a human would interpret in certain bi-grams. Such an issue may have affected the performance of the GDPR completeness-checking approach for dangerous Android permission-policy declarations, as the word embedding techniques used may at some point have misinterpreted the semantic relationship with other words in the sentence transformer techniques or the N-gram-driven experiments. As the UML design time compliance tool is a proof of concept, it focuses solely on developers that use class-relationship diagrams with UML in the software development cycle. This could alienate the proportion of developers who do not use UML during development or do not have a UML class-relationship diagram. For permission requirements engineering, different sources of design time elements beyond UML, such as UI textual descriptions, can be leveraged. UI textual descriptions have been employed in [70] for the semantic resolution of permission request patterns in Android apps. These texts may also describe access to restricted data or sensitive actions. For example, a UI text field can have a description like “Upload supporting files”, “Take a photo” or “Start recording an audio message”, which all access private user data or sensitive actions such as STORAGE, CAMERA and MICROPHONE protected by permissions. Regardless of the source of the design time information, whether UML diagrams or UI textual descriptions, we have demonstrated the relevance of our approach to automated permission-policy generation.
The solution could also be extended to other permission-declaring ecosystems, such as iOS applications and browser extensions. A similar analysis of permission compliance could be investigated for other GDPR-like laws, such as the California Consumer Privacy Act (CCPA) and the Payment Card Industry Data Security Standard (PCI DSS). Concerning GDPR, an expanded empirical analysis could be conducted by implementing more textual representations, thus expanding the scope beyond the textual dimensions studied here. The measurement of textual similarity was mapped in [81] to textual distance and representation, highlighting the many combinations that can be used, both textually and numerically, to derive results for an enhanced conclusion regarding DAPD-GDPR compliance. The development of an application that generates compliant and contextualized DAPD using machine learning based on information from the UML could also be investigated, although this approach would require large amounts of DAPD data to generate compliant, contextualized declarations unique to each application. The UML design time tool concept could be extended through a browser extension plug-in in which the tool scans the DAPD in a privacy policy and detects inadequate DAPD; developers of applications could then be alerted if such declarations fall below a compliance threshold. This idea could be deployed by the Google Play Store as part of the approval process for uploaded applications, requiring the privacy policy to contain compliant DAPD, although it would need a substantial amount of training data that is not yet available. The UML design time tool concept could also be expanded to include other structural and behavioural UML diagram components such as flowcharts, entity-relationship diagrams, and sequence and activity diagrams; collecting the information in these diagrams could assist in creating more targeted and compliant DAPD. Another future direction is investigating other pre-trained models for language understanding, such as MPNet [82], which combines masked and permuted language modelling.
Conclusion
This paper investigates runtime and design time GDPR completeness checking using dangerous Android permissions. For the runtime analysis, completeness checking was done by representing the permission-policy declaration for each permission category requested in the app privacy policy. For the design time analysis, UML class diagrams were utilized to extract permission requirements from the class elements and generate a permission-policy declaration that is GDPR-compliant. Through the results, we demonstrate the most compliant permission-policy declarations for each permission category. As previously highlighted, developers lack the legal knowledge to develop compliant permission-policy declarations. This paper contributes to the state of the art by developing a tool that equips developers to automatically generate GDPR-compliant DAPD methodologies and avoid non-compliant DAPD, using design time requirements and without requiring legal knowledge from the developer. We also demonstrated that the completeness of a permission policy with respect to GDPR articles can be substantially improved by applying a contextual structure similar to the targeted GDPR law rather than replicating its exact words in the DAPD. Other state-of-the-art solutions focus on generating requirements or taking already-created privacy policies for textual analysis; this project combines NLP with semantic similarity to automatically generate compliant DAPD from requirements in UML class diagrams.
One area of future work we are keen on exploring is the usability analysis of the proposed UML tool. Since the goal is to help actual developers with privacy policy generation and requirements elicitation through GDPR-compliant permission declarations using UML diagrams, a usability evaluation would help measure the extent to which developers can learn and use the tool to achieve compliance goals, especially for their permission-declaring systems such as browser extensions and mobile apps. User satisfaction from the usability evaluation process will serve as feedback into the tool development process to improve its effectiveness, efficiency, flexibility and robustness. Such a usability study contributes to building compliance tools that are developer-friendly and developer-centric. Another area of future work involves creating a larger benchmark annotated policy corpus for permission completeness. In this study, we leveraged MPP-270, which creates a mapping between permissions requested (declared in the app manifest file) and permission-relevant information in the app privacy policies, created by manually annotating 270 Android application policies. With a larger annotated corpus, a classification model built on machine learning algorithms could be integrated into our solution.