Measuring Software Obfuscation Quality–A Systematic Literature Review

Software obfuscation techniques are increasingly being used to prevent attackers from exploiting security flaws and launching successful attacks. With research on software obfuscation techniques rapidly growing, many software obfuscation techniques with varying quality and strength have been proposed in the literature. However, the literature on obfuscation techniques has not yet been coherently collated and reviewed. This research paper aims to present an overview of state-of-the-art software obfuscation techniques, focusing on quality and strength. A systematic analysis and synthesis of literature published between 2010 and April 2021 has been performed to identify the common obfuscation quality attributes and their measures, the publication venues, and the home countries of the researchers. We have identified potency, resilience, cost, stealth, and similarity as the quality attributes most widely used to evaluate obfuscation techniques. In addition, different measures have been used to quantify these qualities, such as complexity (to measure potency), human effort (to measure resilience), efficiency (to estimate cost), and multiclass performance metrics, distance measures, and matching methods (to quantify similarity). These measures were then categorized into sub-measures. The literature lacks research in the following two areas: empirical research using a case study strategy, i.e., real-world datasets, and measurements of obfuscation stealth. Researchers did not address stealth as clearly as they addressed potency, cost, and similarity.


I. INTRODUCTION
Software obfuscation is a technique that obscures the structure and/or behavior of software code without impacting its expected functionality such that the code is rendered hard to understand, analyze, or reverse engineer [1], [2]. Software obfuscation techniques are developed for both malicious (e.g., evading automatic static code inspection) and benign (e.g., protecting code privacy or intellectual property) purposes [3]. For example, malware authors use software obfuscation to evade detection and thwart inspection and removal. Although there is no guarantee that the obfuscated code will be completely immune to reverse engineering, obfuscation increases the effort or cost required to learn the obfuscated code's functionality [1], [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Chunsheng Zhu.
Various obfuscation methods and techniques have been proposed in the existing works, ranging from syntactic techniques (e.g., inserting opaque predicates) to semantics-based techniques (e.g., complicating control flows). These obfuscation techniques differ in quality and strength. Despite this, there is a lack of measurement in this field to report the quality and strength of the proposed techniques [4], and it remains unclear how ''good'' these techniques are [5]. Measuring the quality of obfuscation techniques is not only useful but also necessary because the obfuscator cannot tell whether the obfuscation is effective without a measure of its efficacy. In addition, developers can improve obfuscation quality by increasing the number and type of obfuscation techniques they use. Therefore, measurement is required at least for evaluating and controlling the obfuscation because, as DeMarco's rule states, ''you can neither predict nor control what you cannot measure'' [6]. In such a context, measurement is concerned with quantifying an attribute of an obfuscated code. It is nothing more than a number or symbol that is assigned to the obfuscated code to characterize its quality [1]. From the literature, the most common obfuscation quality attributes are cost, resilience, potency, stealth, and similarity. Although such attributes date back to the 1990s, researchers have shown a genuine and growing interest in discussing obfuscation quality over the last ten years. This interest stems from the fact that the amount of malicious software has increased over the last few years, as approximately 1 million malware samples are created daily [7]. This issue has become worse with the adoption of pervasive computing and as technology has shifted toward an Internet of Things.
This review article focuses on the measures used to quantify obfuscation quality in the past decade. To this end, we conduct this systematic literature review (SLR) after exhaustively analyzing the relevant studies. The paper makes the following main contributions:
• We perform this SLR according to the key features of measuring software obfuscation quality.
• We provide a systematic overview of the current studies in this area.
• We divide the quality attributes of software obfuscation into five categories: potency, resilience, cost, stealth, and similarity. Thereafter, we classify the existing measures used to quantify the obfuscation quality into 44 sub-measures.
• We discuss the findings and provide directions for further studies in measuring obfuscation strength.
The rest of the paper is structured as follows. Section II briefly describes the existing work, and the systematic review itself is presented in Section III. The findings are discussed in Section IV. Section V discusses some threats to the study's validity. We conclude the paper in Section VI.

II. EXISTING SYSTEMATIC REVIEWS
Although useful, existing work has not put significant focus on obfuscation quality and strength measurement. In the following, we will discuss several review articles on obfuscation transformation techniques.
A study by Humayun et al. [8] was conducted to evaluate the most common cyber security threats based on 78 primary studies. The results showed that, as the current security approaches target security generally, more empirical validation and actual implementation are needed for the solutions presented in these studies. Additionally, their findings revealed that most of the research focused on a small number of ordinary flaws, for instance, social engineering, denial-of-service attacks, and malware. Novak et al. [9] presented an extensive SLR on detecting source code similarity in academia. They looked at 150 primary studies from various angles, including automated tools for detection, performance metrics, methods for obfuscation, deployed datasets, and the types of algorithms used. The multiclass metrics were the most common metrics for assessing system quality and comparing similarity. Khoshavi et al. [10] conducted a survey on stack cache memory modules to present a taxonomy of attack vectors, most of which were side-channel attack points. The authors discussed side-channel attacks, including how they work and how obfuscation techniques can help to prevent them. The findings included novel variations of side-channel attacks to perpetrate attacks on systems, as well as countermeasures against the attacks. Asadoorian et al. [11] outlined the best practices for security throughout each development stage in the software development life cycle (SDLC). They suggested using the following two types of obfuscation techniques in the coding stage: branch insertion/opaque predicates and obfuscation with random numbers. Hataba and El-Mahdy [12] briefly explained the well-known obfuscation techniques that use various transformations (e.g., layout, control-flow, data, and preventive transformations) and how to implement them. Additionally, they discussed how to evaluate these techniques using a set of reasonable criteria, including potency, resilience, and cost.
Hosseinzadeh et al. [4] conducted an SLR of diversification/obfuscation techniques. They collected 357 relevant articles that had been published between 1993 and 2017. There are several techniques to obfuscate a piece of code, each of which is used at various stages of the SDLC and targets different sections of the code. The research gaps they identified included the following: (a) different execution environments still need to benefit from obfuscation techniques; and (b) measurements of the effectiveness of obfuscation are lacking. Pan et al. [13] performed an SLR to review Android malware detection using static analysis. They gathered 98 studies published from 2014 to 2020. The static analysis categories included Android characteristics, opcodes, code graphs, and symbolic execution. They concluded that, in terms of detection, the neural network approach outperformed the non-neural network approach. Furthermore, there is still a need to improve identification through the use of novel techniques and the establishment of a single framework for performance evaluation. In the study by Wang et al. [14], an SLR was used to identify malicious apps by analyzing the behaviors of apps using different features. The authors examined the extracted features, the feature subset selection techniques, the detection approaches, and the scale of performance evaluation. With the wide use of code obfuscation, extracting useful features from the code using static analysis is quite difficult. In contrast, when it comes to feature extraction of malware apps, dynamic analysis outperforms static analysis. Lu [15] reviewed 14 cybersecurity papers published between 2008 and 2017. The author divided the articles into individual, employee, and organizational categories. The author painted a picture of the state of cybersecurity and, using the R Project, created an integrative framework that could potentially text-mine end users' security behaviors and decision-making processes in the event of a security breach.
Balakrishnan and Schulze [16] reviewed code obfuscation techniques that could be applied at the following different abstraction levels: data abstraction, procedural abstraction, data types, and control-flow.
Following the discussion above, measuring software obfuscation quality has not attracted sufficient attention from researchers. Our work aims to bridge this gap by providing researchers with an overview of the existing obfuscation quality measures that are commonly used to quantify the quality of obfuscation techniques.

III. RESEARCH METHODOLOGY
Recently, the term ''systematic literature review'' has appeared in the titles of various information security research papers. It is used in parallel with terms like ''systematic mapping study'' and ''systematic review'' (for examples, see [4], [8], [9]). A systematic literature review is used for identifying, evaluating, synthesizing, and interpreting the studies relevant to a specific research question or subject [17]. As a result, a systematic literature review can identify gaps in the existing works and suggest areas for further research. In our study, we adopted the guidelines suggested in [17]-[19] to carry out the systematic analysis. Because we follow specific guidelines/protocols [20], the study is unbiased and repeatable by other researchers, which differentiates systematic review studies from other types of reviews.
The protocol of systematic literature review has five stages. Figure 1 depicts the five steps that were followed here.

A. QUESTION FORMULATION
To narrow the research target, our systematic review addresses six key research questions (RQs). Table 1 displays the RQs along with their motivations.

B. SEARCH STRATEGY
The keywords used to answer the above RQs were as follows: software, code, program, obfuscation, obfuscate, obfuscator, measurement, measure, metrics, and metric. By taking these keywords and combining them with the ''AND'' and ''OR'' logical connectors, we used the following search command to find relevant articles from the academic libraries under consideration:
(software OR program OR code) AND (obfuscation OR obfuscate OR obfuscator) AND (measurement OR measure OR metric OR metrics)
By using these strings, we can find studies on obfuscation measurement. The above search strings did not aim to search for certain measures applied at a specific level.

C. SOURCE SELECTION
The five sources with which our systematic literature review was carried out are SpringerLink, IEEE Xplore, ScienceDirect, Wiley Online Library, and ACM Digital Library. All of these sources include the important journals and conferences in the field, in which significant proposals for obfuscation metrics are presented. It is therefore likely that the primary studies include a number of the works presented in these sources.
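The search command from the search strategy can be assembled mechanically from its three keyword groups. The following is a minimal sketch, assuming a plain Boolean syntax; the helper names `build_query` and `matches` are hypothetical, and each digital library applies its own query syntax, tokenization, and stemming, so the local matcher only approximates what a library search would return.

```python
# Keyword groups copied from the search strategy; the function names are
# hypothetical helpers introduced only for this illustration.
KEYWORD_GROUPS = [
    ["software", "program", "code"],
    ["obfuscation", "obfuscate", "obfuscator"],
    ["measurement", "measure", "metric", "metrics"],
]


def build_query(groups):
    """Join synonyms with OR inside each group, then join the groups with AND."""
    clauses = ["(" + " OR ".join(group) + ")" for group in groups]
    return " AND ".join(clauses)


def matches(text, groups):
    """Return True if the text has at least one keyword from every group.

    Matching is on whitespace-separated, lowercased words only; this is an
    assumption of the sketch, not the behavior of any specific library.
    """
    words = set(text.lower().split())
    return all(any(keyword in words for keyword in group) for group in groups)


query = build_query(KEYWORD_GROUPS)
print(query)
```

Keeping the synonyms in separate groups makes it easy to see that a record must hit all three concepts (software, obfuscation, and measurement) to be retrieved.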

D. STUDIES SELECTION
Our systematic literature review procedure is iterative and incremental. It is iterative because the systematic review is executed on one search source first, and then on another one. It is incremental because the systematic review document grows with each iteration until it becomes the definitive one, so that its implementation proceeds from an initial source to more sources until the entire review has been performed. As the inclusion criterion, we included studies based on a review of the title, abstract, and keywords of the articles acquired in the search. Moreover, the systematic review included studies that were written in English and published between 2010 and April 2021. According to [8], most cyber security threats and crimes were reported after 2007.
We applied the following exclusion criteria to the considered abstracts/titles:
• To exclude any paper that does not present an empirical study. We classified the empirical studies into three research methodologies: experiment, case study, and simulation. In the primary studies, simulation was mostly used for validation.
• To exclude any paper that does not discuss measuring the obfuscation quality, as this is our area of study.
• To exclude MS/PhD theses, posters, technical reports, and short papers (with fewer than three pages).
• To exclude the repeated studies.
If we were unsure about a paper after reading its title and abstract, we read the whole paper. The procedure for retrieving data for the selected studies is shown in Figure 2. Using this procedure, 302 studies were found; 67 of those were selected as primary studies based on the inclusion and exclusion criteria. The specifics of each iteration are shown in Table 2. References cited within the chosen studies were not followed up.
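The inclusion and exclusion criteria above can be read as a single screening predicate applied to each retrieved record. The sketch below is illustrative only: the `Record` fields, the category labels, and the `screen` function are hypothetical names chosen for this example, and the actual screening in the review was done manually by reading titles, abstracts, and, where needed, full papers.

```python
from dataclasses import dataclass

# Labels below are assumptions made for this sketch.
EMPIRICAL_STRATEGIES = {"experiment", "case study", "simulation"}
EXCLUDED_TYPES = {"thesis", "poster", "technical report"}


@dataclass(frozen=True)
class Record:
    title: str
    year: int
    month: int
    language: str
    strategy: str            # empty string when the paper is not empirical
    measures_quality: bool   # does it discuss measuring obfuscation quality?
    doc_type: str
    pages: int


def in_window(record):
    """Published between 2010 and April 2021, inclusive."""
    return (2010, 1) <= (record.year, record.month) <= (2021, 4)


def screen(records):
    """Return the records that pass every inclusion and exclusion criterion."""
    selected, seen_titles = [], set()
    for record in records:
        key = record.title.strip().lower()
        keep = (
            record.language == "English"
            and in_window(record)
            and record.strategy in EMPIRICAL_STRATEGIES
            and record.measures_quality
            and record.doc_type not in EXCLUDED_TYPES
            and record.pages >= 3          # short papers have fewer than three pages
            and key not in seen_titles     # repeated studies are dropped
        )
        if keep:
            selected.append(record)
            seen_titles.add(key)
    return selected
```

Applied once per source, such a predicate mirrors the iterative procedure: each new library adds candidates, and the duplicate check removes studies already selected from an earlier source.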

E. DATA EXTRACTION
As shown in Figure 2, once the relevant studies were selected, the needed information was extracted. An information extraction form (available in Appendix A) was designed to retrieve information from the primary studies. It was reviewed and agreed on by all three authors. Besides the data about the selected paper (e.g., title, authors, publication year, and publication country), the form recorded data on the quality assessment of the article, such as the two main types of threats to validity: internal validity and external validity. 1 Each threat was scored ''yes'' or ''no'' depending on whether the study explicitly explored the possibility of threats (internal or external validity). Most of the primary studies (59 out of 67) did not mention the internal and external validity, i.e., did not score ''yes'' for both attributes.

F. QUALITY ASSESSMENT
We used the strategy described in Kitchenham and Brereton [17] and Kitchenham et al. [19] to evaluate the quality of the papers. The same approach has also been used in several SLR studies [98]. Specifically, to ensure that the included studies contribute significantly to the SLR, we developed a checklist criterion based on the guidelines proposed in Kitchenham and Brereton [17] and Kitchenham et al. [19]. The checklist criterion consists of five quality appraisal questions for judging the quality of the studies, as shown in the table (Appendix A). In addition to the information extraction form (available in the appendix), which was reviewed and agreed on by all three authors, we assessed the quality of each primary study by means of two attributes: internal validity and external validity. Each was scored 'yes' or 'no' depending on whether the study explicitly explored the possibility of threats (internal or external validity). In our study, most of the primary studies (59 out of 67) did not score 'yes' for both threats.
1 Internal validity is the extent to which the independent variable affects causality. External validity is a condition that limits the ability to generalize the study's results [21].

IV. RESULTS AND DISCUSSION
First, we needed to find the most recent research trends in the field of measuring obfuscation strength in the security community. The distribution of studies by year is shown in Figure 3. There is a notable increase in the number of studies that deal with the subject. It shows that the number of papers related to this area increased after 2010 (when most cyber issues were reported [8]). The year 2019 accounts for the largest proportion of studies. The slight downward trend in 2020 stems from the fact that not all empirical studies conducted in 2020 had been published at the time of this review.

A. ANALYSIS OF THE PUBLICATION VENUE (RQ1)
The first aspect of this article focuses on addressing RQ1, i.e., the publication venues of the selected studies that play an active part in the area of obfuscation measurement. For this analysis, we chose five digital libraries as the main venues, as described in Table 2. The selected studies were published in three types of publications, namely conferences, journals, and workshops/symposiums. Table 3 presents the distribution of the primary articles based on the source type. The proportion of studies published in conferences and journals was (c. 49, 73%), while (c. 18, 27%) were published in workshops and symposiums.
The results in Table 3 show that (c. 58, 87%) of the primary papers were retrieved from the IEEE and ACM libraries. The two libraries contributed almost the same number of papers (c. 29, 43% each). However, the IEEE library contains more conference papers than the ACM library (17 versus 8). In addition, all workshop and symposium papers were extracted from the ACM and IEEE libraries except one paper from Springer. As in previous systematic review studies [8], [20], the contribution of the three other libraries (Springer, Wiley Online, and ScienceDirect) is smaller than that of the IEEE and ACM libraries. In these three libraries, almost all of the primary studies relevant to obfuscation measurement were published in journals. Among them, ScienceDirect contributed the most papers, with (c. 5, 7%) of the studies. Springer and Wiley Online took second and third place with (c. 3, 4%) and (1, 1%), respectively.

B. ANALYSIS BASED ON ACTIVE COUNTRIES (RQ2)
We used the authors' affiliations to rank the active countries regarding research on the measurement of obfuscation quality, i.e., to answer RQ2. When a study had more than one author, the first author's country was selected. Figure 4 shows the ten authors' affiliation countries, as described in the primary studies. The results (for RQ2) indicate that the four most active countries for the primary studies, the United States of America (USA), Germany, South Korea, and Singapore, contributed (c. 9, 13%), (c. 8, 12%), (c. 6, 9%), and (c. 5, 7%), respectively. They were followed by authors from China, Australia, and Iran, who contributed (c. 4, 6%) each. Japanese and Indian researchers came after that with a (c. 3, 4%) share. The remaining studies were conducted in different countries with a frequency of between one and two studies, as shown in Figure 4. By continent, Asia leads the statistics with a (c. 30, 45%) share, followed by the United Kingdom (UK) and Europe, and the USA and Canada, which contributed (c. 19, 28%) and (c. 10, 15%), respectively. Africa and Australia were the continents least represented in the selected articles, with a (c. 4, 6%) share each. These results are shown in Figure 5. About 60% of all affiliations are accounted for by seven countries; therefore, the research is concentrated in a limited number of regions. This illustrates the need for more research on software obfuscation from different countries to investigate the effect of sociocultural differences.

C. ANALYSIS BASED ON EMPIRICAL STRATEGY AND DATASET (RQ3 & RQ3.1)
As the study is focused on empirical studies, the studies that conducted an empirical validation of the results were chosen. The results in Table 5 (for RQ3) show the distribution of articles based on the research strategy used; experimentation was the most common strategy, used in (c. 51, 75%) of the studies. The second most common methodology was simulation, with (c. 16, 24%) of the studies, and the least common strategy was the case study, with a (c. 1, 1%) share. The total frequency is 68 studies because two different strategies, namely experimentation and simulation, were used in a single study [63]. Table 5 shows that all papers used experimentation or simulation methodologies except for one paper [27] that used a case study methodology. An explanation for the lack of case studies might be that researchers cannot make meaningful generalizations from this methodology type because it is believed not to provide sufficient data to allow generalizability. Moreover, working with a case study from real-world industry is a common difficulty in internet technology research [84] due to confidentiality issues, as details about the participant's organization may be published, i.e., the obfuscation techniques used by their organization. That said, this methodology type would generally be high in external validity [21].
Table 6 shows the main sources of the datasets that have been adopted by researchers (RQ3.1). Column one of Table 6 lists the dataset source categories that were identified, and columns two and three show the frequency and percentage with which each dataset source category appeared in the primary articles, respectively. Six key sources have been identified: in-the-wild, manually written programs, open-source software, benchmark suites, historical data/previous projects, and web pages. The most widely used dataset source is in-the-wild with a (c. 33, 49%) share, followed by manually written programs, open-source software, benchmark suites, and historical data/previous projects, with (c. 10, 15%), (c. 9, 13%), (c. 8, 12%), and (c. 5, 7%), respectively. The web page-based dataset source was the least widely used source in the primary studies of this systematic literature review, with a (c. 3, 4%) share. The last column lists examples of the corresponding source category with the study number. Again, the total frequency is 68 studies because two different dataset sources, in-the-wild and historical data/previous projects, were used in a single study [77]. The ''in-the-wild'' dataset was the most commonly used dataset in the selected studies. It is continuously updated and maintained [85]; it contains many samples and diverse categories. Google Play Store, 2 VX Heavens, 3 F-Droid, 4 and VirusTotal 5 are examples of such a dataset. Most open-source software was downloaded from the SourceForge 6 and GitHub 7 repositories. Most manually written programs that were used as a dataset were written in the C language [22]-[24], [26], [28]-[30], [44], [47]-[50]. C, Java, and Python are the most common programming languages. The program to generate prime numbers was used in three studies [23], [39], [44].
Table 7 shows the main categories of measurements of obfuscation strength (RQ4) that were found in our review. Column one of Table 7 lists the quality attributes that were found in our study. Column two defines the attribute and its key features, column three shows the corresponding primary study numbers, and columns four and five show the frequency and percentage of each attribute. The total frequency exceeds the number of primary studies (i.e., 67) because a single primary study may measure more than one quality attribute. The key quality attributes that were identified in our study are as follows: potency, resilience, cost, stealth, and similarity.
The most widely used obfuscation attribute in the systematic literature review is similarity (c. 36, 54%), followed by cost and potency. The number of studies that measured these two attributes is almost the same (c. 21, 31% and 20, 30%, respectively). The least widely used quality attributes in our systematic literature review are resilience and stealth (c. 9, 13% and 8, 12%, respectively).
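The shares quoted above follow from dividing each attribute's frequency by the 67 primary studies. The short sketch below reproduces that arithmetic from the counts given in the text; the variable and function names are hypothetical, and rounding to whole percentages is an assumption about how the figures were reported.

```python
TOTAL_PRIMARY_STUDIES = 67

# Frequencies as reported in the text for each quality attribute.
attribute_counts = {
    "similarity": 36,
    "cost": 21,
    "potency": 20,
    "resilience": 9,
    "stealth": 8,
}


def share(count, total=TOTAL_PRIMARY_STUDIES):
    """Percentage of primary studies, rounded to the nearest whole number."""
    return round(100 * count / total)


shares = {name: share(count) for name, count in attribute_counts.items()}

# The counts add up to more than 67 because one study can measure
# several attributes, so the percentages do not sum to 100.
total_mentions = sum(attribute_counts.values())

for name, pct in shares.items():
    print(f"{name}: {attribute_counts[name]} studies, {pct}%")
print("total attribute mentions:", total_mentions)
```

Because the denominator is the number of studies rather than the number of attribute mentions, each percentage can be read as "the fraction of primary studies that measured this attribute".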

D. CATEGORIZATION OF QUALITY ATTRIBUTES OF OBFUSCATION (RQ4)
Although software obfuscation was first developed at the end of the 1990s by Collberg et al. [1], [90], potency, resilience, cost, and stealth remain the most used obfuscation quality attributes, despite the growing number of security breaches in the last decade.
Another observation is that over half of the primary studies focus on the similarity attribute. The reason for this high number might be a serious concern regarding source code plagiarism in academia, i.e., the aim of preventing any act of copying a student's source code without formal approval. In fact, 72.5 percent of university students confessed to cheating at least once [91]. Several academic institutes worldwide then developed source code similarity detection tools, such as Stanford University in the USA, Karlsruhe Institute of Technology in Germany, the University of Sydney in Australia, and the Vrije Universiteit Amsterdam in the Netherlands, which developed MOSS, 8 JPlag, 9 Sherlock, 10 and SIM, 11 respectively. Several authors relied on such tools to distinguish malware programs from benign programs or malware variants from known variants [7], [30], [42], [43], [47]. Most of the similarity approaches in these studies treat the program as a sequence of bytes. They typically analyze the source code structure, such as the Control Flow Graph (CFG), and contrast it with another source code to find similarities between the two codes. Although existing similarity-based studies generally operate on text strings [59], [82], some studies used more sophisticated methods to handle tokens instead of text [36], [57], [82]. Meanwhile, other similarity approaches tried to detect semantic similarity in source code using a dependency graph [3], [60]. Others attempted to eliminate a large portion of the code that is less relevant to a similarity comparison [30]. However, complex obfuscation techniques remain difficult to handle [11].
2 https://play.google.com/store
3 https://vx-underground.org/archive/VxHeaven/index.html
4 https://www.f-droid.org
5 https://www.virustotal.com/gui
6 https://sourceforge.net
7 https://github.com
The existing similarity approaches differ in their effectiveness at identifying the obfuscated code generated by obfuscation techniques, as expressed by the similarity score between the non-obfuscated code and its obfuscated/plagiarized version. For this, most of these studies used performance metrics as a sub-measure to estimate the similarity attribute (RQ5 will be answered in the next section).
8 https://theory.stanford.edu/~aiken/moss
9 https://jplag.ipd.kit.edu/
10 https://github.com/diogocabral/sherlock
11 https://dickgrune.com/Programs/similarity_tester/
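The performance metrics mentioned above are not enumerated at this point in the text. As a minimal sketch, assuming they are the usual precision, recall, and F1-score, the example below scores a hypothetical similarity detector that flags a code pair as "obfuscated copy" when its similarity score reaches a threshold. The scores, labels, threshold, and function name are all invented for illustration and do not come from any primary study.

```python
def precision_recall_f1(actual, predicted, positive=True):
    """Compute precision, recall, and F1 for one class from paired labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical similarity scores between original programs and candidates.
similarity_scores = [0.91, 0.85, 0.72, 0.77, 0.30, 0.65]
# Hypothetical ground truth: is the candidate an obfuscated copy?
is_obfuscated_copy = [True, True, False, True, False, True]

THRESHOLD = 0.7  # assumed decision threshold
predicted = [score >= THRESHOLD for score in similarity_scores]

precision, recall, f1 = precision_recall_f1(is_obfuscated_copy, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this setting, an effective obfuscation technique lowers the detector's recall, since obfuscated copies fall below the threshold and go undetected.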

E. MEASURE-BASED ANALYSIS (RQ5)
The fifth research question (RQ5) was framed to find out the measures of obfuscation quality attributes. As shown in Table 8, the widely used measures are complexity (to quantify potency), human effort (to quantify resilience), efficiency (to quantify cost), and multiclass performance metrics, distance measures, and matching methods (to quantify similarity). Because each measure could be used to control the considered quality attribute at different levels, we have classified them into 44 different sub-measures, as shown in column three.
Column four lists the primary study numbers, while the frequency and percentage of occurrence for each sub-measure as they appeared in the primary studies are shown in the last two columns (columns five and six). This frequency is illustrated in Figure 6. According to our extracted data, some researchers used more than one measure to quantify a single quality attribute; for example, study [26] used four complexity sub-measures to quantify the potency, and study [49] used two different sub-measures of human effort to estimate the resilience. Therefore, the total frequency is 84 (i.e., greater than the number of primary studies) because some studies used more than one sub-measure.
According to our literature review, the McCabe cyclomatic number 12 and program object-oriented (OO) metrics 13 are the two most common measures for complexity evaluation, i.e., to quantify the obfuscation potency (c. 7, 10.4% and 4, 6%, respectively). Obfuscation tends to operate on the opposite side of the refactoring principle [94]. While refactoring generally aims to decrease the code complexity and coupling, 14 the obfuscators should propose techniques to increase both metrics (complexity and coupling). To this end, some obfuscation techniques apply the opposite mechanisms to the code to make it difficult to analyze. Those techniques then decrease the two metrics, most likely with the ultimate goal of obstructing interpretation in some other way. Class Splitter 15 is an example of such techniques [25].
Time and space overheads are the two most common measures for efficiency evaluation, i.e., to estimate the obfuscation cost (c. 16, 23.9% and 6, 9%, respectively). In contrast, six sub-measures are used to evaluate human effort, i.e., to quantify the obfuscation resilience (c. 1, 1.5%, each). A common factor among the human effort sub-measures is the effort by both the programmer/reverse engineering expert and the attacker. However, the real effort required for reverse engineering is not easy to measure because of the varying experience and skills of the people involved (programmer or attacker). It may take some attackers longer than others to analyze the same code.
In the case of the similarity attribute, three different sets of measures were used in the evaluation, as follows:
• Multiclass performance metrics (see Table 8).
• Distance measures (c. 23, 34.3%). Most of these measures were entropy-based measures (e.g., Shannon entropy), followed by a cosine measure and longest common subsequence.
• Matching algorithms (c. 3, 4.5%), such as string matching.
The remaining sub-measures and their frequency are shown in Table 8 and Figure 6. The reader is assumed to be familiar with these measures. Interested readers can consult the relevant studies for further information. Although several of the extracted studies [3], [24], [27], [40], [57], [58], [71], [73] mentioned the stealth quality attribute, the authors were not clear on how to measure it. While one study considered the quality of input to be an indicator of stealth quality [40], two other studies used multiclass performance metrics to measure stealth [58], [71].
12 The cyclomatic number is computed from the program's CFG as e - n + 2, where the CFG has e edges and n nodes. It was defined by McCabe [92]. Although it was originally proposed for procedural programming languages, its adoption in OO languages has often been discussed [25].
13 The OO metrics were proposed by Chidamber and Kemerer [93].
14 Coupling is the degree of inter-dependence between modules. In contrast, cohesion is the degree of intra-dependence in a single module. From a software quality perspective, low coupling and high cohesion are two signs of a good design [95].
15 Class Splitter splits the non-obfuscated classes into obfuscated ones by inserting dummy classes. The rationale for this idea is that class complexity increases with the depth of its inheritance tree [1], [96].
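The three distance measures named in the list above (Shannon entropy, the cosine measure, and longest common subsequence) can each be computed in a few lines. The following is a minimal sketch using only the Python standard library; the function names and the choice of inputs (raw bytes for entropy, token lists for the cosine measure) are assumptions of this example, and individual primary studies may define and normalize these measures differently.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte of the byte-value distribution of `data`."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cosine_similarity(tokens_a, tokens_b) -> float:
    """Cosine of the angle between the token-frequency vectors of two programs."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lcs_length(seq_a, seq_b) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    previous = [0] * (len(seq_b) + 1)
    for x in seq_a:
        current = [0]
        for j, y in enumerate(seq_b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


# Hypothetical token streams of an original and a renamed (obfuscated) snippet.
original_tokens = ["int", "x", "=", "a", "+", "b", ";"]
obfuscated_tokens = ["int", "v1", "=", "v2", "+", "v3", ";"]
print(cosine_similarity(original_tokens, obfuscated_tokens))
print(lcs_length(original_tokens, obfuscated_tokens))
```

Token-level measures such as these are sensitive to identifier renaming, which is one reason some of the reviewed approaches move to structural or semantic representations.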
The last observation about the measures is their granularity level. There are three granularity levels [95], as follows:
• Fine grain: the measure works at the variable and statement levels.
• Medium grain: the measure works at the function or method level.
• Coarse grain: the measure works at the program level.
Herein, the results indicate that most of the measures used to quantify obfuscation quality attributes are medium- and coarse-grained, with (c. 24, 38%) and (c. 26, 41%), respectively. This is practical for large programs that use higher-level structures. Such programs were used as the input for the obfuscation technique in most of the studies; fine-grained measures are unsuitable for quantifying attributes in large systems. However, a fine-grained measure may be appropriate when the dataset source is ''manually written programs'' (see Table 6).

F. RESEARCH DIRECTIONS (RQ6)
This review reveals gaps in the current literature that need more research. Such research will improve obfuscation strength and quality. There are four directions for further research (i.e., answering RQ6):
• Developing a standard measure that can estimate or quantify multiple qualities.
• According to Table 8 and Figure 6, which show the current measures along with their frequency of occurrence, there is a lack of measures to quantify obfuscation stealth, although some studies claimed that they measured stealth using very general measures. The researchers did not address this quality as clearly as they addressed the other qualities. A possible reason might be that code that is stealthy in one program is not necessarily stealthy in another program; stealth is highly context-sensitive.
• There is a need to investigate obfuscation quality by performing industrial case studies, i.e., using real-world datasets. As shown in Table 5, this type of empirical strategy is rare compared to experiments and simulations, which use public datasets and open-source code.
• There is little research (3 out of 67 studies [23], [26], [39]) that addresses the obfuscation cost issues when adopting the parallel processing mechanism. The need for this research arises when OO languages are not used. In such cases, parallels can be drawn with data structures, for instance, by measuring the data structures used by multiple functions [26].

V. THREATS TO VALIDITY AND LIMITATIONS
Like any research work, this review has some limitations that may affect the results. According to [17], the quality assessment of publications in systematic reviews is still an issue. The following are two limitations related to this systematic literature review:
• The extraction process may have resulted in some inaccuracies or bias. Although the chosen databases cover the relevant publications in the obfuscation technique domain, some studies have not been included because they are published in other databases.
• Missing some keywords in the search string might be another threat. Although we sought to cover all the relevant keywords, there is still a possibility that some were missed.

VI. CONCLUSION
In this study, the state-of-the-art measures for quantifying software obfuscation quality according to several attributes were summarized. A systematic literature review was performed using 67 studies published from 2010 to April 2021. After analyzing each publication in detail, we found the following: (1) the contribution of the IEEE and ACM libraries is greater than that of the other libraries in this area of research, and two venues, Journal of Computers & Security and IEEE/ACM International Workshop on Software Protection, were the most common publication venues; (2) researchers from the USA, Germany, South Korea, and Singapore were the most active; (3) the most common empirical strategy was experimentation followed by simulation, and in-the-wild datasets (e.g., Google Play) were the most widely used datasets, followed by manually written programs (e.g., programming assignments at a university, such as prime number generation and matrix multiplication); (4) the five key obfuscation quality attributes that were the most discussed in the primary studies were potency, resilience, cost, stealth, and similarity, of which similarity was the most cited attribute, followed by cost and potency; and (5) the most widely used measures were complexity for potency, human effort for resilience, efficiency for cost, and the multiclass performance metrics, distance measures, and matching methods for similarity. Because each measure could be understood from different angles, the measures were classified into 44 sub-measures, as shown in Table 8. The McCabe cyclomatic number and OO metrics were the key sub-measures used to estimate the complexity (i.e., potency), while time and space overheads were the key sub-measures used to quantify the efficiency (i.e., cost). The sub-measures that were used to evaluate the human effort were based on the time spent by the programmer and attacker.
Moreover, three different sets of measures were used to evaluate the similarity: distance measures (e.g., entropy-based measures), matching algorithms (e.g., function name matching), and multiclass performance metrics such as F1-score, precision, and recall.

APPENDIX A
See Table 9.