Deoptfuscator: Defeating Advanced Control-flow Obfuscation Using Android Runtime (ART)

Code obfuscation is a technique that makes it difficult for code analyzers to understand a program by transforming its structures or operations while maintaining its original functionality. Android app developers often employ obfuscation techniques to protect the business logic and core algorithms inside their apps against reverse engineering attacks. On the other hand, malicious app writers also use obfuscation techniques to avoid being detected by anti-malware software. If malware analysts can mitigate the code obfuscation applied to malicious apps, they can analyze and detect the malicious apps more efficiently. This paper proposes a new tool, Deoptfuscator, to detect an obfuscated Android app and to restore the original source code. Deoptfuscator detects an app whose control flow is obfuscated by DexGuard and tries to restore the original control flow. Deoptfuscator deobfuscates in two steps: it determines whether a control-flow obfuscation technique is applied and then deobfuscates the obfuscated code. Through experiments, we analyze how similar a deobfuscated app is to the original one and show that the obfuscated app can be effectively restored to one similar to the original. We also show that the deobfuscated apps run normally.

showed that obfuscated malicious apps could evade anti-malware systems. They made their own tool to obfuscate Android apps, obfuscated malicious Android apps, and uploaded the apps to the VirusTotal system [18] to check if the apps could be accurately classified as malicious. The test results showed that the performance of detecting obfuscated malicious apps was significantly lower than that of detecting the original malicious apps to which obfuscation was not applied. Therefore, in order to effectively detect an obfuscated malicious app, it is necessary to deobfuscate it first.
There are several forms of obfuscation techniques for Android apps: identifier renaming, control-flow obfuscation, string encryption, class encryption, API hiding (Java reflection), etc. We focus on control-flow obfuscation and its deobfuscation in this paper.
We implement a new Android deobfuscation tool, Deoptfuscator, to determine whether the control-flow of an Android app is obfuscated by DexGuard, and then to deobfuscate the control-flow obfuscated apps. We also evaluate the performance of Deoptfuscator with respect to ReDex.
Among various issues related to Android deobfuscation techniques, we try to answer the following three research questions.
RQ1 How to detect and determine whether the control-flow of a given Android app is obfuscated or not?
RQ2 How to effectively deobfuscate a control-flow obfuscated app?
RQ3 How can we confirm that our deobfuscation approach was really successful?
In summary, the main contributions of this paper are the following:
• Deoptfuscator is the first tool for Android apps to detect and deobfuscate high-level control-flow obfuscation patterns of DexGuard.
• The effectiveness of Deoptfuscator is demonstrated by checking whether the deobfuscated app runs the same as the original app to which control-flow obfuscation was not applied.
• The source code of Deoptfuscator has been published in a public repository on GitHub. Thus, it can be freely accessed and used by anyone [19], [20].
Our paper is organized as follows: Section 2 describes the characteristics and patterns of control-flow obfuscation and ART (Android Runtime) in Android. Section 3 explains the design and implementation of Deoptfuscator, and its deobfuscation strategy. Section 4 presents the experimental method to evaluate Deoptfuscator, and Section 5 evaluates its performance. Section 6 describes the related studies and discusses the limitations of our study. Finally, Section 7 concludes this work.

II. BACKGROUND
Obfuscation is a technique that increases the time and cost required for program analysis while keeping the program's functionality. Suppose an original program $P$ is transformed (obfuscated) to $P'$ using a transformation technique $T$ ($P \xrightarrow{T} P'$). Then, the functionality of $P$ and $P'$ is the same, but the analysis complexity of $P'$ is much higher than that of $P$ [10]-[17], [21]-[24]. Popular obfuscation tools for Android apps include R8 [25], a compiler suite that incorporates ProGuard's [26] obfuscation functions, DashO [27], DexProtector [28], and DexGuard [29].
Obfuscation techniques can be classified into four types as follows.
• Identifier renaming changes the names of identifiers such as packages, classes, methods, and variables to meaningless symbols.
• String encryption encrypts and stores string literals, and decrypts them at runtime, restoring the original strings.
• Control-flow obfuscation changes the control-flow of a program by inserting dummy codes or exception handling codes (try-catch blocks), modifying branch/condition statements, etc.
• Reflection obfuscation hides the names of invoked methods using Java reflection (a.k.a. API hiding).

A. CONTROL-FLOW OBFUSCATION
Control-flow obfuscation is a technique that hinders efficient program analysis by inserting dummy codes and exception handling codes, or modifying branch/condition statements, consequently complicating the order of code execution or function invocation. However, control-flow obfuscated codes can be simplified or removed by modern compilers. Recent compilers, such as the R8 compiler, are equipped with excellent optimization techniques that can remove unnecessary codes [15]. Opaque predicates and opaque variables are useful for effective control-flow obfuscation. An opaque predicate is a conditional expression composed of complex operations, so that it is difficult to tell statically whether the result of the expression is true or false. The result of an opaque predicate becomes known only at runtime. An opaque variable is a variable used in opaque predicates [21]-[24], [30]-[35].
The usage pattern of the opaque variable and opaque predicate can divide the code obfuscation into three levels. The higher the level, the harder it is for the optimization tool to remove the obfuscated code.
1) LEVEL 1
Level 1 control-flow obfuscation has the following form.
• Opaque variables are declared as local variables.
• Opaque predicates test whether an opaque variable is identical to a constant.
Fig. 1 shows an example of level 1 control-flow obfuscation. Fig. 1(a) is the obfuscated source code, where 'a' and 'b' are opaque variables and 'b == 1' and 'a == 2' are opaque predicates. Since the conditional expressions at lines 8 and 9 are always false, the compiler removes the conditional statements and the two local variables while the functionality of the method 'Obfuscation_1()' is not changed. Fig. 1(b) is the Dalvik bytecode compiled from the source code by the R8 compiler. It shows that the method 'Obfuscation_1()' does nothing and returns immediately.
VOLUME 4, 2016 This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
(a) Level 1 control-flow obfuscated Java source codes
(b) Bytecodes compiled from the Java source codes
FIGURE 1: Example of level 1 control-flow obfuscation. The opaque variables and opaque predicates can be removed by R8.
2) LEVEL 2
Level 2 control-flow obfuscation has the following form.
• Opaque variables are declared as local variables.
• Opaque predicates consist of mathematical operations (e.g. positive/negative decision, odd/even decision, ...)
Fig. 2 shows an example of level 2 control-flow obfuscation. It differs from level 1 control-flow obfuscation in that the two opaque predicates employ modulo operations ('b % 128 == 1' and 'a % 64 == 0') instead of just comparing opaque variables with a constant. Again, since the opaque predicates at lines 8 and 9 are always false, a compiler that optimizes the code perfectly produces bytecode that does nothing and just returns. However, when the source code (Fig. 2(a)) is compiled by the R8 compiler with the default options, the produced bytecode still contains the logic for the opaque predicates (Fig. 2(b)). ReDex, an Android app optimization tool, can remove the local opaque variables and the simple opaque predicates. Fig. 2(c) shows the bytecode produced by ReDex. The resulting bytecode does nothing and returns immediately.
3) LEVEL 3 (Advanced Control-flow Obfuscation)
Level 3 control-flow obfuscation has the following form.
• Opaque variables are declared as global variables.
• Opaque predicates consist of mathematical operations (e.g. positive/negative decision, odd/even decision, ...)
Level 3 control-flow obfuscation is also called advanced control-flow obfuscation. Even the optimizers of recent compilers cannot easily optimize level 3 obfuscation. Fig. 3 shows an example of level 3 control-flow obfuscation. In Fig. 3(a), the opaque variables ('g_a' and 'g_b') are global within class test. Although the opaque predicates at lines 8 and 9 are always false, neither the R8 compiler nor ReDex removes the opaque variables and the opaque predicates. Fig. 3(b) and Fig. 3(c) show the Dalvik bytecodes optimized by the R8 compiler and ReDex, respectively. In this example, the two Dalvik bytecodes produced by the R8 compiler and ReDex are exactly the same.
ReDex does not remove global opaque variables. Since a global variable may be used in several methods, ReDex regards global variables as non-opaque variables. To deobfuscate level 3 control-flow obfuscated codes, we should remove global opaque variables. If a global variable is used only in a method and in opaque predicates, the global variable and the predicates can safely be removed.
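The difference between the three levels can be illustrated with a toy constant folder. The Python sketch below is not how R8 or ReDex are implemented; it assumes an idealized optimizer that knows the constant values of local variables and treats every other variable (for example a static field) as unknown. The function name `fold_predicate`, the variable names, and the constant values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def fold_predicate(var: str, modulus: Optional[int], expected: int,
                   local_consts: Dict[str, int]) -> Optional[bool]:
    """Try to decide `var == expected` (or `var % modulus == expected`)
    at compile time. Returns None when the value cannot be decided."""
    if var not in local_consts:
        # Not a known local constant (e.g. a global/static field): a purely
        # local analysis cannot know what other methods stored there.
        return None
    value = local_consts[var]
    if modulus is not None:
        value %= modulus
    return value == expected


# Hypothetical constants; the figures in the paper may use other values.
locals_known = {"a": 0, "b": 0}

level1 = fold_predicate("b", None, 1, locals_known)    # b == 1
level2 = fold_predicate("b", 128, 1, locals_known)     # b % 128 == 1
level3 = fold_predicate("g_b", 128, 1, locals_known)   # g_b % 128 == 1

print(level1, level2, level3)
```

In this sketch the level 1 and level 2 predicates both fold to False, while the level 3 predicate stays undecided because its operand is not a local constant. This mirrors the motivation for turning global opaque variables into local ones before running an optimizer.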

B. ANDROID RUNTIME (ART)
Ahead-of-Time (AOT) compilation statically translates code before an app is executed, while Just-in-Time (JIT) compilation dynamically translates code during runtime [36]. AOT converts all code to machine code at installation time, so app installation is slow compared to JIT. JIT converts frequently used bytecode to machine code during runtime, so app installation is fast compared to AOT.
From Android 5.0 (Lollipop), the Dalvik virtual machine (DVM) was replaced with Android runtime (ART). In Lollipop, the AOT compiler compiles an app to native code on installation and reduces the runtime overhead associated with the JIT compiler [36]-[39]. This approach incurs longer app installation and compilation times. Users experience the latency on every app installation or over-the-air (OTA) software update. The size of the app also increases by 1.5 times on average, and up to 2 times, compared to JIT compilation. The advantages and disadvantages of DVM and ART are described in Table 1.
ART was further improved to take advantage of both JIT and AOT as well as to provide flexibility for app development and execution. Android 7.0 adopted a combination of interpretation, JIT compilation, and AOT compilation [36], [37]. An app starts off being JIT compiled, and then frequently invoked methods are AOT compiled to native code based on profiling data from the app's execution. Modern ART includes a JIT compiler with code profiling. The JIT compiler complements ART's AOT compiler, reduces storage space, and speeds up app updates. ART also improves the AOT compiler by avoiding recompilation of apps during over-the-air (OTA) updates or system slowdown during automatic app updates. ART's dex2oat is an on-device compiler suite with several compilation backends, code generators for hardware platforms, etc. It is responsible for the validation of apps and their compilation to native code [32]. Fig. 4 shows the compilation process of the dex2oat compiler using the optimizing backend. When a .dex file in an APK is given as input to dex2oat, it checks the validity of the input file (.dex). Then, the code in the .dex file is converted into an .oat file through the Hydrogen Intermediate Representation (HIR). The .oat file is the AOT binary for the .dex file. The HIR, also called the optimizing backend's intermediate representation (IR), is a control-flow graph (CFG) on the method level, which is denoted as HGraph. The HGraph is used as the single IR of the app code. When the HGraph is created, the dex instructions of the app's bytecode are examined one after another, and the corresponding HInstructions are generated and interconnected with the current basic block and the graph. The HGraph is then transformed into static single assignment (SSA) form for complex optimizations.
Typical optimizations using HGraph are as follows [43]-[47]:
By modifying the optimization part of ART, we develop three modules: the Opaque identification module to identify opaque variables, the Opaque location module to record the location of the identified opaque variables, and the Opaque clinit module to remove the opaque variables. The detailed description of these modules is given in Section III.

C. DEXGUARD'S CONTROL-FLOW OBFUSCATION
The Android tool DexGuard provides obfuscation equivalent to advanced control-flow obfuscation (level 3). This section describes the advanced control-flow obfuscation (level 3) used by DexGuard. Fig. 5 shows the transformation of Java source code when control-flow obfuscation is applied using DexGuard. The original code (Fig. 5(a)) is a simple onCreate() method without any operation instructions or branch/conditional statements, but the obfuscated code (Fig. 5(b)) contains several inserted operation instructions and branch/conditional statements. The variable i is used as part of the opaque predicate in the conditional expression of the if statement. Similar codes exist before the next switch-case statement. Through code analysis such as this, we can find that f66 affects f65 through simple arithmetic operations and a local variable (i). This shows that f65 and f66 are global opaque variables and are used in pairs. Also, it can be confirmed that the local variable i is an important variable that determines whether the opaque predicate in the conditional expression of a branch/conditional statement (if, switch-case) is true or false. DexGuard's control-flow obfuscation uses these patterns.

D. REDEX OPTIMIZER
Our proposed tool, Deoptfuscator, detects DexGuard's control-flow obfuscation patterns described in Section II.C, lowers their obfuscation level from Level 3 to Level 2, and then optimizes them using ReDex. ReDex is an Android bytecode optimizer developed by the Facebook engineering team and released as open source [12], [15], [48], [49]. It takes a dex file as input and outputs the dex file with optimized bytecode. ReDex uses several modules to optimize dex files. Of them, we are interested in the following:
• Inlining replaces a function call at the point of call with the body of the function being called, thus reducing the overhead of a function call.
• Dead code elimination (DCE) walks all branches and method invocations from the entry points of an app and removes any code that is unreachable.
• Peephole optimization replaces small code patterns with equivalent patterns that perform better. It performs a string search of the code for known inefficient sequences and replaces them with more efficient code. It can remove redundant load/store instructions and perform algebraic simplification, etc.
Each module can be processed independently of the others.
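The reachability idea behind DCE can be sketched in a few lines of Python. This is only an illustration at the granularity of whole methods, assuming a call graph given as a dictionary; it is not ReDex's implementation, which also walks branches inside methods. The function names and the example method names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def reachable_methods(call_graph: Dict[str, List[str]],
                      entry_points: Iterable[str]) -> Set[str]:
    """Return every method reachable from the entry points."""
    seen: Set[str] = set()
    stack = list(entry_points)
    while stack:
        method = stack.pop()
        if method in seen:
            continue
        seen.add(method)
        # Follow each invocation made by this method.
        stack.extend(call_graph.get(method, []))
    return seen


def eliminate_dead_methods(call_graph: Dict[str, List[str]],
                           entry_points: Iterable[str]) -> Dict[str, List[str]]:
    """Keep only the methods that are reachable from the entry points."""
    live = reachable_methods(call_graph, entry_points)
    return {m: callees for m, callees in call_graph.items() if m in live}


# Hypothetical app: unusedHelper is never called from an entry point.
graph = {
    "onCreate": ["initUi"],
    "initUi": [],
    "unusedHelper": ["initUi"],
}
pruned = eliminate_dead_methods(graph, ["onCreate"])
print(sorted(pruned))
```

Here `unusedHelper` is dropped because no path from `onCreate` reaches it, even though it calls a live method itself.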
By analyzing the characteristics of the optimization modules, we found that ReDex can effectively remove the control-flow obfuscation of Level 2 defined in Section II-A, while it cannot handle the advanced control-flow obfuscation (Level 3) directly.

III. DESIGN OF DEOPTFUSCATOR
We propose Deoptfuscator, a tool that can deobfuscate Android apps protected by advanced control-flow obfuscation. It can be used alone on a user's PC or as a part of the ART compilation process. Deoptfuscator consists of three modules: the Opaque identification module detects global opaque variables; the Opaque location module records the location of opaque variables detected by the Opaque identification module; and the Opaque clinit module changes the property of opaque variables appropriately.

Fig. 6 depicts the deobfuscation steps of Deoptfuscator.
Deoptfuscator proceeds in the following order.
1) Unpackaging: Given a control-flow obfuscated APK, Deoptfuscator unpackages the input APK using APKTool.
2) Detecting opaque variables: Using the Opaque identification module, it identifies the opaque variables.
3) Profiling detected opaque variables: Using the Opaque location module, it records the locations of the opaque variables detected in step 2 in json format.
4) Lowering obfuscation level: It changes the global opaque variables recorded in step 3 to local opaque variables, which lowers the obfuscation level from level 3 to level 2.
5) Optimizing DEX: Using ReDex, it removes the local opaque variables and opaque predicates.
6) Repackaging: It repackages the DEX file. The resulting APK is control-flow deobfuscated.

B. OPAQUE IDENTIFICATION
This section describes the process of the Opaque identification module of Deoptfuscator in detail using an example. Fig. 7 shows a part of method onCreate() which is control-flow obfuscated using DexGuard (Fig. 5(b)). In Fig. 7, f65 and f66 are global opaque variables, and i is a local variable used as a bridge between f66 and f65 and between opaque variables and opaque predicates. Variable i is also used in a conditional expression (an opaque predicate). Using a local variable as a bridge between global opaque variables and opaque predicates increases the program complexity and prevents compilers or optimizers from removing control-flow obfuscation.
Fig. 8 shows the HIR for the code snippet in Fig. 7. Deoptfuscator utilizes this HIR to analyze the variable usage pattern, remove global opaque variables effectively, and simplify the control-flow. In Fig. 8, 'pred' and 'succ' indicate the numbers of the basic blocks before and after the current basic block. BasicBlock 0 is the first basic block of method onCreate(), so it has no previous block and its subsequent block number is 1. BasicBlock 1 has block 0 as its previous block and can branch to block 9 or 10. The label of each instruction denotes the return data type of the instruction and the execution order in a method. The letters 'j', 'l', 'i', 'v', and 'z' stand for 'Java long', 'Java reference', 'Java int', 'Java void', and 'Java boolean', respectively. For example, 'i9: StaticFieldGet [l8]' means that this instruction gets a Java int variable from the field area of the class referred to by l8.
Fig. 9 shows the HIR instructions converted from the obfuscated Java source code in the example. We explain each Java statement (S1 to S4) and its corresponding HIR instructions in DexGuard's obfuscation pattern.

S1
Get a reference to the class (l8) that contains the current method (j7), and get the class variable f66 of the class (i9).

S2
Add f66 obtained from i9 and constant 125 (i10), and store the result in local variable i (i11).

S4
Perform modulo operation by dividing i (i11) by constant 2 (i16), and the result (i17) is compared with constant 0 (i18) by NotEqual operation (z19). The result of the NotEqual operation is used as the conditional expression of If operation (v20).
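The labels used in the statements above (a type letter followed by an instruction number) can be decoded mechanically. The following Python sketch only restates the letter-to-type mapping given in the text; the function name `decode_label` is hypothetical and not part of ART or Deoptfuscator.

```python
import re
from typing import Tuple

# Letter-to-type mapping as described in the text.
TYPE_PREFIXES = {
    "j": "Java long",
    "l": "Java reference",
    "i": "Java int",
    "v": "Java void",
    "z": "Java boolean",
}


def decode_label(label: str) -> Tuple[str, int]:
    """Split an HIR label such as 'i9' into (return type, order number)."""
    match = re.fullmatch(r"([jlivz])(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized HIR label: {label!r}")
    return TYPE_PREFIXES[match.group(1)], int(match.group(2))


for label in ("l8", "i9", "z19", "v20"):
    print(label, decode_label(label))
```

For instance, `decode_label("z19")` yields a boolean-typed instruction with order number 19, which matches the NotEqual instruction described in S4.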
Deoptfuscator analyzes the variable usage pattern based on HIR to detect global opaque variables. Fig. 10 shows the internal representation for the HIR given in Fig. 8. The analysis is performed as follows.
[...] operations is stored to a global variable via instruction StaticFieldSet (v15). Deoptfuscator records the locations of v15 and i13. Thereby, i9 and i13 become a pair of global opaque variable candidates.
Fig. 11 displays the constants (green), global variables (red), and a local variable (blue) of the Java source, as well as their locations in the HIR. We can see that the global variables and the local variable are used only in the obfuscation patterns. We can therefore confirm that i9 and i13 are global opaque variables. The information for the confirmed global opaque variables is stored in a temporary file for each method. The above process is repeated for each method in a class. Note that opaque variables can be used in many methods in a class.
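The usage-pattern check can be approximated by following the value of a StaticFieldGet forward through the instruction list. The Python sketch below is a simplified illustration, not Deoptfuscator's actual implementation inside ART. The dictionary-based instruction format, the function name `opaque_sinks`, and the operands of the instructions i12 to v15 are assumptions made for the example; only the instructions described in S1, S2, and S4 follow the text.

```python
from typing import Dict, List, Optional, Set

ARITHMETIC = {"Add", "Sub", "Mul", "Div", "Rem", "Equal", "NotEqual"}
SINKS = {"If", "StaticFieldSet"}


def opaque_sinks(instrs: List[Dict], get_id: str) -> Optional[Set[str]]:
    """Follow the value read by a StaticFieldGet. Return the ids of the
    If/StaticFieldSet instructions it reaches, or None if the value is
    mixed with non-constants or escapes to any other kind of instruction."""
    by_id = {ins["id"]: ins for ins in instrs}
    worklist, seen, sinks = [get_id], set(), set()
    while worklist:
        cur = worklist.pop()
        if cur in seen:
            continue
        seen.add(cur)
        for user in instrs:
            if cur not in user["inputs"]:
                continue
            if user["op"] in SINKS:
                sinks.add(user["id"])
            elif user["op"] in ARITHMETIC:
                others = [by_id[i] for i in user["inputs"]
                          if i != cur and i in by_id]
                if any(o["op"] != "IntConstant" for o in others):
                    return None  # combined with a non-constant value
                worklist.append(user["id"])
            else:
                return None  # value escapes, e.g. into a method call
    return sinks


HIR = [
    {"id": "l8", "op": "LoadClass", "inputs": ["j7"]},
    {"id": "i9", "op": "StaticFieldGet", "inputs": ["l8"]},    # reads f66
    {"id": "i10", "op": "IntConstant", "inputs": []},          # 125
    {"id": "i11", "op": "Add", "inputs": ["i9", "i10"]},       # local i
    # i12..v15: assumed shape of the store to f65 (details are hypothetical)
    {"id": "i12", "op": "IntConstant", "inputs": []},
    {"id": "i13", "op": "Rem", "inputs": ["i11", "i12"]},
    {"id": "v15", "op": "StaticFieldSet", "inputs": ["l8", "i13"]},
    {"id": "i16", "op": "IntConstant", "inputs": []},          # 2
    {"id": "i17", "op": "Rem", "inputs": ["i11", "i16"]},
    {"id": "i18", "op": "IntConstant", "inputs": []},          # 0
    {"id": "z19", "op": "NotEqual", "inputs": ["i17", "i18"]},
    {"id": "v20", "op": "If", "inputs": ["z19"]},
]

print(opaque_sinks(HIR, "i9"))
```

In this sketch the value read at i9 only ever reaches the static store v15 and the branch v20, so i9 qualifies as a candidate. If the same value were passed to an ordinary method call, the function would return None and the field would not be treated as opaque.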

C. OPAQUE LOCATION
The Opaque location module collects the temporary files containing the information for the confirmed opaque variables and records the information in json format. Specifically, the information includes the class name, the method name, the field indexes of the global opaque variables, and the locations of the instructions (sget and sput) that access the global opaque variables. The instruction locations are distances from the method's offset in the DEX file and can be calculated using the locations of StaticFieldGet and StaticFieldSet in the HIR.
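A record of this kind might look like the output of the following Python sketch. The key names, the class name, the field indexes, and the offsets are all hypothetical; the text only states which pieces of information are stored, not the exact schema used by Deoptfuscator.

```python
import json
from typing import Dict, List


def make_location_record(class_name: str, method_name: str,
                         field_indexes: List[int],
                         accesses: List[Dict]) -> Dict:
    """Bundle the information listed in the text into one record.
    The key names are assumptions made for this example."""
    return {
        "class": class_name,
        "method": method_name,
        "opaque_field_indexes": field_indexes,
        # Each access: instruction kind and its distance from the
        # start of the method's bytecode.
        "accesses": accesses,
    }


record = make_location_record(
    "Lcom/example/MainActivity;",   # hypothetical class
    "onCreate",
    [65, 66],                       # hypothetical field indexes
    [{"insn": "sget", "offset": 4}, {"insn": "sput", "offset": 12}],
)
text = json.dumps(record, indent=2)
print(text)
```

Keeping one record per method makes it straightforward for a later stage to locate and rewrite exactly the sget and sput instructions that touch the opaque fields.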

D. OPAQUE CLINIT
The Opaque clinit module removes the detected advanced control-flow obfuscation by lowering the obfuscation level. To decide whether to remove the detected control-flow obfuscation from a class, we measure the ratio of bytecodes matching the obfuscation pattern to the entire bytecodes of a class. We call this ratio the Obfuscated Bytecode Ratio (OBR), which is defined as follows:
$$\mathrm{OBR}(C) = \frac{\sum_{m \in C} N_m}{\sum_{m \in C} L_m} \tag{1}$$
where $N_m$ is the total length of the obfuscation patterns detected in method $m$ of class $C$, and $L_m$ is the length of the bytecode of method $m$.
The series of instructions from l8 to v20 in Fig. 10 is a control-flow obfuscation pattern in HIR. Its corresponding pattern in bytecodes is a series of instructions from 'sget' to 'if-nez' in Fig. 12. For example, consider a class C with two methods m1 and m2. Assume Deoptfuscator detected one obfuscation pattern in m1 and two in m2, and that the obfuscation patterns are the same as in Fig. 12 (from 'sget' to 'if-nez'). Since the length of a bytecode instruction is 4 bytes, the length of an obfuscation pattern is 24 bytes. Thus, the total length of the obfuscation patterns detected in class C is $N_{m1} + N_{m2} = 24 + 2 \times 24 = 72$ bytes.
$L_m$ is the length of the bytecode of method $m$ and can be obtained from the DEX file. Among the items of the DEX file, there are insns and insns_size fields in the code item area. insns is an array containing the bytecode of a method, and insns_size indicates the length of insns. In other words, insns_size is the total length of the bytecode of a method. Let the insns_size of methods m1 and m2 of class C be 100 and 200, respectively. Then $L_{m1} + L_{m2} = 100 + 200 = 300$.
A high OBR implies that obfuscation patterns are found in a class many times. Such a class is likely to be control-flow obfuscated since obfuscators tend to insert obfuscation patterns into a class many times. If the OBR of a class is higher than a threshold θ, Deoptfuscator regards the class as obfuscated and deobfuscates it. Otherwise, the detected obfuscation pattern, if any, is regarded as a false positive. The threshold θ is selected empirically. Using the threshold, we can control how aggressively we deobfuscate classes. As the threshold decreases, the number of classes to which deobfuscation is applied increases (aggressive deobfuscation). As the threshold increases, the number of classes to which deobfuscation is applied decreases (passive deobfuscation). If the threshold is 0, Deoptfuscator deobfuscates all classes. For example, assuming θ = 0.15, the OBR of class C above is calculated as follows, and Deoptfuscator deobfuscates C:
$$\mathrm{OBR}(C) = \frac{N_{m1} + N_{m2}}{L_{m1} + L_{m2}} = \frac{72}{300} = 0.24 > \theta = 0.15$$
For a class with OBR > θ, Deoptfuscator lowers its control-flow obfuscation level from 3 to 2 by converting global opaque variables to local variables. These global variables are defined in method clinit. The Opaque clinit module changes the instruction that reads a global variable ('sget') and the instruction that writes a value to the global variable ('sput') to the instruction that assigns or gets a value of a local variable ('const/16'). Then, the Opaque clinit module removes the codes that declare the global opaque variable pairs. Since the previous steps leave no place in the class where the global opaque variables are used, removing them does not cause errors.
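The OBR test for the example class C can be reproduced with a short Python sketch. It assumes the ratio is the summed pattern length over the summed method length, as in the worked example; the function names and the dictionary-based inputs are hypothetical.

```python
from typing import Dict


def obfuscated_bytecode_ratio(pattern_bytes: Dict[str, int],
                              method_bytes: Dict[str, int]) -> float:
    """OBR of one class: matched pattern length over total bytecode length."""
    total = sum(method_bytes.values())
    if total == 0:
        return 0.0  # a class without bytecode has nothing to deobfuscate
    matched = sum(pattern_bytes.get(m, 0) for m in method_bytes)
    return matched / total


def should_deobfuscate(obr: float, theta: float) -> bool:
    """A class is treated as obfuscated when its OBR exceeds the threshold."""
    return obr > theta


# Class C from the text: one 24-byte pattern in m1, two in m2.
patterns = {"m1": 24, "m2": 2 * 24}
lengths = {"m1": 100, "m2": 200}

obr = obfuscated_bytecode_ratio(patterns, lengths)
print(obr, should_deobfuscate(obr, 0.15))
```

With these inputs the ratio is 72/300 = 0.24, which exceeds θ = 0.15, so class C would be deobfuscated. Raising the threshold above 0.24 would leave the class untouched, which is the passive end of the trade-off described above.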

E. OPTIMIZING DEX
Deoptfuscator optimizes the modified bytecodes (DEX file) utilizing ReDex. As explained in Section II-A, ReDex can remove level 2 control-flow obfuscation. Fig. 13 shows the Java code decompiled from the deobfuscated version of onCreate() of Fig. 5(b). The code of the method has been restored to be the same as the original (Fig. 5(a)).

IV. EXPERIMENTAL SETUP
FIGURE 13: The deobfuscated method by Deoptfuscator, which is in Fig. 5(b)

A. DATASET FOR EVALUATION
We used Android apps collected by the F-Droid project in our experiment (we call them original apps in this paper). From each original app, we generated two obfuscated apps by applying the control-flow obfuscation and optimization of DexGuard: one with the high-level obfuscation/optimization option and the other with the moderate-level obfuscation/optimization option. We selected 63 original apps for which all three versions run normally on an AVD (Android Virtual Device) and an actual smartphone (Pixel 2 XL with Android Oreo 9.0).

B. EXPERIMENTAL METHOD
As described above, 63 highly obfuscated apps and 63 moderately obfuscated apps were generated from 63 original apps. These apps were deobfuscated using our proposed Deoptfuscator. In order to determine how well Deoptfuscator performs, the deobfuscated apps were compared with ones optimized using the ReDex optimization tool. The optimizer does not aim to deobfuscate apps, but it can be considered as minimal deobfuscation in that it eliminates meaningless or unnecessary comparisons and loops. We created 63 optimized apps each for the highly obfuscated apps and the moderately obfuscated ones.
The performance of Deoptfuscator depends on the threshold of Eq. 1. Therefore, we deobfuscated apps by changing this threshold to 0.015, 0.075, 0.15, and 0.225. A total of 504 apps were created: two sets of 252 (63×4) apps, one for the highly obfuscated apps and one for the moderately obfuscated apps. The list and the number of apps used in our experiment are shown in Fig. 14.
Using Androsim, we measured the similarity between the deobfuscated app and the original app. Androsim expresses the bytecode extracted from the Android app as a string and compares the similarity between the two apps on a method-by-method basis.

V. PERFORMANCE EVALUATION
Before analyzing the experimental results, we first describe the difference between DexGuard and ProGuard. First, ProGuard is an optimization-focused tool that removes unnecessary codes from an app. ProGuard also provides identifier renaming obfuscation, which renames identifiers such as classes, methods, and variables to meaningless shorter ASCII names. However, ProGuard's identifier renaming aims to minimize the storage space by shortening identifier names. That is, ProGuard's identifier renaming obfuscates for optimization. Second, DexGuard is a commercial tool based on ProGuard with additional obfuscation and app protection technologies, including control-flow obfuscation. DexGuard's identifier renaming differs from ProGuard's in that names are changed to a short form of special characters (non-ASCII). Because DexGuard performs optimization by default, obfuscated apps have fewer methods than the original apps [15], [25], [29], [36]. Control-flow graph examples that analyze the same method for each app can be found in the APPENDIX.
To measure the deobfuscation ability of the proposed tool, we analyzed the characteristics of the apps according to the criteria mentioned above. The numerical value presented is a normalized value based on the value of the original app. Fig. 15 and Table 2 show the results of measurements for highly obfuscated apps. In the legend of the figure and table, 'original' is the original apps, 'DexGuard' is the obfuscated apps, 'ReDex' is the apps optimized by ReDex. 'θ=n' represents the apps deobfuscated by setting the threshold to n.

A. HIGHLY OBFUSCATED APPS
Looking at the number of methods in the highly obfuscated apps, we can see that it decreases by about 30% compared to the original apps. In contrast, the number of basic blocks and the number of CFG edges increase significantly, to about 8.23 times and 12.8 times those of the original, respectively. Accordingly, the number of basic blocks and edges per method also increases, to 11.5 times and 17.8 times, respectively, and the value of insns, which represents the number of bytecode instructions, increases more than fourfold. Despite the significant increase in the number of basic blocks and instructions, which are related to executable code, the size of the dex file increases by only about 43%, owing to optimizations such as identifier renaming and unnecessary method removal.
We next consider the result of optimizing the obfuscated apps with ReDex. The number of methods shows little difference because DexGuard already removes unused methods along with obfuscation. The number of basic blocks and CFG edges is about 1.57 times and 3.6 times that of the original app, which corresponds to about 19% and 28% of the highly obfuscated app. The number of basic blocks and edges per method shows a similar trend. The length of the bytecode is also about 50% of that of the obfuscated app.
When deobfuscating with Deoptfuscator, the larger the threshold, the fewer the classes to which deobfuscation is applied, and the smaller the threshold, the more classes are deobfuscated. When the number of classes to which deobfuscation is applied is small, most classes are only optimized by ReDex, so deobfuscation and optimization produce similar results. As an example, the result of deobfuscation with θ=0.225 is almost the same as that of optimizing with ReDex. With the other three thresholds, the number of basic blocks was 1.15 times the original, and the number of edges was about twice the original. The length of the bytecode was about 1.06 times the original, which is almost the same size. As shown in Fig. 15, if the threshold is greater than 0.15, the effect of intrinsic deobfuscation almost disappears. Fig. 16 and Table 3 show the experimental results for moderately obfuscated apps. The results show the same tendency as for highly obfuscated apps, but the numbers are smaller because DexGuard applies the same optimization but only a subset of the obfuscation.

B. MODERATELY OBFUSCATED APPS
The number of methods yielded almost the same result as for highly obfuscated apps. In other words, it can be confirmed once again that the decrease in the number of methods is a result of DexGuard's optimization. The number of basic blocks and CFG edges increased to about 3.78 times and 5.19 times those of the original, respectively. The number of basic blocks and CFG edges per method also increased. The length of bytecode increased by about 69%, but the size of the .dex file was reduced to about 76% due to optimization.
ReDex optimization reduces the number of basic blocks and CFG edges to about 28% and 30% of the moderately obfuscated app, which corresponds to about 1.04 times and 1.58 times that of the original app. The number of basic blocks and edges per method shows a similar trend. The length of the bytecode is also about 57% of that of the obfuscated app.
The deobfuscated app has 0.84 times as many basic blocks as the original and 1.12 times as many edges. The length of the bytecode was about 1.06 times that of the original, which is almost the same size.
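The ReDex figures quoted for moderately obfuscated apps can be cross-checked by composing ratios: a count given relative to the obfuscated app, multiplied by the obfuscated-to-original ratio, yields the count relative to the original. The following minimal sketch does this with the rounded values from the text. The helper name `relative_to_original` is ours, and the small gaps to the reported 1.04 and 1.58 are presumably due to rounding of the published percentages.

```python
def relative_to_original(obf_over_orig: float, opt_over_obf: float) -> float:
    """Compose two ratios: (optimized / obfuscated) * (obfuscated / original)."""
    return obf_over_orig * opt_over_obf


# Rounded figures quoted in the text for moderately obfuscated apps.
basic_blocks = relative_to_original(3.78, 0.28)  # text reports about 1.04
cfg_edges = relative_to_original(5.19, 0.30)     # text reports about 1.58
bytecode = relative_to_original(1.69, 0.57)      # +69% length, then 57% of that

if __name__ == "__main__":
    print(f"basic blocks: {basic_blocks:.2f} x original")
    print(f"CFG edges:    {cfg_edges:.2f} x original")
    print(f"bytecode:     {bytecode:.2f} x original")
```

With these rounded inputs, the ReDex-optimized bytecode comes out slightly shorter than the original, while the number of CFG edges stays well above it.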

C. THE DEGREE OF SIMILARITY
How similar the deobfuscated app is to the original app best indicates the effectiveness of a deobfuscation tool. Since control-flow obfuscation is performed on a method-by-method basis, it is reasonable to measure similarity on a method-by-method basis. We use Androguard's Androsim module to calculate the similarity of an original app to the one obfuscated with DexGuard, the one optimized with ReDex, and the one deobfuscated with Deoptfuscator.
As described above, the number of methods in the obfuscated app is about 68% of that of the original one, so the expected similarity to the original is about 68%. In addition, the similarity will be even lower because methods that are not obfuscated can still be modified by optimization. Fig. 17 shows the similarity for highly obfuscated apps. The average similarity of highly obfuscated apps is about 19%, and the average similarity of apps optimized with ReDex is about 26%. The similarity increased because the optimization tool can remove some of the obfuscated code. For the apps deobfuscated with Deoptfuscator, the larger the threshold, the fewer methods deobfuscation is applied to, so the result approaches that of ReDex.

A. RELATED WORK
Piao et al. [37] first inspected both the weakness and the obfuscation process of DexGuard. For an app obfuscated by DexGuard, they could (1) rename classes of a DEX to deobfuscate the identifier renaming technique by analyzing the renaming dictionary of DexGuard and using dex2jar, (2) restore the original strings of encrypted ones by analyzing the string encryption and decryption processes, (3) obtain disassembled smali code whose logic is the same as that of the original source code by breaking the class encryption feature, and (4) remove or skip the tamper detection routines and remake a fake app. They mentioned that the hardest part of analyzing the weaknesses of DexGuard was removing the opaque predicates or understanding the reordered opcodes generated by control-flow randomization. Finally, they presented a server-based obfuscation technique to securely protect the encrypted classes and the tamper detection routine.
Simplify [51] uses a virtual machine sandbox to execute an app and understand its behavior. Simplify analyzes the execution graphs from the virtual machine sandbox and applies optimizations such as constant propagation, dummy code removal, and reflection removal. When these optimizations are applied together repeatedly, they decrypt strings, remove reflection, and simplify the code so that it is easier for humans to understand. Simplify does not rename identifiers.
We conducted some experiments with Simplify on the obfuscated apps used in this paper. Simplify produced deobfuscated output for only 9 of the 63 DexGuard_High apps and 16 of the 63 DexGuard_Moderate apps; an error occurred during deobfuscation for the remaining apps. Since DexGuard changes identifier names to special characters, the errors seem to be caused by string-processing failures. For the 25 apps deobfuscated by Simplify, the control-flow obfuscation of DexGuard could not be handled, and none of them ran in our experimental environment (AVD and real smartphone). Therefore, we do not include these experimental results in the performance evaluation.
DeGuard [52] and Anti-ProGuard [53] are deobfuscation tools that target the identifier renaming obfuscation applied by ProGuard. Bichsel et al. [52] developed DeGuard, a statistical deobfuscation tool for Android that can reverse the layout obfuscation performed by ProGuard and rename obfuscated program elements of Android malware. Their approach phrases the problem of predicting identifier names renamed by layout obfuscation as structured prediction with probabilistic graphical models for identifiers, based on the occurrence of names. DeGuard predicted 79.1% of the obfuscated program elements for open-source Android apps obfuscated by ProGuard and Android malware samples.
Anti-ProGuard [53] also aims to deobfuscate the identifier renaming technique. It takes smali files as input and uses a database that stores obfuscated snippets together with their original counterparts. To improve accuracy, Anti-ProGuard employs similarity hashing rather than exact matching. It could successfully identify over 50% of known packages in Android apps.
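The idea of looking up an obfuscated snippet in a database by similarity rather than by exact match can be sketched as follows. This is only an illustration under our own assumptions: it uses plain Jaccard similarity over token shingles as a stand-in for a similarity hash, and the function names, the threshold `min_score`, and the database entries are hypothetical rather than taken from Anti-ProGuard.

```python
def shingles(code: str, k: int = 3) -> frozenset:
    """Return the set of k-token windows of a whitespace-tokenized snippet."""
    tokens = code.split()
    if len(tokens) < k:
        return frozenset([tuple(tokens)])
    return frozenset(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two shingle sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(snippet: str, database: dict, min_score: float = 0.5):
    """Return (name, score) of the most similar known snippet, or (None, score)."""
    query = shingles(snippet)
    best_name, best_score = None, 0.0
    for name, known in database.items():
        score = jaccard(query, shingles(known))
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= min_score:
        return best_name, best_score
    return None, best_score


# Hypothetical database: original name -> opcode sequence of the known snippet.
database = {
    "com.example.net.Client.connect":
        "const/4 invoke-static move-result if-eqz invoke-virtual return-void",
    "com.example.ui.Form.init":
        "new-instance invoke-direct iput-object iget-object invoke-interface return-void",
}

# The query differs from the first entry in its last opcode only.
query = "const/4 invoke-static move-result if-eqz invoke-virtual return"

if __name__ == "__main__":
    print(best_match(query, database))
```

An exact lookup would miss the query because it is not identical to any stored snippet, whereas the similarity-based lookup still ranks the first entry highest.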
Java-Deobfuscator [54] is a tool that deobfuscates obfuscated Java bytecode and makes it much more readable. It can handle Java bytecode (JAR) obfuscated by commercially available Java obfuscators such as Zelix KlassMaster, Stringer, Allatori, DashO, and DexGuard. Since Java-Deobfuscator is not a tool for Android apps, several steps are required to use it for Android apps. That is, it is necessary to (1) convert the obfuscated DEX file of a given Android app into a JAR file, (2) apply Java-Deobfuscator to the JAR file, and then (3) convert the deobfuscated JAR file into a DEX file again. However, since information is lost when converting the obfuscated DEX file into a JAR file, it is difficult to expect Java-Deobfuscator to work properly, and it is very hard to correctly create and run an Android app from the final deobfuscated DEX file.
Moses and Mordekhay [55] utilized both static and dynamic analysis to defeat two obfuscation techniques: string encryption and dynamic method binding via reflection. Their deobfuscation solution was tested on 586 Android apps containing strings encrypted by the DashO obfuscator. They identified decryption calls, extracted argument values, executed the decryption calls, and obtained the decryption results. They found that the argument values were retrieved for 99% of the decryption calls on average. As future work, they noted the need to handle string encryption even when the decryption logic is not contained in a single function.
De Vos and Pouwelse [56] proposed a string deobfuscator, ASTANA, to identify the deobfuscation logic for each string literal and execute the logic to recover the original string values from obfuscated string literals in Android apps. ASTANA uses program slicing to find an executable code snippet with the statements needed to handle an obfuscated string.
According to the study of Wong and Lie [47], language-based and full-native code obfuscation techniques include reflection, value encryption, dynamic loading, native methods, and full-native code obfuscation. In addition to these traditional obfuscations, Wong and Lie [47] described a set of runtime-based obfuscations in ART such as DEX file hooking, class data overwriting, and ArtMethod hooking. They then developed a hybrid iterative deobfuscator, TIRO (Target-Instrument-Run-Observe), which is a framework for deobfuscating malicious Android apps. TIRO employed both static instrumentation and dynamic information gathering, and could reverse language-based and runtime-based obfuscation techniques.
In our previous work, we analyzed the performance of tools for obfuscating, deobfuscating, and optimizing Android apps [15]. We chose the R8 compiler and Obfuscapk as obfuscators, DeGuard as a deobfuscator, and the R8 compiler and ReDex as optimizers. As the default compiler for Android apps, R8 has various features including optimization (removing unused code, inlining) and obfuscation (identifier renaming). We examined the characteristics of the four tools and compared their performance. R8 showed better performance than ReDex in terms of the number of classes, methods, and resources.
An Android app can contain native code binaries written in C or C++. Thus, there has been a study on deobfuscating Android native binary code rather than Android Dalvik bytecode. Kan et al. [57] proposed an automated system to deobfuscate the native binary code of an Android app obfuscated by Obfuscator-LLVM (O-LLVM). O-LLVM is a popular native code obfuscator that provides three obfuscations: instruction substitution, bogus control-flow, and control-flow flattening. Kan et al. could recover the original control-flow graph of native binary code using taint analysis and flow-sensitive symbolic execution. For example, they used taint analysis for global opaque predicate matching to remove dead branches.
Separately, Ming et al. [34] tried to detect obfuscation techniques based on opaque predicates. Pointing out that existing research was not sufficient to detect opaque predicates in terms of generality, accuracy, and obfuscation-resilience, they suggested a Logic Oriented Opaque Predicate (LOOP) detection tool for obfuscated binary code, which was developed based on symbolic execution and theorem proving techniques. Their approach captured the intrinsic semantics of opaque predicates with formal logic, and could even detect intermediate contextual and dynamic opaque predicates.

B. DISCUSSION
In our previous work [15], we compared optimizers and deobfuscators for Android apps and evaluated their performance. Program optimization is a technique aimed at improving program execution speed by reducing the use of resources as well as by eliminating redundant instructions, unnecessary branches, and null-checks. On the other hand, program deobfuscation focuses on removing or mitigating the obfuscation techniques applied to the program and restoring the obfuscated code to a state that is the same as or similar to the original. Traditional control-flow obfuscation includes call indirection by substituting existing methods and adding new methods, junk-code insertion (insertion of useless computations), abuse of goto instructions, etc. Thus, deobfuscating control-flow obfuscated code seems similar to optimization because it may also improve program execution performance. However, there is a difference in that its key purpose is to restore the control-flow obfuscated app to the original.
In this paper, we devised a new approach to deobfuscating control-flow obfuscated Android apps and verified its effectiveness through various evaluations and similarity measurements. In addition, our approach is flexible and scalable: using OBR, it calculates the proportion of instructions within a class that match the identified control-flow obfuscation patterns, and then lets users decide whether to apply aggressive or passive deobfuscation techniques.
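The threshold-based selection of classes can be illustrated with a minimal sketch. It assumes that the per-class ratio is the number of instructions matched by obfuscation patterns divided by the total number of instructions, and that a class is deobfuscated when this ratio is at least θ. The data structure, the `>=` comparison, the class names, and the instruction counts are all hypothetical and are not taken from the Deoptfuscator implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassStats:
    name: str
    total_instructions: int
    pattern_instructions: int  # instructions matched by obfuscation patterns

    @property
    def ratio(self) -> float:
        """Share of the class's instructions that match obfuscation patterns."""
        if self.total_instructions == 0:
            return 0.0
        return self.pattern_instructions / self.total_instructions


def select_for_deobfuscation(classes, theta: float):
    """Names of classes whose ratio reaches the threshold theta (assumed >=)."""
    return [c.name for c in classes if c.ratio >= theta]


# Hypothetical per-class statistics of an obfuscated app.
classes = [
    ClassStats("La/a;", total_instructions=200, pattern_instructions=60),  # 0.30
    ClassStats("La/b;", total_instructions=400, pattern_instructions=40),  # 0.10
    ClassStats("La/c;", total_instructions=100, pattern_instructions=2),   # 0.02
]

if __name__ == "__main__":
    for theta in (0.05, 0.225, 0.5):
        print(theta, select_for_deobfuscation(classes, theta))
```

In this sketch, a small threshold corresponds to aggressive deobfuscation (more classes are selected), and a large threshold to passive deobfuscation, leaving most classes to the optimizer alone.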
Our work has some limitations. The proposed technique can only handle control-flow obfuscation by DexGuard, and does not consider control-flow obfuscation by other obfuscators including DashO and Allatori. If a developer accidentally writes an app that includes the control-flow obfuscation patterns of DexGuard, a problem may arise when Deoptfuscator removes the global variables to deobfuscate the app. To prevent this problem, we split the detection of opaque variables into a candidate stage and a confirmation stage, using data-flow analysis to check whether an opaque variable is used anywhere other than in the obfuscation pattern. Separately, we checked whether any of a large number of benign apps contained code obfuscated with the control-flow obfuscation technique of DexGuard, but found no such app.
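The two-stage check can be sketched as follows, under simplifying assumptions of our own: each use of a global (static) field is reduced to a record that states whether the use lies inside a recognized obfuscation pattern, a field becomes a candidate if at least one use does, and it is confirmed only if every use does. The record type, function name, and field names are hypothetical. The actual data-flow analysis in Deoptfuscator is more involved.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set, Tuple


class FieldUse(NamedTuple):
    field: str        # fully qualified static field
    method: str       # method containing the use
    in_pattern: bool  # True if the use is inside an obfuscation pattern


def classify_opaque_variables(uses: Iterable[FieldUse]) -> Tuple[Set[str], Set[str]]:
    """Return (candidates, confirmed) opaque variables from field-use records."""
    by_field = defaultdict(list)
    for use in uses:
        by_field[use.field].append(use)

    # Candidate stage: the field appears in at least one obfuscation pattern.
    candidates = {f for f, us in by_field.items() if any(u.in_pattern for u in us)}
    # Confirmation stage: the field is never used outside such patterns.
    confirmed = {f for f in candidates if all(u.in_pattern for u in by_field[f])}
    return candidates, confirmed


# Hypothetical field uses extracted from an app.
example_uses = [
    FieldUse("La/a;->x:I", "La/a;->run()V", True),
    FieldUse("La/a;->x:I", "La/b;->stop()V", True),
    FieldUse("Lapp/Config;->mode:I", "Lapp/Main;->start()V", True),
    FieldUse("Lapp/Config;->mode:I", "Lapp/Main;->render()V", False),
    FieldUse("Lapp/Main;->count:I", "Lapp/Main;->tick()V", False),
]

if __name__ == "__main__":
    candidates, confirmed = classify_opaque_variables(example_uses)
    print("candidates:", sorted(candidates))
    print("confirmed: ", sorted(confirmed))
```

Here `Lapp/Config;->mode:I` stays a candidate only, because one of its uses is ordinary program logic; removing it could change the behavior of the app.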
All apps that have been deobfuscated by Deoptfuscator are executable on both an AVD and a real smartphone. Research on apps with anti-tampering protection is out of the scope of this paper. Thus, if an obfuscated app is equipped with an integrity protection mechanism, the execution of its deobfuscated app cannot be guaranteed because the code has been changed by the deobfuscation.

VII. CONCLUSION
We defined three levels of control-flow obfuscation according to the usage patterns of opaque variables and the type of opaque predicates used in Android apps. DexGuard, a powerful obfuscation tool for Android, offers level 3 (advanced control-flow obfuscation) obfuscation, which uses global variables as opaque variables. Existing deobfuscators and optimizers have difficulty removing level 3 obfuscation code because, if the global variables are arbitrarily removed from the obfuscated app, a fatal error may occur during execution.
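Why arbitrary removal of a global opaque variable is dangerous can be seen in a small analogy. The sketch below is written in Python rather than Dalvik bytecode, and the class, the predicate, and the helper names are hypothetical; it is not the code that DexGuard generates. It only shows that a field read by an always-true predicate cannot be deleted unless the predicate that reads it is rewritten as well.

```python
def build_obfuscated_class():
    """Create a fresh class that mimics a method guarded by an opaque predicate."""

    class Settings:
        # Stand-in for a global (static) opaque variable added by an obfuscator.
        opaque = 7

        @classmethod
        def on_create(cls, value: int) -> int:
            # n * (n + 1) is always even, so this branch is always taken,
            # but the predicate still reads the global variable at run time.
            if (cls.opaque * (cls.opaque + 1)) % 2 == 0:
                return value + 1       # original behaviour
            return value * 31 - 5      # junk branch, never executed

    return Settings


def remove_field_naively(cls, name: str) -> None:
    """Delete the field without rewriting the code that still reads it."""
    delattr(cls, name)


def demo() -> None:
    settings = build_obfuscated_class()
    print("before removal:", settings.on_create(1))
    remove_field_naively(settings, "opaque")
    try:
        settings.on_create(1)
    except AttributeError as exc:
        print("after removal, runtime failure:", exc)


if __name__ == "__main__":
    demo()
```

A safe removal therefore has to eliminate the predicate and the junk branch together with the variable, and only after confirming that nothing else depends on it.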
We then developed Deoptfuscator, which can effectively detect and deobfuscate the code added by the control-flow obfuscation of DexGuard. Deoptfuscator analyzes variable usage patterns to confirm that global opaque variables are used only in opaque predicates. We evaluated its performance with respect to ReDex and demonstrated its effectiveness by showing that the apps deobfuscated by Deoptfuscator can run normally on both a real device and the AVD. We have published the source code of Deoptfuscator in a public GitHub repository, which helps malware analysts reverse control-flow obfuscated malicious Android apps.
Fig. 19 shows four control-flow graphs: the graph of the original code (a), the graph of the code obfuscated from the original with DexGuard (b), the graph of the code optimized from the obfuscated code with ReDex (c), and the graph of the code deobfuscated from the obfuscated code with Deoptfuscator (d). The names of the apk and the method are 'An.stop_9.apk' and 'An.stop.SettingsActivity.onCreate()', respectively. The four control-flow graphs are of the same method, but the names of the package and class have been changed by DexGuard's identifier renaming.