Fast Witness Generation for Readable GUI Test Scenarios via Generalized Experience Replay

Verifying the functional behavior of graphical user interface (GUI) applications is essential for reducing post-release issues. In practice, a developer/tester performs this verification by executing a sequence of GUI actions and then witnessing the expected behavior on the GUI screen. An automated witness generator facilitates the verification process. However, creating an unambiguous and monitorable specification for the correct behavior and then generating the correct GUI actions to trigger that behavior is challenging. In this study, we propose FARLEAD2, an automated witness generator that uses unambiguous, monitorable, and easy-to-read staged test scenarios (STSs) to specify expected behavior. FARLEAD2 maximizes its effectiveness and performance using generalized experience replay (GER) to exploit the experience gathered from previously witnessed scenarios on new, unwitnessed test scenarios. To the best of our knowledge, STS and GER are novel improvements to GUI testing. Our evaluation on Android GUI applications shows that FARLEAD2 effectively generates a witness 95.7 times out of 100 and does so in 520 seconds, on average, indicating that FARLEAD2 is approximately 65 percent faster and 6.3 percent more effective than its best predecessor.


I. INTRODUCTION
Graphical user interface (GUI) applications play a huge role in mobile devices. Statistics show that between 2019 and 2021, on average, a hundred thousand new mobile applications were released on Google Play every month [1]. At this rate, it is inevitable for incorrect behavior to occur in application releases. A survey [2] shows that 78% of mobile GUI application users regularly encounter bugs causing applications to fail their intended functions. To minimize incorrect behavior without slowing down the software development process, developers need a practical way to verify functionality as much as possible before release.
In practice, a developer/tester verifies the correct behavior by observing it after executing a sequence of GUI actions with the help of a test automation tool. A GUI test automation tool can re-execute a given GUI action sequence, saving the developer/tester from manually executing every GUI action.
Furthermore, the developer/tester may insert simple assertions between GUI actions, such as observing a text. Then, these assertions automatically verify correct behavior.
The associate editor coordinating the review of this manuscript and approving it for publication was Porfirio Tramontana.
A developer/tester using a test automation tool must still determine GUI actions and assertions related to the GUI function under test. According to a bibliometric analysis [3], automated GUI test generators have been a growing research topic for 30 years. A GUI test generator produces GUI action sequences, saving the developer/tester from manually determining the GUI actions.
These studies show that a specialized test generator is a necessity to target a specific testing goal.
Most existing research on mobile GUI testing focuses on structural coverage and bugs/crashes, so the goal of verifying GUI functions remains a neglected research topic. Our experience with real-world mobile banking applications [33] shows that a state-of-the-art test generator is ineffective in witnessing GUI functions located deep in the GUI application. Hence, a practical test generator must exhibit better performance and effectiveness. In addition, the same experience shows that developers need an unambiguous, readable, maintainable, and automatically monitorable test scenario language to specify example use cases for GUI functions. The test scenarios would then act as automated test oracles for GUI functions.
We propose a novel test generator called fully automated reinforcement learning driven-2 (FARLEAD2). FARLEAD2 aims to maximize its effectiveness and performance via generalized experience replay (GER), a novel technique that exploits the experience gathered from previously witnessed scenarios on new, unwitnessed test scenarios. In addition, it introduces the novel staged test scenario (STS) language. An STS is not only unambiguous, readable, and automatically monitorable but is also divided into multiple stages. These stages facilitate FARLEAD2's learning process: instead of serving only as a Boolean test oracle, the test scenario acts as a guide that positively rewards FARLEAD2 whenever it witnesses an intermediate stage.
Fig. 1 presents the overview of FARLEAD2. First, the application under test (AUT) gets installed on a mobile device. The mobile device can be an emulator, a VirtualBox guest, a physical device, or a finite state transition system modeling the device. The only requirement is that it accepts an action as input and outputs its device state.
Second, the Developer/Tester provides a test scenario in the form of an STS related to the AUT. The STS monitor receives the STS, then computes the monitor state and calculates the immediate reward to supervise the reinforcement learning (RL) agent. Note that FARLEAD2 uses RL, a popular semi-supervised machine learning technique. RL outperforms humans in many fields like resource management [34], traffic light control [35], chess [36], Atari [37], and chemistry [38]. Furthermore, to the best of our knowledge, the best functional witness-generating predecessor of FARLEAD2 for Android uses RL, too [39], [40].
Third, the GER agent generates labels by processing every generalized unit experience residing in the experience database, where a generalized unit experience is a transition between two device states via an action. The GER agent receives rewards from the STS monitor and learns the initial policy. After exhausting all the generalized unit experiences, the GER agent sends the initial policy to the RL agent.
Fourth, after receiving the initial policy, the RL agent searches for a witness (a replayable action sequence consistent with the given STS) by selecting an action according to its current policy and executing it on a mobile device.
Then, the RL agent observes the device state and generates a generalized unit experience along with labels, storing the generalized unit experience in the experience database and sending the labels back to the STS monitor. The STS monitor calculates the reward from the labels and sends it back to the RL agent. The RL agent learns from the reward and continues searching until it finds a witness or gives up the search after a predefined limit.
Our main contributions to the literature are 1) the novel GER as a mobile GUI testing improvement for effectiveness and performance, 2) the novel STS as a readable test scenario instead of an LTL specification or Gherkin script, and 3) an experimental evaluation of FARLEAD2, demonstrating its superiority over its best predecessor in functional witness generation for Android GUI applications.
The remainder of this paper is organized as follows. Section II discusses the related work. Section III explains the necessary background on GUI testing, RL agents, and ER. Section IV describes our method. Section V presents our evaluation of FARLEAD2. Section VI discusses threats to validity. Section VII concludes by summarizing our findings and stating future research avenues.

II. RELATED WORK
In this section, we discuss related work in three categories: 1) GUI testing, 2) Android runtime monitoring tools that enable automatic following of a specified test oracle, and 3) studies combining RL with a formal specification language, such as linear-time temporal logic (LTL) formulae, where an LTL formula acts as an automated oracle.

A. GUI TESTING
Graphical user interface (GUI) testing is the system testing of an application under test (AUT) through a GUI frontend [41]. As such, GUI testing naturally involves verifying the functional behavior of the AUT. This verification task has two challenges: automated test oracles and test generation.

1) AUTOMATED GUI TEST ORACLES
According to a systematic mapping of GUI testing works [41], most studies on GUI testing do not use any test oracle. We consider these articles to be out of our scope because they do not propose fully automated GUI testing methods.
In addition to being a necessity for fully automated testing, a test oracle also contributes significantly to test effectiveness [42]. GUI testing studies with test oracles mostly used either state references or crashes as test oracles [41]. State referencing is the practice of storing GUI states during test execution and then reviewing the stored states to evaluate AUT behavior. State references are impractical, because the number of GUI states available for a GUI application is typically large. Therefore, reviewing every GUI state is practically infeasible for the developer/tester.
When the test oracle checks for application crashes instead of state referencing, any crash-free behavior passes the test. Consequently, the test oracle cannot distinguish correct behavior even when it observes that behavior, and it accepts faulty behavior as long as the AUT does not crash. Therefore, crash-type test oracles are also not particularly useful for functional verification.
Several studies [19], [23], [43] have used structural coverage criteria as a test oracle. However, traditional coverage criteria work poorly for GUI testing because they are suitable for conventional software systems and not for GUI applications [44], [45]. In addition, a test oracle targeting structural coverage suffers from a problem similar to crash-type test oracles; it may ignore correct behavior or accept faulty behavior as long as it observes a structural coverage increase.
Formal methods-based GUI test oracles target the automatic verification of the correct behavior. These test oracles either monitor a model or a specification during test execution and decide whether the test triggered a target behavior during runtime.
Model-based GUI test oracles require an approximately accurate GUI model to evaluate whether a test has passed. However, these models are often unavailable. Therefore, in practice, the test generator crawls the AUT to generate a model. However, if no one specifies the correct behavior, the crawler cannot know what to look for and is unable to generate the necessary model for verifying a target function. Moreover, conventional GUI models such as finite-state machines do not have sufficient descriptive power for verification [46]. Hence, a model-based GUI test oracle is unsuitable for verifying app-specific functional behaviors.
Initial specification-based GUI test oracle studies used operators with first-order logic semantics to describe the test oracle [45], [46]. Subsequently, GUICOP [47], [48] used a custom specification language with variables, properties, constraints, and relational operators based on propositional logic. These relational operators rely on arithmetic comparisons to describe GUI widgets relative to each other, such as a GUI widget residing next to another widget. Finally, our previous works [39], [40] used linear-time temporal logic (LTL) to describe similar test oracles for Android GUIs. In contrast to other logic systems, LTL has the advantage of providing a natural way to express a target behavior over time. Even so, our experience with a real-world banking application [33] shows that any logic-based specification as a test oracle is impractical because the developer/tester 1) must become proficient in the language or underlying logic (propositional, first-order, or temporal), 2) maintains specifications, which would be easier if they were natural language instead of logical expressions, and 3) presents the test oracle as a test scenario or a software requirement use case to the customer, who does not necessarily know coding or any logic system but needs to know that the AUT's functions work correctly. Hence, a practical test oracle for verifying functional behavior should be a test scenario close to a natural language but also unambiguous and runtime monitorable. Our experience with readable UI automation syntaxes, such as Gherkin [33], also shows that predefined semantics are not intuitive for such syntaxes, making them impractical for runtime monitoring.
To the best of our knowledge, a readable test scenario language that is also a practical automated GUI test oracle such as the STS language is nonexistent in the literature.

2) AUTOMATED GUI TEST GENERATION
The second challenge in verifying the functional behavior of GUI applications is to generate the correct GUI actions that trigger this behavior. Note that this challenge is different from AI-based planning for GUI testing [45], [51], which produces high-level test cases but leaves the coding of all critical decision points of a low-level, executable test to the developer/tester [44].
Several GUI test generators including AutoBlackTest [52], AntQ [53], TESTAR [54], QBE [19], curiosity-driven Q-testing [23], DroidBotX [24], SQDroid [25], Deep-GUIT [26], and SARSA [27] use RL to guide the test generation. All of these test generators are exploratory tools because they aim to explore a given AUT quickly and as much as possible. Hence, they infer reward functions from the structural coverage criteria. However, verifying a specific GUI function requires a goal-oriented testing tool rather than an exploratory tool. Table 1 shows the 33 Android GUI test generators investigated in chronological order. Among these test generators, 27 tools target structural coverage, crash detection, or both. Some specialized test generators focus on accessibility issues, energy consumption problems, common bugs, and triggering of sensitive APIs. These studies show that a specialized test generator is necessary to achieve a specific testing goal.
FARLEAD-Android is the first RL-driven GUI test generator that targets user-specified GUI functions by generating witnesses for these functions. Note that SQDroid uses functional semantics to improve structural coverage and does not target specific GUI functions. Our evaluation in Section V shows that FARLEAD-Android is not always effective in witnessing test scenarios that verify functional behavior. When ineffective, FARLEAD-Android executes as many GUI actions as possible on the AUT without producing anything useful for the GUI function under test, throwing away the experience it could exploit. This problem makes FARLEAD-Android less practical for the developer/tester.
To the best of our knowledge, FARLEAD2 is the first RL-driven GUI test generator that targets user-specified GUI functions and utilizes previous experience to boost the test generator effectiveness.

B. RUNTIME MONITORING FOR ANDROID
A runtime verification (RV) tool monitors the AUT and reports whether an LTL test oracle passes. RV-Droid [55], RV-Android [56], ADRENALIN-RV [57], and Android-SRV [58] monitor Android AUTs. However, these tools monitor the LTL properties at the source code level. Instead, FARLEAD2 monitors at the GUI level. In addition, the RV tools do not generate tests, whereas FARLEAD2 generates tests while monitoring.

C. RL-LTL STUDIES
Several studies [59], [60], [61], [62] have developed RL-LTL systems. Using RL, these systems learn to obey the constraints specified in LTL. These RL-LTL systems must continuously perform their given tasks, without termination. Hence, as in typical RL-LTL approaches, they must converge to an optimal policy to guarantee the highest reliability. Instead, FARLEAD2 is an RL-LTL system with a finite task; therefore, it can terminate once the task is complete. Consequently, FARLEAD2 does not have to converge to an optimal policy, which saves learning time.

III. BACKGROUND
The following three subsections describe (i) the Android GUI environment, (ii) the RL agent, and (iii) ER.

A. ANDROID GUI ENVIRONMENT
We now discuss GUIs, GUI actions, test scenarios, and reward values.

1) GRAPHICAL USER INTERFACE (GUI) AND GUI ACTIONS
A GUI is a visual medium through which a user interacts with the AUT. We aim to automatically verify a GUI function, where a GUI function is an operation of the AUT according to the software requirements of the AUT. Typically, witnessing a test scenario ensures the correct behavior of a GUI function, where the test scenario is just an example use case of a GUI function. A test generator witnesses a test scenario by executing a GUI action sequence consistent with the test scenario. Note that not every GUI action sequence corresponds to a test scenario. Therefore, a GUI action sequence is a candidate, and a candidate is a witness only if all its GUI actions are consistent with the test scenario.
Table 2 lists the GUI actions that we support. Menu, back, and 2×back actions are universal actions that are always enabled. Click, long-click, scroll-up, scroll-down, scroll-left, scroll-right, and write have related GUI widgets, where a GUI widget is a GUI component visible on the screen. These actions are enabled only if a related GUI widget appears on the screen. Technically, we determine the set of currently enabled actions by parsing the XML hierarchy of the current GUI layout. Only the write action has a parameter, which is the text to write on its related GUI widget. We mainly deduce this parameter from the test scenario, although it is possible to provide a dictionary of generic text inputs. Finally, the reinit action is a GUI action with no related GUI widget and is not universal. It is enabled only at the beginning of the GUI action sequence.
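As an illustration of how enabled actions could be derived from a layout hierarchy, the following minimal sketch parses a small XML dump. The attribute names (clickable, long-clickable, scrollable, class) are assumed to follow the common uiautomator dump format, and the sample layout, the function name enabled_actions, and the rule that an EditText accepts write actions are hypothetical choices made for this example, not details taken from FARLEAD2.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout dump; attribute names are assumed to follow the
# uiautomator dump format.
LAYOUT_XML = """
<hierarchy>
  <node class="android.widget.TextView" text="Note 1" clickable="true"
        long-clickable="true" scrollable="false" />
  <node class="android.widget.EditText" text="" clickable="true"
        long-clickable="false" scrollable="false" />
  <node class="android.widget.ListView" text="" clickable="false"
        long-clickable="false" scrollable="true" />
</hierarchy>
"""

UNIVERSAL_ACTIONS = ["menu", "back", "2xback"]  # always enabled
SCROLLS = ["scroll-up", "scroll-down", "scroll-left", "scroll-right"]


def enabled_actions(layout_xml, first_step=False, texts=()):
    """Return (action_type, widget_index, parameter) tuples for one screen."""
    actions = [(name, None, None) for name in UNIVERSAL_ACTIONS]
    if first_step:
        # reinit is enabled only at the beginning of the action sequence.
        actions.append(("reinit", None, None))
    root = ET.fromstring(layout_xml)
    for index, node in enumerate(root.iter("node")):
        if node.get("clickable") == "true":
            actions.append(("click", index, None))
        if node.get("long-clickable") == "true":
            actions.append(("long-click", index, None))
        if node.get("scrollable") == "true":
            actions.extend((scroll, index, None) for scroll in SCROLLS)
        # Assumption: text fields are the widgets that accept write actions.
        if node.get("class", "").endswith("EditText"):
            actions.extend(("write", index, text) for text in texts)
    return actions


actions = enabled_actions(LAYOUT_XML, first_step=True, texts=["Yavuz"])
```

The write parameter is passed in explicitly here, mirroring the idea that candidate texts come from the test scenario or from a dictionary of generic inputs.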

2) TEST SCENARIO AND REWARD VALUES
A test scenario is typically an informal description because it should be human-readable for a developer, tester, or customer with no coding skills. However, the test scenario must also be monitored automatically, which is difficult when it is ambiguous. Hence, a test scenario must be a formal description that is unambiguous and runtime monitorable.
Given a test scenario, we aim to find a witness via trial and error, generating many candidates before the one consistent with the test scenario. We generate one candidate per episode, where each episode is a finite sequence of steps. We generate and execute a GUI action at each step. To learn after every step, we maintain a reward variable R. Most of the R values are typical for an RL agent. When R = +1.00, the candidate at hand (GUI action sequence) is indeed a witness. When R = −1.00, the candidate never becomes a witness because of its previous GUI actions. When R = 0.00, the candidate does not get closer to or farther from being a witness. In addition to these typical values, atypical partial reward values vary between 0.00 and +1.00. In this case, the candidate is not yet a witness but satisfies some of the conditions of becoming one. In the literature, using such partial reward values is known as Reward Shaping (RS) [63].
At every step, the monitor calculates the reward value automatically by checking the currently monitored propositions at that step for consistency with the test scenario. All of these propositions are Boolean. We observe some of these propositions during one step. All these observed propositions are labels for that step.
A step has two types of propositions: necessary and sufficient. These propositions create three possibilities for each step. (i) The sufficient propositions are a subset of the labels. In this case, the monitor returns a positive reward, which is plus one if this is the last step of the given test scenario. (ii) The sufficient propositions are not a subset of the labels, and at least one necessary proposition is not a label. The monitor then returns a reward of minus one. (iii) In all other cases, the monitor returns zero. Note that a proposition is related to either the current GUI action or the GUI state, where a GUI state comprises the attributes of all the GUI widgets on the screen.
Fig. 2 illustrates an example GUI step. On the screen to the left, every necessary proposition is a label. However, the "text IS Moving to recycle bin" proposition is not. We must generate and execute the correct GUI action so that this proposition also becomes a label, which earns a positive reward. In reality, many more GUI actions are enabled on the screen to the left. However, for simplicity, we consider only the Click Cancel, Click Delete, and Click Note 1 actions.
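The three-way reward rule can be sketched as a small function over sets of propositions, applied here to the three actions under consideration. This is only an illustration: the necessary proposition name, the label sets assumed to result from each action, and the partial reward value of 0.5 are hypothetical, because the text fixes only the sign of the reward for a non-final step.

```python
def step_reward(necessary, sufficient, labels, is_last_step, partial=0.5):
    """Three-way reward rule for one step.

    The value of `partial` is an assumption; the rule only requires a
    positive reward below +1 for a satisfied non-final step.
    """
    if sufficient <= labels:            # (i) all sufficient propositions are labels
        return 1.0 if is_last_step else partial
    if not necessary <= labels:         # (ii) some necessary proposition is not a label
        return -1.0
    return 0.0                          # (iii) neither closer nor farther


# Hypothetical propositions for the example step.
NECESSARY = {"activity CONTAINS NoteActivity"}
SUFFICIENT = {"text IS Moving to recycle bin"}

# Assumed labels observed after each of the three actions.
labels_after = {
    "Click Cancel": set(),                    # screen closed, nothing holds
    "Click Delete": NECESSARY | SUFFICIENT,   # popup appears, screen stays
    "Click Note 1": set(NECESSARY),           # screen unchanged
}

rewards = {
    action: step_reward(NECESSARY, SUFFICIENT, labels, is_last_step=False)
    for action, labels in labels_after.items()
}
```

With these assumed label sets, the sketch assigns a negative reward to Click Cancel, a positive reward to Click Delete, and zero to Click Note 1.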
Clicking Cancel closes the current screen. After this action, the monitor generates a minus one reward because (i) the "text IS Moving to recycle bin" proposition did not become a label, and (ii) the necessary propositions stopped being labels. The Click Delete action opens a popup with the "Moving to recycle bin" text without closing the current screen, making all the propositions labels. Therefore, the monitor generates a positive reward. This reward is plus one if the action witnesses the whole test scenario. The Click Note 1 action clicks the text at the top, so the screen remains unchanged: the necessary propositions are still labels, but the "text IS Moving to recycle bin" proposition did not become a label. Therefore, the monitor generates a zero reward. Note that all propositions in the example are state propositions. Action propositions may constrain what actions a test generator should take. In that case, we automatically reduce the set of enabled actions to avoid any future negative rewards and pursue positive rewards.

B. REINFORCEMENT LEARNING AGENT
Fig. 3 shows the flow of the RL agent. First, line (1) initializes the number of episodes (E) to zero. If the number of episodes is equal to or larger than the predefined maximum number of episodes (MaxE), the RL agent terminates because it has failed to generate a witness. Otherwise, the RL agent starts a new episode through lines (2)-(5). Line (2) initializes R as zero. Line (3) increments the number of episodes. Line (4) empties the candidate. Line (5) restarts the monitor, so the monitor starts to observe the test scenario from the beginning. The RL agent begins a new step in lines (6)-(8). Line (6) provides the monitored propositions in the current step. Line (7) obtains the set of enabled actions from the AUT. Line (8) reduces the set of enabled actions according to necessary and sufficient propositions. The RL agent reaches a dead end only if there are no enabled actions after the reduction. In this case, line (9) sets R to minus one, and line (10) updates the action selection policy according to R. Then, the RL agent starts a new episode. Otherwise, line (11) selects a GUI action according to the action selection policy. Line (12) executes this GUI action, and line (13) appends it to the candidate. Line (14) calculates the labels. Finally, line (15) calculates R from these labels, and line (16) updates the action generation policy. If R = +1.00, the candidate is a witness, and the RL agent terminates. If R = −1.00, it starts a new episode. Otherwise, the RL agent proceeds to generate a new step by going to line (6). The RL agent eventually terminates because its monitor has an internal step counter that produces a negative one reward if there are too many steps in the episode.
The RL agent effectively generates a witness only if it produces the witness within the predefined maximum number of episodes. Otherwise, there are two possible explanations. Either we need more episodes to find a witness, or the GUI function is nonexistent in the AUT. Note that, unlike a model checker, the RL agent cannot prove the nonexistence of a GUI function.
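The episode and step structure of the RL agent flow in Fig. 3 can be sketched on a toy finite state transition system. Everything concrete in this sketch is an assumption made for illustration: the transition table, the two-stage scenario, the epsilon-greedy selection with random tie-breaking, and the simple value update stand in for details that the text does not specify, and the comments map each statement only loosely to the line numbers of Fig. 3.

```python
import random

# Toy AUT modeled as a finite state transition system (hypothetical).
TRANSITIONS = {
    ("main", "click_note"): "dialog",
    ("main", "menu"): "main",
    ("dialog", "click_cancel"): "main",
    ("dialog", "click_delete"): "popup",
}
STAGES = ["dialog", "popup"]  # hypothetical scenario: reach these screens in order
MAX_STEPS = 6                 # stands in for the monitor's internal step counter


def enabled(state):
    return [a for (s, a) in TRANSITIONS if s == state]


def choose(q, state, stage, actions, rng, epsilon):
    """Epsilon-greedy selection with random tie-breaking (an assumption)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q.get((state, stage, a), 0.0) for a in actions)
    return rng.choice([a for a in actions if q.get((state, stage, a), 0.0) == best])


def search_witness(max_episodes=100, epsilon=0.2, alpha=0.5, seed=0):
    rng = random.Random(seed)
    q = {}                                        # policy table
    episodes = 0                                  # line (1)
    while episodes < max_episodes:
        reward = 0.0                              # line (2)
        episodes += 1                             # line (3)
        candidate = []                            # line (4)
        state, stage = "main", 0                  # line (5): restart monitor and AUT
        while -1.0 < reward < 1.0:
            actions = enabled(state)              # lines (6)-(8); no reduction in this toy
            if not actions:                       # dead end
                reward = -1.0                     # line (9); policy update omitted here
                break
            action = choose(q, state, stage, actions, rng, epsilon)  # line (11)
            key = (state, stage, action)
            state = TRANSITIONS[(state, action)]  # line (12)
            candidate.append(action)              # line (13)
            if state == STAGES[stage]:            # lines (14)-(15): toy monitor
                stage += 1
                reward = stage / len(STAGES)      # partial reward per witnessed stage
            elif len(candidate) >= MAX_STEPS:
                reward = -1.0                     # too many steps in the episode
            else:
                reward = 0.0
            q[key] = q.get(key, 0.0) + alpha * (reward - q.get(key, 0.0))  # line (16)
        if reward == 1.0:
            return candidate                      # witness found
    return None                                   # gave up after max_episodes


witness = search_witness()
```

Because the search stops at the first witness, the policy table never has to converge; it only has to bias action selection enough to reach the final stage within the episode limit.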
In addition to verifying GUI functions, we can also reproduce a known GUI bug using a test scenario for the bug. From a test generator's perspective, a GUI bug is equivalent to a GUI function because it reproduces a GUI bug in the same way that it verifies a GUI function.

C. EXPERIENCE REPLAY (ER)
Experience replay (ER) [64] is an improvement in RL. ER exploits the experience stored in an experience database from previous tasks instead of discarding it. Fig. 4 illustrates an experience database as an ordered collection of unit experiences. A unit experience is a quadruple ((s, m), a, (s', m'), r), meaning that the AUT goes from one device-monitor state pair (s, m) to the next (s', m') by executing action a and obtaining reward r. The main idea is not to throw away but to save all the unit experiences from previous tasks and then reuse all the unit experiences for the task at hand.
One strength of ER is its ability to influence the RL agent's initial policy without executing any actions on the device. Instead, ER immediately provides reward values to the RL agent from its experiences, allowing the RL agent to start with an initially better action generation policy than a random one. As a result, the RL agent should converge towards its objective faster while avoiding the execution costs of all the unit experiences. As ER gathers more unit experiences, it should become faster.
The underlying assumption of ER is generalization via connectionism. Connectionism implies that the unit experiences are related to the task at hand. Otherwise, they can hinder the learning effectiveness and performance instead of enhancing them. Generalization implies that a generalizable pattern exists between unit experiences. Otherwise, it would not be possible to learn anything from the unit experiences.
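As a minimal sketch of the idea, a unit experience can be represented as a record and replayed into a value table before any action is executed on the device. The record layout follows the quadruple ((s, m), a, (s', m'), r); the example experiences, the function name replay, and the update rule are assumptions made for illustration and are not taken from [64].

```python
from collections import namedtuple

# One unit experience ((s, m), a, (s', m'), r), flattened into a record.
UnitExperience = namedtuple(
    "UnitExperience", "state monitor action next_state next_monitor reward"
)


def replay(database, alpha=0.5):
    """Build an initial value table from stored experiences only.

    The running-average update is an assumed stand-in for the agent's
    actual learning rule.
    """
    q = {}
    for exp in database:
        key = (exp.state, exp.monitor, exp.action)
        old = q.get(key, 0.0)
        q[key] = old + alpha * (exp.reward - old)
    return q


# Hypothetical experience database from an earlier task.
DATABASE = [
    UnitExperience("main", 0, "click_note", "dialog", 1, 0.5),
    UnitExperience("dialog", 1, "click_cancel", "main", 1, -1.0),
    UnitExperience("dialog", 1, "click_delete", "popup", 2, 1.0),
]

initial_policy = replay(DATABASE)
```

Note that the stored reward and monitor states are reused verbatim, which is only meaningful when the new task shares the reward function of the task that produced the experiences.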

IV. METHOD
FARLEAD2 improves RL-driven witness generation through GER and STSs. This section first explains why traditional ER would hamper FARLEAD2's performance and how GER avoids this problem. We then discuss STSs in detail.

A. GENERALIZED EXPERIENCE REPLAY (GER)
Every unit experience ((s, m), a, (s', m'), r) in traditional ER has a fixed reward and fixed monitor states: r, m, and m'. However, once FARLEAD2 witnesses a test scenario, the developer/tester will no longer run FARLEAD2 on that scenario but will use the existing replayable witness instead. Hence, the developer/tester will always use FARLEAD2 with unique test scenarios (scenarios that FARLEAD2 has never witnessed before). Every unique test scenario yields a different reward function and different monitor states. Therefore, if FARLEAD2 uses traditional ER, some of the recorded rewards are bound to be misleading for the new test scenario, hampering the effectiveness and performance of witness generation.
The STS monitor generates its state and a reward value at every step by checking the labels of that step. The RL agent determines these labels by looking at the step's GUI action a and the device state s' that is reached after executing the GUI action. In other words, the reward r and the monitor state m are functions of the GUI action a and the device state s'. Hence, storing only the device states and GUI actions is sufficient to compute the monitor states and calculate the reward values for any test scenario.
A generalized unit experience is a triple (s, a, s'), meaning that the AUT goes from device state s to device state s' by executing action a. Note that storing only (a, s') is sufficient to calculate the reward value. However, it is insufficient to determine which state-action pair (s, a) that value refers to.
Fig. 5 demonstrates an example in which the generalized experience gathered in the first test scenario facilitates witnessing the second. These scenarios involve reaching different AUT screens in two steps. We already have a witness for the first test scenario. This witness has two GUI actions, A and B. For the second test scenario, we do not yet have a witness. Therefore, GUI actions C and D are unknown. Before any exploration, the GER module replays the generalized experience gathered from the first witness. During replay, GUI action A gets a positive reward value because it is consistent with the first step of the second test scenario. However, the GER module assigns a negative reward value to the GUI action B because it is inconsistent with the second step of the test scenario. Consequently, FARLEAD2 selects C=A with no exploration and eliminates B as a candidate for the second step. Overall, the search space for the second witness shrinks, thereby facilitating witness generation.
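The re-scoring of stored triples against a new scenario can be sketched as follows, using a two-action witness like the one in the example. The screen names, the toy monitor (one target screen per stage, with any other outcome treated as inconsistent), and the value update are assumptions made for illustration. The replay order (ascending reward) and the r + 2 repetition count follow the schedule that the GER agent uses in this subsection; truncating r + 2 to an integer for fractional rewards is a further assumption.

```python
# Generalized unit experiences (s, a, s') from the witness of a first scenario.
# Screen and action names are hypothetical.
EXPERIENCES = [("X", "A", "Y"), ("Y", "B", "Z")]


def score(experiences, targets):
    """Re-score stored triples against a new scenario with a toy monitor.

    Assumption: stage i is witnessed by reaching targets[i] in one step, and
    any other outcome is inconsistent with the scenario (reward -1).
    """
    scored, monitor = [], 0
    for s, a, s_next in experiences:              # chronological order
        if monitor < len(targets) and s_next == targets[monitor]:
            reward = (monitor + 1) / len(targets)
            next_monitor = monitor + 1
        else:
            reward, next_monitor = -1.0, monitor
        scored.append((s, monitor, a, reward))    # reward and monitor state are recomputed
        monitor = next_monitor
    return scored


def initial_policy(scored, alpha=0.5):
    """Learn from re-scored experiences, lowest reward first."""
    q = {}
    for s, m, a, r in sorted(scored, key=lambda e: e[3]):   # ascending reward
        for _ in range(int(r + 2)):   # r + 2 repetitions; truncation is an assumption
            old = q.get((s, m, a), 0.0)
            q[(s, m, a)] = old + alpha * (r - old)
    return q


# Second scenario: same first screen (Y), different second screen (W).
policy = initial_policy(score(EXPERIENCES, targets=["Y", "W"]))
```

In this toy setting, action A ends up with a positive value for the first stage and action B with a negative value for the second stage, so a greedy agent would reuse A and avoid B before executing anything on the device.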
The GER agent first traverses all the generalized unit experiences in chronological order and receives rewards from the STS monitor. At this point, the GER agent does not learn the initial policy. Instead, it sorts all the generalized unit experiences according to their rewards, in ascending order. Then, it traverses all the generalized unit experiences in this order, learning from each generalized unit experience r + 2 times, where r is the reward value. Overall, re-learning the positively rewarded generalized unit experiences many times and learning them after other experiences significantly increases their impact on the initial policy.

B. STAGED TEST SCENARIO (STS)
Fig. 6 shows an overview of an STS, where the text inside angle brackets and square brackets denotes variables and optional constructs, respectively. Keywords separated by slashes are alternatives. The main idea behind STS is that it divides the scenario into consecutive STS stages. FARLEAD2 searches for a candidate that witnesses the STS stages in the given order. The developer or tester may specify the maximum number of steps or the time available for the entire scenario or any STS stage. Within the given time constraints and step bounds, a candidate witnesses a stage only if all of its invariants are true (labels) until all of its eventual conditions become true (labels). In other words, all invariants and all eventual conditions are necessary and sufficient propositions, respectively. Note that this is like the until operator in LTL. All the invariants and eventual conditions are Boolean propositions. These propositions do not have to be atomic. They may have several terms connected via AND or OR operators.
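The stage semantics, in which all invariants must hold until all eventual conditions hold, within a step bound, can be sketched as a check over a trace of per-step label sets. The dictionary layout of a stage, the proposition names, and the function name witnesses are hypothetical, and time limits and scenario-wide bounds are left out of this sketch.

```python
def witnesses(stages, label_trace):
    """Return True if the per-step label sets witness every stage in order.

    stages: list of dicts with 'invariants' and 'eventually' (sets of
    proposition names) and 'max_steps' (step bound of the stage).
    """
    index, steps_in_stage = 0, 0
    for labels in label_trace:
        stage = stages[index]
        steps_in_stage += 1
        if stage["eventually"] <= labels:
            # All eventual conditions hold: the stage is witnessed.
            index, steps_in_stage = index + 1, 0
            if index == len(stages):
                return True
        elif not stage["invariants"] <= labels:
            return False                 # an invariant broke before completion
        elif steps_in_stage >= stage["max_steps"]:
            return False                 # step bound of the stage exhausted
    return False                         # trace ended before the last stage


# Hypothetical two-stage scenario over proposition names P, Q, and R.
STAGES = [
    {"invariants": {"P"}, "eventually": {"Q"}, "max_steps": 3},
    {"invariants": set(), "eventually": {"R"}, "max_steps": 2},
]

ok = witnesses(STAGES, [{"P"}, {"P", "Q"}, {"R"}])
broken_invariant = witnesses(STAGES, [{"P"}, set(), {"Q"}])
too_slow = witnesses(STAGES, [{"P"}, {"P"}, {"P"}])
```

As with the until operator, the invariants are not required on the step at which the eventual conditions become true.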
We assume that every Boolean proposition is a triple (property, relation, and value). There are two types of properties: action and state properties. Three types of action properties exist. These are ActionType, ActionParam, and ActionDetail, which are linked to the GUI action type, the action parameter, and an attribute of the related widget, respectively. State properties are either crashed, package, activity, or one of any GUI widget's attributes on the screen. The relation is IS, IS NOT, CONTAINS, or NOT CONTAINS.
In Fig. 6, P, Q, R, and S are some example propositions. These propositions become labels only if the current activity name contains MainActivity, the screen has a text that reads exactly "Note 1", there are no texts that read exactly "Note 1", and the GUI action types the word "Yavuz" into username, respectively.
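A proposition triple can be evaluated against an observation as in the following minimal sketch, which covers the state propositions P, Q, and R. The observation format (a mapping from a property name to the list of values currently observed for it) and the reading of IS NOT and NOT CONTAINS as "no observed value matches" are assumptions inferred from the examples above, not a description of the FARLEAD2 implementation.

```python
def holds(proposition, observation):
    """Evaluate one (property, relation, value) triple on an observation."""
    prop, relation, value = proposition
    values = observation.get(prop, [])
    if relation == "IS":
        return any(v == value for v in values)
    if relation == "IS NOT":
        return all(v != value for v in values)
    if relation == "CONTAINS":
        return any(value in v for v in values)
    if relation == "NOT CONTAINS":
        return all(value not in v for v in values)
    raise ValueError(f"unknown relation: {relation}")


P = ("activity", "CONTAINS", "MainActivity")
Q = ("text", "IS", "Note 1")
R = ("text", "IS NOT", "Note 1")

# Hypothetical observation of the current screen.
observation = {
    "activity": ["com.example.notes.MainActivity"],
    "text": ["Note 1", "Note 2"],
}

# Propositions that currently hold are the labels of the step.
labels = {name for name, p in {"P": P, "Q": Q, "R": R}.items() if holds(p, observation)}
```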

VOLUME 10, 2022
A stage may have optional steps. These steps are a list of action properties. For example, in Fig. 6, S is a step proposition, whereas P, Q, and R are not.
The step propositions can be cumbersome to specify. To address this issue, we have developed shorthand notation. FARLEAD2 automatically converts every shorthand notation in the STS into a proper step proposition. Fig. 6 gives the shorthand notation for S between the round brackets.
The initial state of the STS monitor m indicates that the candidate has not yet witnessed any STS stage. Once an STS stage is complete, the STS monitor moves to the next state m'. Reporting the monitor state is essential for teaching the RL agent the correct order of test steps necessary to witness a given test scenario.

V. EVALUATION
This section describes the experimental setup, discusses the research questions, and evaluates the experimental results.

A. EXPERIMENTAL SETUP

1) TEST GENERATORS
In this study, we compare three test generators: random (RND), reinforcement learning (RL), and generalized experience replay assisted RL (GER).
RND generates random GUI actions, ignoring all rewards. We included RND in our experiments as a baseline for evaluation. Therefore, any other test generator should outperform RND.
RL is equivalent to FARLEAD-Android. To the best of our knowledge, FARLEAD-Android is the most effective test generator for producing functional witnesses. Therefore, we aim to improve on RL.
GER is the proposed method for FARLEAD2. Our experiments demonstrate that GER is more effective and faster than both RND and RL.

2) EFFECTIVENESS
In our evaluation, a test generator was required to produce a witness within 100 episodes. Therefore, the effectiveness of the test generator is the percentage of times it is successful within this limit. We executed the same test generator ten times for the same test scenario under the same conditions. Thus, the witness generator is a hundred percent effective if it generates a witness in all ten runs. Conversely, it is zero percent effective if it fails in every run. We took the average across all test scenarios to measure the overall effectiveness of the test generator.
In summary, the effectiveness is the expected number of witnesses that the test generator produces out of 100 attempts. An ineffective test generator would waste the developer's time without generating any witnesses; therefore, the most critical aspect of a practical test generator is to be as close to a hundred percent effective as possible.
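The effectiveness measure described above can be written as a short computation. This is a minimal sketch; the function name and the input layout (one list of success flags per test scenario) are assumptions, and the sample data are invented purely for illustration.

```python
from statistics import mean
from typing import Dict, List, Tuple


def effectiveness(runs: Dict[str, List[bool]]) -> Tuple[Dict[str, float], float]:
    """Return per-scenario effectiveness (percent) and the overall average.

    runs maps a scenario id to one flag per run: True if a witness was
    produced within the episode limit, False otherwise.
    """
    per_scenario = {
        scenario: 100.0 * sum(flags) / len(flags)
        for scenario, flags in runs.items()
    }
    return per_scenario, mean(per_scenario.values())


# Invented outcomes for two scenarios with ten runs each.
sample = {
    "scenario_a": [True] * 10,
    "scenario_b": [True] * 8 + [False] * 2,
}
per_scenario, overall = effectiveness(sample)
```

Here `per_scenario` holds 100.0 and 80.0, and `overall` is their average, 90.0.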

3) PERFORMANCE
A witness generator outperforms the others if it terminates faster. We have two measures reflecting performance, (i) the total number of steps and (ii) the total seconds it takes until termination. We examine the first measurement to ensure that the latter does not suffer from noise caused by the varying execution times of individual GUI actions on the mobile device. Because we executed the same scenario under the same conditions ten times, we took the average of both performance measures.

4) THE MOBILE DEVICE
Throughout our experiments, the mobile device was a VirtualBox guest with 1024 megabytes of random access memory. The operating system of this device is the Intel x86 port of Android 6.0. Using a VirtualBox guest, we create exact clones of our experimental environment, allowing mass witness generation for different test scenarios in parallel. Furthermore, no physical mobile devices or hardware preparations were required to replicate the experiments.

5) ANDROID APPLICATIONS
We used the Themis Automated Android GUI Testing Benchmark [65] and the F-Droid repository to locate Android applications to evaluate our test generators. Themis is a well-known and maintained benchmark, recently developed to compare Android GUI test generators. F-Droid is an open-source Android GUI application repository. Multiple Android GUI test generators in the literature [19], [24] use Android applications from this repository.
We evaluate the experimental test generators over two Android applications: Notes from F-Droid and Wikimedia Commons from Themis. Notes is a small-sized (2 MB) Android application similar to the other small note-taking applications in the Themis benchmark, such as Omni-Notes and Scarlet-Notes. The Commons application is a medium-sized (17 MB) Android application.
The Notes application allows users to create four types of notes: audio, text, sketch, and checklist. Furthermore, users can construct categories and divide notes into these categories. This application has a known bug in its sketch notes where the color palette has no black color, preventing users from making black drawings [66]. The Commons application allows users to search for pictures in the public domain. Users may view these pictures and their descriptions.

6) TEST SCENARIOS
Our experimental setup has 24 test scenarios: 17 and 7 for the Notes and the Commons applications, respectively. The complexities of these test scenarios vary between 2 and 13, where we define the test scenario's complexity as the length of its shortest witness. The shortest witness length is the minimum number of steps (GUI actions) necessary to witness a test scenario. We argue that a test generator would face more difficulties in a complex test scenario because of the number of unknown steps that need to be discovered. Fig. 7 shows an example of a manually generated witness for test scenario 014. The existence of this witness places an upper bound of seven on the complexity of this test scenario. We manually produced witnesses for all the test scenarios to determine their complexity.
An STS is a flexible structure, allowing the developer/tester to incorporate a priori information about the test scenario. We call this information hints. According to the hints given in an STS, there are two extreme STS types: declarative and imperative.
A declarative STS only contains the information necessary for a scenario. This information includes (i) the invariants and (ii) the eventual conditions of every stage. Therefore, the developer/tester declares only what the test generator should witness. In contrast, an imperative STS defines the steps of every stage. It shrinks the search space; therefore, there is often only one candidate. However, it is cumbersome to maintain an imperative STS because it requires restructuring after almost any software update, whereas a declarative STS should work across multiple versions of the AUT.
For every experimental test scenario, we have four STSs with four levels of hints: L4 (imperative), L3 (manual), L2 (automated), and L1 (declarative). Hence, for the 24 test scenarios, we obtained a total of 96 STSs. Fig. 8 shows the L1-L4 STSs for test scenario 014. The first stage of L1 has no invariants and only one eventual condition: the AUT package starts on the device. The second stage has one invariant, indicating that the AUT package must remain active until the second stage's eventual condition is satisfied, that is, until ChecklistNoteActivity is on the screen. The third stage has the same invariant, but now the AUT package must remain active until the device is in ChecklistNoteActivity while a text that reads "checkitem" and a checked checkbox are on the screen. Overall, the L1-STS states that (i) eventually, the AUT must be opened; (ii) after that, eventually, ChecklistNoteActivity must be opened; and (iii) finally, ChecklistNoteActivity must be on the screen with the checklist containing a checked item and the "checkitem" text. Whenever FARLEAD2 encounters a text proposition, it automatically treats writing that text to any appropriate GUI widget as an enabled action.
We automatically generated L2-STSs from L1-STSs through intent-resolution analysis [6]. Intent-resolution analysis is a static analysis of an Android GUI application binary that extracts the static activity transition graph (SATG) of an AUT. The SATG determines which activities a tester can reach from any given activity, and using the SATG, FARLEAD2 updates the given STS with extra invariants and eventual conditions. For test scenario 014, FARLEAD2 determined that it could reach ChecklistNoteActivity via MainActivity. Therefore, it automatically restricts its search to these activities by adding activity constraints to the appropriate stages of the STS. Overall, the L2-STS reduces the search space without manual effort.
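One plausible way to turn an SATG into activity constraints is to keep only the activities that lie on some path from the starting activity to the target activity. The sketch below illustrates that idea with two breadth-first searches; it is an assumption about how such a restriction could be computed, not a description of FARLEAD2's internals. The graph contents other than MainActivity and ChecklistNoteActivity are hypothetical.

```python
from collections import deque
from typing import Dict, Set


def reachable(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Breadth-first search: all nodes reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def activities_on_paths(satg: Dict[str, Set[str]], start: str, target: str) -> Set[str]:
    """Activities that are reachable from start and from which target is reachable."""
    reverse: Dict[str, Set[str]] = {}
    for src, dsts in satg.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)
    return reachable(satg, start) & reachable(reverse, target)


# Hypothetical SATG; only the first two activity names appear in the text.
satg = {
    "MainActivity": {"ChecklistNoteActivity", "SettingsActivity"},
    "SettingsActivity": {"AboutActivity"},
}
allowed = activities_on_paths(satg, "MainActivity", "ChecklistNoteActivity")
```

In this example, `allowed` contains only MainActivity and ChecklistNoteActivity, so activity constraints built from it would exclude the settings and about screens from the search.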
The L3-STS incorporates any hints that the developer or tester may provide, except for the steps (GUI actions) themselves. First, MAXSTEPS defines the maximum number of steps allowed for each stage. Setting every MAXSTEPS condition to its absolute minimum forces the test generator to produce the shortest witness. As a result, the test generator spends more time than it would need to find an arbitrary witness. Therefore, we slightly relaxed the MAXSTEPS values. Second, "actionType" propositions restrict the action type. Third, an extra stage called ExpandDrawer states that the text "New" must appear on the screen before reaching the ChecklistNoteActivity. All these additions shrink the search space but require additional manual effort.
The L4-STS is imperative and describes all the steps, so almost no learning is required. For example, Fig. 7 shows that after the GUI action D, FARLEAD2 must still learn the correct GUI widget to write on. With the L4-STS, the search space is the smallest, but the developer/tester determines all the GUI actions manually, making the manual effort of writing an L4-STS the highest among all STSs. Still, writing an executable test script requires coding skills, whereas an L4-STS is a no-code script. Thus, writing an L4-STS requires less effort than writing a test script.
We have designed one test scenario (four STSs) to reach each activity of the Notes application (test scenarios 001-009). Hence, witnessing all the test scenarios achieved full activity coverage. We have created a test scenario (test scenario 012) to reproduce the palette bug [66]. Finally, the rest of the test scenarios concern the main GUI functions of the Notes application, namely, creating, deleting, recycling, and categorizing notes. For the Commons application, the first five scenarios were activity-reachability scenarios. The remaining two verified different GUI functions of the application.

7) GENERALIZED EXPERIENCE REPLAY (GER) SETUP
GER depends on the experience gathered so far. Our experimental setup starts with no experience and executes GER on the test scenarios in the order of increasing test complexity. Even though GER re-witnesses an STS multiple times in our experiments, it never uses the experience of the same STS. GER selects one run per previous STS and uses the cumulative experience gathered only from these STSs. Hence, we expect GER to produce results similar to those of RL in test scenario 001. GER will have the most experience when it witnesses the most complex test scenario (Notes, test scenario 017). Note that we used separate experience databases for each STS level. Finally, although we perform our experiments in parallel, GER waits for the previous scenarios to finish before generating witnesses.
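The experience selection rule described above can be sketched as follows. The function name, the database layout, and the opaque string records are assumptions for illustration; the text does not specify how FARLEAD2 stores experience. The sketch assumes one such database per STS level and pairs run k of a scenario with run k of every earlier scenario.

```python
from typing import Dict, List

# database[scenario_id][run_index] -> experience records of that run.
# The records are kept opaque here; one database is assumed per STS level.
Database = Dict[str, Dict[int, List[str]]]


def experience_for(target: str, run_index: int, order: List[str],
                   database: Database) -> List[str]:
    """Collect experience from one run of each scenario that precedes target.

    order lists the scenario ids by increasing test complexity. The target's
    own runs are never used, so the first scenario starts with no experience.
    """
    previous = order[: order.index(target)]
    pool: List[str] = []
    for scenario in previous:
        pool.extend(database.get(scenario, {}).get(run_index, []))
    return pool


# Illustrative ordering and records; the ids and contents are invented.
order = ["001", "002", "003"]
level_db: Database = {
    "001": {0: ["e1"], 1: ["e2"]},
    "002": {0: ["e3"], 1: ["e4"]},
    "003": {0: ["e5"], 1: ["e6"]},
}
first = experience_for("001", 0, order, level_db)
third = experience_for("003", 1, order, level_db)
```

With this layout, `first` is empty, which is why GER is expected to behave like RL on the first scenario, and `third` combines run 1 of the two earlier scenarios.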

8) OVERALL
Our experimental setup has three test generators (RND, RL, and GER), 96 STSs, and ten runs for each test generator-STS combination. Hence, there were a total of 2880 experimental runs (3 × 96 × 10). We collected four values for every experimental run: (i) success/fail, (ii) total number of steps, (iii) total number of seconds, and (iv) witness length. The first value measures effectiveness, the second and third values measure performance, and the last value measures test complexity.

B. RESEARCH QUESTIONS
Our experimental setup aims to answer the following research questions.
RQ1 (feasibility) verifies that every experimental test scenario has positive utility in evaluating effectiveness, performance, and test complexity. If the underlying GUI function that a test scenario exploits is nonexistent in the AUT, there are no witnesses for that test scenario. Then, the effectiveness would be zero percent regardless of the test generator, and the performance and witness length measurements would be impossible. We aim to show that at least one witness exists for every experimental test scenario.
RQ2 evaluates the most crucial criterion for a witness generator: its effectiveness. Depending on the test scenario, an ineffective test generator frequently fails in practice, frustrating the developer/tester. We aim to ensure that GER is more effective than RND and RL in witness generation.
RQ3 evaluates how fast a test generator terminates. A faster and more effective test generator would produce more witnesses within a constant testing budget, providing the developer/tester more utility. We aim to demonstrate that GER outperformed RND and RL in our experiments.
Finally, RQ4 aims to determine GER's effectiveness and performance under different STS levels.
Table 3 shows our overall experimental results comparing the effectiveness, time, steps, and witness lengths of RND, RL, and GER. Each result for the Notes and Commons applications is an average across the STSs of that application. Each overall result is an average across all STSs.

C. EXPERIMENTAL RESULTS
(RQ2 and RQ3) Overall, FARLEAD2 (GER) generated a witness 95.7 times out of 100 and did it in 140 steps and 520 seconds, on average. FARLEAD2 (GER) was 6.3 percent more effective and 339 seconds (approximately 65 percent) faster than its best predecessor, FARLEAD-Android (RL). Furthermore, it was 28.6 percent more effective and 582 seconds (112 percent) quicker than RND.
For the small-sized Notes application, FARLEAD2 (GER) generated a witness 94.3 times out of 100 and did it in 167 steps and 555 seconds, on average. Hence, in the Notes application, FARLEAD2 (GER) was 7.2 percent more effective and 223 seconds (approximately 40 percent) faster than its best predecessor, FARLEAD-Android (RL). Furthermore, it was 27.8 percent more effective and 95 seconds (17 percent) quicker than RND.
For the medium-sized Commons application, FARLEAD2 (GER) generated a witness 99.3 times out of 100 and did it in 73.6 steps and 433 seconds, on average. Hence, in the Commons application, FARLEAD2 (GER) was 4.3 percent more effective and 622 seconds (approximately 144 percent) faster than its best predecessor, FARLEAD-Android (RL). Furthermore, it was 30.7 percent more effective and 1766 seconds (408 percent) quicker than RND.
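The percentages in the three paragraphs above are consistent with dividing the time difference by GER's own average time. This reading is inferred from the reported numbers rather than stated explicitly in the text, so the short check below should be taken as a sanity check under that assumption. The function and variable names are ours.

```python
# (GER seconds, seconds saved vs. RL, seconds saved vs. RND), as reported above.
reported = {
    "Overall": (520, 339, 582),
    "Notes": (555, 223, 95),
    "Commons": (433, 622, 1766),
}


def percent_faster(ger_seconds: float, seconds_saved: float) -> float:
    """Time saved, expressed as a percentage of GER's own average time."""
    return 100.0 * seconds_saved / ger_seconds


rounded = {
    name: (round(percent_faster(ger, vs_rl)), round(percent_faster(ger, vs_rnd)))
    for name, (ger, vs_rl, vs_rnd) in reported.items()
}
```

Under this assumption, `rounded` reproduces the quoted figures: 65 and 112 percent overall, 40 and 17 percent for Notes, and 144 and 408 percent for Commons.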
All results in Table 3 show that FARLEAD2 (GER) is consistently more effective and faster than its alternatives, revealing the benefits of experience replay. Fig. 9 shows the effectiveness scores of RND, RL, and GER for every test scenario, averaged across all STS levels (L1-L4). The lines in this figure represent the effectiveness scores listed in Table 3.
(RQ1) RND, RL, and GER did not have zero effectiveness in any test scenario, indicating that all the test scenarios are feasible.
Fig. 10 shows the test generation times (Fig. 10a) and the number of steps (Fig. 10b) of GER under the L1-L4 STSs. Because there were no discrepancies between the step count and time results, we assume the noise in test generation times was negligible. Therefore, in this case, the test generation time is an accurate performance measure.
(RQ4) Fig. 10a shows that for both the Notes and Commons applications, more hints yielded faster test generation times. Without hints (L1, declarative), it took FARLEAD2 (GER) more than 1000 seconds to generate a witness on average. Enabling L2 STSs significantly improved the average test generation performance. However, Fig. 10a further shows that test scenarios 014 and 016 of the Notes application required more than 4000 seconds with L1 and L2 STSs, respectively, whereas L3 STSs never took more than 2000 seconds. Therefore, with manually generated L3 STSs, GER provided more reliable performance than L1 or L2 STSs. L4 STSs required almost no time to witness.

VI. DISCUSSION
We now discuss the threats to the validity of our experimental setup, methodology, and implementation. Specifically, we elaborate on construct and external validity.

A. CONSTRUCT VALIDITY
The maximum number of steps allowed in each episode was 30. The highest test complexity in our experiments is less than 30; therefore, this limit will not cause ineffectiveness in any test generator. In addition, our experience with Android GUI applications shows that developers implement simple GUI functions such that any user can perform them via fewer than 30 consecutive actions.
In our experiments, the episode limit is 100. Our experience shows that the test generators become ineffective when this limit is too low. Conversely, a high episode limit causes the test generator to waste more time when a witness cannot be found. We refrained from finding the optimal episode limit to avoid bias toward our experimental AUTs. Instead, we chose the episode limit arbitrarily, considering only the testing budget.
Because we re-executed every experimental run ten times, the GER setup collected experience for the same scenario multiple times. However, in practice, the developer/tester would never re-execute FARLEAD2 on a test scenario with an already available witness. To reflect this fact in our experimental environment, we forced every run of a test scenario to use only the experience gathered during the respective runs of the previous test scenarios. Note that all of our experimental test scenarios are distinct; therefore, GER always uses the experiences of witnessed test scenarios on unwitnessed test scenarios.
Every test generator that we evaluated must monitor the STSs. The original FARLEAD-Android monitors LTLs instead of STSs; therefore, we had to make technical modifications to implement RL in our experiments. During the experiments, we noticed that cycles in the state transitions may create a positive feedback loop in which FARLEAD-Android (RL) and FARLEAD2 (GER) get stuck. Thus, our modified implementation proceeds to the next episode when it encounters a cycle that is not a self-loop. For fairness, we also force RND to go to the next episode in the same cases.
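The cycle rule can be illustrated with a small helper that scans the states visited in one episode and reports where the episode would be cut. This is a sketch under assumptions: it treats a state as any hashable abstraction of the screen and defines a cycle as revisiting an earlier state of the same episode, whereas the text does not spell out how states are compared. The function name is hypothetical.

```python
from typing import Hashable, Optional, Sequence


def first_cycle_step(states: Sequence[Hashable]) -> Optional[int]:
    """Index of the first state that closes a cycle other than a self-loop.

    Returns None if the episode contains no such cycle. Staying in the same
    state on consecutive steps (a self-loop) is tolerated.
    """
    seen = set()
    previous = None
    for index, state in enumerate(states):
        if state in seen and state != previous:
            return index  # revisited an earlier state: end the episode here
        seen.add(state)
        previous = state
    return None


# Abstract state labels, invented for illustration.
cut = first_cycle_step(["A", "B", "B", "C", "A"])
no_cut = first_cycle_step(["A", "A", "B"])
```

In the first sequence, the repeated "B" is a self-loop and is ignored, while returning to "A" at index 4 closes a cycle, so `cut` is 4. The second sequence has only a self-loop, so `no_cut` is None.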
Finally, although there are many Android GUI test generators in the literature, we did not include all of these generators in our evaluation. First, to the best of our knowledge, none of these test generators implement STS monitoring. Second, even if we implemented STS monitoring on top of these test generators, it would be an unfair comparison because these modified test generators cannot benefit from the rewards produced by the STS monitor. Hence, we compared our proposed method to FARLEAD-Android, the best predecessor of FARLEAD2 for generating witnesses for given GUI test scenarios in Android. Note that FARLEAD2 implements experience replay on top of FARLEAD-Android. Therefore, we compared FARLEAD2 with FARLEAD-Android to demonstrate the benefits of experience replay.

B. EXTERNAL VALIDITY
FARLEAD2 (GER) achieved 100 percent activity coverage on the Notes and Commons applications. RND and FARLEAD-Android (RL) also reached 100 percent activity coverage in our experiments because we specified one test scenario per activity, and all the test generators produced witnesses for every test scenario. As a result, we did not include activity coverage in our comparisons because it did not distinguish test generator effectiveness and performance when witnessing a test scenario. The developer/tester must provide feasible test scenarios for all activities of an AUT to achieve 100 percent coverage.
The set of supported actions directly affects witness generation effectiveness. Our experience shows that a single back action is sometimes insufficient to return to the previous activity. Hence, we introduced the 2×back action. In addition, some real-world applications involve dynamically loaded activities. These activities may require the user to wait for a few seconds before performing any action. Therefore, for practical use of FARLEAD2, it may be necessary to introduce the option of waiting for a few seconds as a GUI action.
The Android applications in our experiments were standalone. However, for example, the developer/tester needs at least two devices to test a messaging application. Thus, the witness becomes not just a GUI action sequence but at least two GUI action sequences interleaved. Additional implementation is necessary to support such test cases.
Finally, the FARLEAD2 performance and effectiveness results under increasing test complexity may not generalize to overly complex test scenarios, owing to the explosion in the search space. If the shortest witness for a test scenario is too long for FARLEAD2 to find, the developer/tester might consider dividing the test scenario into two or more test scenarios.

VII. CONCLUSION
In summary, we have proposed FARLEAD2, a fast witness generation method for readable test scenarios using generalized experience replay (GER). We have described the novel staged test scenario (STS) language and explained how GER works with STS instances via flowcharts and examples. Our experiments have shown that FARLEAD2 generates a witness 95.7 times out of 100 and does it in 520 seconds, on average, indicating that FARLEAD2 is approximately 65 percent faster and 6.3 percent more effective than its best predecessor.
In the future, we will evaluate how different test scenario orderings affect witness generation and determine the best test scenario ordering for the developer/tester. We will execute GER, RL, and RND on large-scale AUTs to generalize our results further. We will train multiple RL agents with different test scenarios (hence, multi-objective) within the same execution, enabling simultaneous witness generation for many test scenarios. Finally, we will also conduct a manual testing study to evaluate the benefits of automated witness generation.