Automated Detection of Software Performance Antipatterns in Java-Based Applications

The detection of performance issues in Java-based applications is not trivial since many factors contribute to poor performance, and software engineers are not sufficiently supported for this task. The goal of this manuscript is the automated detection of performance problems in running systems to guarantee that no quality-related hindrances prevent their successful usage. Starting from software performance antipatterns, i.e., bad practices (e.g., extensive interaction between software methods) expressing both the problem and the solution with the purpose of identifying shortcomings and promptly fixing them, we develop a framework that automatically detects seven software antipatterns capturing a variety of performance issues in Java-based applications. Our approach is applied to real-world case studies from different domains, and it captures four real-life performance issues of Hadoop and Cassandra that were not predicted by state-of-the-art approaches. As empirical evidence, we calculate the accuracy of the proposed detection rules, we show that code commits inducing and fixing real-life performance issues present interesting variations in the number of detected antipattern instances, and we show that solving one of the detected antipatterns improves the system performance by up to 50%.


INTRODUCTION
The performance evaluation of Java-based applications is challenging due to many variabilities, such as software failures and workload fluctuation in requests [1], [2], [3], that may occur when the system is running and inevitably affect the overall service quality [4], [5], [6]. Understanding whether an application can always meet the desired performance (e.g., the system response time must be shorter than 5 seconds or resource utilization must be lower than 80%) is of key relevance since it impacts the perception of end-users and their satisfaction while interacting with the system [7], [8].
In the literature, several approaches have been proposed for modeling, analyzing, and optimizing the performance of software applications [9], [10]. Two main directions have been pursued: (i) model-based performance analysis, i.e., performance models are built out of Java applications [11], [12] and used for predictions; (ii) application performance monitoring, i.e., tools that collect trace data for inspection [13], [14]. Motivated by the recent trend of integrating development (Dev) and operations (Ops) teams, processes, and tools [15], [16], [17], it is necessary that software engineers be aware of the performance evolution of their applications. If performance issues are detected, then engineers must also be able to promptly fix such issues. To this end, several approaches emerged, e.g., automated performance tests [18] to guarantee the prompt identification and fixing of performance degradation, or performance load testing [19] to evaluate software refactorings that most likely lead to performance improvement. However, most of the approaches in the literature, e.g., [2], [20], [21], [22], act statically on the implementation code. A recent study [23] pointed out that static code analysis may fail to capture complex root causes of real-life performance issues, e.g., the interactions between procedures that occur only when the source code is executed. Hence, in this paper we aim to exploit dynamic information, which is fundamental for the detection of some performance issues, at the cost of deploying a testing environment for profiling applications and introducing runtime effort.
We focus our attention on identifying performance issues in Java-based applications, i.e., the target systems under analysis are already in production and subject to multiple variations, such as changes in the execution environment. As a motivating example, let us consider a real-life case study [24] where a performance overhead of 17% is experienced for the continuous integration of a software release. The diagnosis of performance problems is indeed non-trivial: a study in [25] reports that the Apache project, which is relevant to our research since Java is used, shows the longest average diagnosis time (194 days). Our goal is to improve the system performance by identifying the bad practices of software components and fixing them before the system becomes unusable. Let us consider as an example of bad practice a software component that monopolizes the processing (namely the Blob [26]). This implies a single and complex controller component that orchestrates the computation by extensively interacting with other components. As a consequence, the system response time can suffer from such behaviour. To fix this problem, it would be beneficial to involve other software components and delegate part of the overall computation to them. This way, the system response time can improve since it benefits from some processing running in parallel. To achieve this objective of identifying bad practices, we use software performance antipatterns [26], [27] since they include the description of both (i) the problems leading to performance flaws, and (ii) the best practices aimed at obtaining performance improvements.
In the context of Java-based applications, we focus on the following seven software performance antipatterns: 1) Circuitous Treasure Hunt (CTH), 2) Extensive Processing (EP), 3) Wrong Cache Strategy (WCS), 4) Blob, 5) Tower of Babel (ToB), 6) Empty Semi Trucks (EST), and 7) Excessive Dynamic Allocation (EDA). Further details on these antipatterns and the motivation for their selection are provided in Section 3.3. We develop a framework to automatically detect these seven performance antipatterns and we evaluate it on a varied set of real-world applications. The conducted experimentation supports the following main findings: (i) our framework is efficient: the detection of antipatterns is performed, on average, in less than a minute; (ii) our framework is accurate: the F1 score is larger than 85% in the considered cases; (iii) our framework advances state-of-the-art methodologies: it detects complex performance problems not recognized by other tools; (iv) antipattern-based refactoring can lead to system performance improvement of up to 50%. The main contributions of our work are as follows: the specification of seven software performance antipatterns that are customized to verify a set of properties for Java-based applications; the development of JPAD, Java Performance Antipattern Detector, a framework that automatically detects the seven software performance antipatterns; the evaluation of JPAD efficiency and accuracy on five real-world applications with different complexity and representative of multiple domains; the comparison of JPAD with state-of-the-art approaches on the detection of real-life performance issues in nine code commits of two further systems; empirical evidence on the benefit of solving performance antipatterns. In summary, our approach advocates the usage of software performance antipatterns as valuable support to automatically detect performance issues of Java-based applications. The benefit is that software engineers are promptly informed of software components that show specific bad practices and are candidates for refactoring.
The rest of the manuscript is organized as follows. Section 2 reviews the related work. Section 3 describes our approach, and we discuss the key properties of software performance antipatterns, so as to motivate the choice of implementing some of them. Details on detection algorithms for the seven implemented software performance antipatterns are provided in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TSE.2023.3234321; the rationale for not implementing some of them is explained in Appendix B, available in the online supplemental material. Research questions, analyzed software systems, and the experimental evaluation are presented in Section 4. Threats to validity are argued in Section 5. A discussion on limitations of the approach is reported in Section 6. Concluding remarks and possible directions for future research are outlined in Section 7. Replication data are publicly available [28].

RELATED WORK
Our work mainly relates to three streams of research, i.e., architectural antipatterns, code smells, and Java-specific performance issues that are briefly reviewed in the following. This manuscript moves a step forward in the attempt of establishing synergies between architectural antipatterns and code smells for the performance evaluation of Java applications.
Architectural Antipatterns. In the broader context of (anti)patterns and quality attributes (e.g., reliability, security), there are several works that aim to match their connections, e.g., [29], [30], [31], [32]. When focusing on performance-related concerns there is much less work. Software performance antipatterns are studied first by Smith [26], [27] who provides the preliminary definitions based on her experience. The specification is expressed in natural language and is technology-independent, meaning that antipatterns can be customized in many different contexts. Other researchers redefine these natural language definitions using first-order logical predicates later applied to architectural design models [33] and recently adapted to further architectural formalisms such as probabilistic model checking [34]. A first attempt at adopting architectural antipatterns in running systems is provided in [35], where problem root causes are isolated and a graph of dependencies is built to match problems with the specification of antipatterns.
Static and Dynamic Approaches for Code Smells. In the literature, extensive work is devoted to investigate code smells [36], [37], [38], and several investigations are performed, e.g., the analysis of (i) inter-smell interactions to understand their effects [39] and (ii) sequences of different kinds of bad smells to improve detection and solution [40]. Static analysis techniques are adopted to locate bugs in software, e.g., performance bugs that waste processing time due to superfluous loop iterations are detected in [21], and tools like FindBugs [41] can find potential root causes for performance antipatterns. Object-Relational Mapping (i.e., non-trivial database access) is exploited in [20] where static analysis can detect a huge number of performance antipattern instances. Li et al. [22] keep using static analysis, but focus on problematic duplicate logging code smells to emphasize that logging code is highly associated with both the structure and the functionality of the surrounding code. An approach to identify code changes that may potentially be responsible for performance regressions is proposed in [42], but it does not analyze root causes behind such regressions. An exploratory study on performance regressions is presented in [43] where six code level root-causes are identified, but they mainly refer to inner changes, e.g., function calls or parameter values. An attempt of using static and dynamic metrics is proposed in [44] where evolutionary algorithms are adopted to detect performance regressions, however causes are not treated. Static and dynamic data is exploited also in [45] where a selection of benchmarks limits runtime overhead at the cost of missing the prediction of some performance issues. Preventing performance issues before the commit of code changes is pursued in [23] where random forest classifiers are trained on large datasets of performance regressions.
More recently, an experience report on locating the root causes of performance regressions is presented in [46] where web-access logs are exploited to tune the expected workload. However, this report targets only web-based systems, and machine learning techniques may fail to capture the relationship between problematic runtime activities and their impact on system performance. Summarizing, the main difference between static and dynamic approaches is that the former can detect some performance issues during the development process and require less effort, whereas the latter can benefit from runtime information, probably capturing a larger set of issues (as confirmed in [23]), with the drawback of introducing monitoring effort.
Java-Specific Performance Issues. Our work mainly relates to Java-specific performance antipattern definitions, and there are several approaches that identify different bad practices in Java platforms and APIs [47], [48], [49], [50]. The difficulty of evaluating the performance of Java applications is acknowledged also in the testing domain; Java open source projects are usually subject to a limited number of performance tests that are rarely updated and typically maintained by a small group of developers [51]. In [52] a rule-based approach is proposed to detect performance antipatterns from runtime traces while targeting Java EE antipatterns. In [53] load testing and profiling data are exploited to detect bad practices in Java applications. This approach leverages performance experts to identify problematic snapshot(s), and the antipattern detection is performed (similarly to approaches dealing with performance regressions [23], [45]) comparing snapshots with the recognized problematic one(s).
To the best of our knowledge, there are few attempts in the direction of bridging performance and code-related issues with performance antipatterns. Our approach relies on dynamic analysis and adopts software performance antipatterns to identify bad practices arising when the system is in execution. The main difference w.r.t. the work most closely related to ours [35], [53] is that we consider a set of seven software performance antipatterns applied to a broad set of real-world case studies. Experimental results are promising and foster further research.

OUR APPROACH
This section introduces the approach proposed to detect software performance antipatterns in Java applications. The design of the approach is driven by the following key insight. Each performance antipattern describes a bad design practice that can be (partially) observed and checked by a combination of a particular set of system key properties including design characteristics (e.g., large number of calls) and performance metrics (e.g., long execution time). Our detection approach relies on monitoring these system key properties. A system component is recognized to be an antipattern when its design characteristics and/or performance metrics deviate most from the average values, as calculated considering all other components belonging to the same system. Fig. 1 depicts the workflow of our approach to automatically detect software performance antipatterns in Java applications. First, a load test suite is defined for each system, and it is used to simulate the interaction of users with the system under analysis. Then, the system under analysis is launched and the profiler is attached. We are aware that runtime performance monitoring of Java applications is expensive and the profiling process can generate overhead, however this is also assessed as a necessary task to collect the performance characteristics of interest [54]. After attaching the profiler, test suites are executed. During their execution, we periodically capture snapshots that contain the readings related to CPU, threads, and memory, later exported in CSV or XML files, according to the represented data. These files are provided as input to JPAD, i.e., the tool we developed for the automatic detection of software performance antipatterns.
In the sequel of the section, we detail the operational tasks of the proposed framework, i.e., how the load testing is performed, the technology used, and the criteria adopted for defining the test suite. Then, we present the application profiler used to monitor the systems under analysis during the execution of the load tests. We also describe the seven software performance antipatterns that we implement into JPAD to enable their automatic detection. Starting from the natural language definition of antipatterns, the detection rules are expressed in terms of the data acquired through the profiler. JPAD takes as input the readings from the profiler and makes use of such data to automatically perform the detection of software performance antipatterns.
JPAD is a JavaFX application that allows the user to load the profiling data exported from YourKit (i.e., the input files) and contains an embedded console where the results are reported. The tool works with thresholds and offsets that establish when performance issues arise. A threshold value t represents a boundary on a specific performance metric value v, and JPAD verifies if v > t. An offset value o is instead a percentage addendum that allows deviating from systems' properties extracted as average values, i.e., a property on a specific system component (p) is compared with the average across all components (P), and JPAD verifies if p > P + P · o%. Note that both thresholds and offsets are used to evaluate a specific test input. The goal is to detect those system characteristics that deviate, under the same input assumptions, with a certain margin from requirements and/or average values. For instance, 10% can be set as a threshold for the CPU usage, and JPAD considers suspicious all CPU resources showing larger values. As a further example, let us consider 5% as the offset on method number of calls. JPAD calculates the average value (considering the total number of calls across all the system methods) and augments such average by 5%. This way, JPAD detects as suspicious all the methods whose number of calls exceeds the average by more than 5% within a specific test input. Note that these threshold and offset settings can be modified by users in case further knowledge is available and other values are considered more appropriate. Sensitivity analysis on threshold values is discussed in Appendix C, available in the online supplemental material.
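The two checks described above can be illustrated with a minimal Java sketch. It is not the JPAD implementation: the class name, the method names, and the sample data are hypothetical, and the average is assumed to be taken over the methods present in the input map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative sketch only; all names are hypothetical and not taken from JPAD. */
final class ThresholdOffsetSketch {

    /** Threshold rule: a metric value v is suspicious when v > t. */
    static boolean exceedsThreshold(double value, double threshold) {
        return value > threshold;
    }

    /** Offset rule: p is suspicious when p > P + P * o%, where P is the average. */
    static boolean exceedsOffset(double property, double average, double offsetPercent) {
        return property > average + average * offsetPercent / 100.0;
    }

    /** Methods whose call count exceeds the average augmented by the offset. */
    static List<String> suspiciousMethods(Map<String, Long> callsPerMethod, double offsetPercent) {
        // Assumption: the average is computed over the methods in the input map.
        double average = callsPerMethod.values().stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(0.0);
        return callsPerMethod.entrySet().stream()
                .filter(e -> exceedsOffset(e.getValue(), average, offsetPercent))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up call counts, for illustration only.
        Map<String, Long> calls = new LinkedHashMap<>();
        calls.put("OrderService.load", 100L);
        calls.put("OrderService.save", 104L);
        calls.put("OrderService.scan", 120L);
        // The average is 108; with a 5% offset the boundary is 113.4,
        // so only OrderService.scan is reported.
        System.out.println(suspiciousMethods(calls, 5.0));
        // CPU usage of 12.5% against the 10% threshold mentioned in the text.
        System.out.println(exceedsThreshold(12.5, 10.0));
    }
}
```

Both comparisons are strict, mirroring the v > t and p > P + P · o% conditions in the text, so a value that sits exactly on the boundary is not reported.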

Load Testing
To define the required tests for each system under analysis, we explore the target application to understand key functionalities in execution. We are aware that designing load tests represents a threat to the internal validity of the approach. Indeed, we may miss some functional test cases, and there might be functionalities that are not monitored. Hence, there is a risk of not detecting some antipatterns and consequently not capturing performance issues. This threat is mitigated by delegating the design of load tests to software engineers, who can decide which system functionalities require performance testing. To support software engineers in this task, as a rule of thumb, we suggest selecting the software component(s) showing the highest CPU utilization as candidates for load testing. To support this guideline, in our experimentation (see Section 4) we show that such a selection leads to the detection of real-life performance issues.

Application Profiling
To profile the applications under analysis during their execution, we use the YourKit Java Profiler [55]. This decision is motivated by several reasons. First, YourKit is widely used for the performance evaluation of real-world applications both in industry [56] and in academia [53], [57]. Second, precise instructions about reducing or avoiding performance overhead (generated from the profiler) are provided [58]. Third, YourKit is a powerful application profiler, it monitors several aspects of the profiled system, and provides a view for multiple system features. Besides, the output of YourKit (i.e., the readings, see Fig. 1) can be easily exported in different data formats. For the purpose of this work, we monitor usage statistics: CPU usage, memory usage, CPU hotspots, call tree, blocked threads, and garbage collected objects. CPU and memory usage measure the percentage of used resources during the execution of the application. CPU hotspots are those methods that spend the longest time on the CPU. Call tree is divided in two different views: (i) a merged call tree that shows a top-down call tree of all application threads merged together into a single tree, and (ii) a call tree by thread that shows an individual top-down call tree for each application thread. From these files we can extract further information, e.g., callers and callees of each method. Blocked threads represent threads that fail to immediately enter the synchronized method/block. Garbage collected objects allow estimating the load of the garbage collector. All these statistics are fed to JPAD that uses them to detect software performance antipatterns.

Antipattern Detection
The detection of software performance antipatterns relies on our interpretation of the natural language specification of software performance antipatterns [26]. Specifically, after extracting key properties, we implement the antipattern detection rule if a match with profiling data exists. Our effort is mainly devoted to matching high-level guidelines and making them concrete for the inspection of Java applications. In this context, the difficulty lies in combining different sources of information (e.g., a class calling a high number of methods, high CPU usage), since the intrinsic nature of software antipatterns is to look for various performance problems that may arise when applications are running. Since the specification of antipatterns cannot be completely precise [26], the conditions and key properties we check as our detection rules approximate the antipatterns. These conditions are neither sufficient nor necessary; we empirically validate their relevance in spotting performance issues. Table 1 reports a subset of seven software performance antipatterns proposed in [26], and we motivate in the following the choice of implementing them. The reason why some antipatterns cannot be automatically detected is argued in Appendix B, available in the online supplemental material. The first column reports the name of the performance antipatterns, and the second column describes the problem expressed in natural language. The third column lists the extracted key properties (along with the implemented helper functions) and provides a match with profiling data. The fourth column lists thresholds and offsets that are included in the detection algorithms. Helper functions are briefly described in Table 2; thresholds/offsets are presented in Table 3 along with the heuristics adopted in case users do not define their own preferences on perceived performance issues. Offsets are all set to 5% since we are interested in capturing small deviations from average values. Thresholds are set to 10%: we refer to [59], where the CPU load in the idle phase is estimated to be 7% on average, and we are interested in excluding outliers and fluctuations that might be due to the system's internal routines.
The detection algorithms and implementation details of these antipatterns are given in Appendix A, available in the online supplemental material, along with the match between textual descriptions and detection rules. In the following, we briefly discuss our interpretation of antipatterns, and their key properties.
Circuitous Treasure Hunt (CTH) occurs when a Java application must perform a large number of (database) queries to manage a request. This problem can be generalized considering a method that performs a chain of queries where the result of one query is used to build the next one, instead of writing a single, more complex request. As key properties, we check the average number of calls performed by each thread (call-tree-by-thread view of YourKit), along with the average processor utilization (chart-cpu-usage view of YourKit) indicating if the system performance suffers.
Extensive Processing (EP) occurs when a long running job monopolizes the processor and creates a queue of processes that cannot be executed until the computation of such job is completed. As key properties, we monitor the average number of blocked threads (monitor-usage view of YourKit), and the execution time of Java methods (call-tree-all-threads view of YourKit), selecting only those methods that are showing a large number of the identified blocked threads.
Wrong Cache Strategy (WCS) occurs when too many objects (or objects hardly ever used) are cached. This generates performance overhead resulting in high memory usage. As key properties, we check the average memory usage of methods (method-list-allocation view of YourKit), and we are interested in verifying whether memory is used more heavily than the processor (method-list-cpu view of YourKit).
Blob occurs in two different scenarios that contribute to distinguish two different types of antipattern instances in Java applications, i.e., Blob-Controller and Blob-DataContainer. The former is observed when a class centralizes many responsibilities, delegating minor roles to other classes. Classes affected by this problem usually are complex controllers which depend on simpler classes (with little to no computation). The latter case is observed when a class includes most of the application data and other functions need to access that class to retrieve/update the data. As key properties, we monitor the average number of calls performed and received by the methods (call-tree-all-threads view of YourKit) to determine if these methods can be classified as potential controllers or data containers, respectively. Besides, we also check if such large number of calls impacts on the average usage of processor and memory (chart-cpu-usage and chart-heap-memory-usage views of YourKit).
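As a rough illustration of the first Blob key property, the following Java sketch aggregates the calls performed and received by each method from a list of caller-to-callee edges and flags those exceeding the average by more than the offset. It is an assumption-laden sketch rather than the algorithm of Appendix A: the CallEdge type, the method names, and the sample edges are hypothetical, each average is taken only over the methods that appear as callers (respectively, callees), and the additional check on processor and memory usage is omitted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative sketch only; all names are hypothetical and not taken from JPAD. */
final class BlobCandidateSketch {

    /** One caller-to-callee edge of a call tree, with its invocation count. */
    static final class CallEdge {
        final String caller;
        final String callee;
        final long count;

        CallEdge(String caller, String callee, long count) {
            this.caller = caller;
            this.callee = callee;
            this.count = count;
        }
    }

    /** Total number of calls each method performs (outgoing edges). */
    static Map<String, Long> callsPerformed(List<CallEdge> edges) {
        Map<String, Long> totals = new HashMap<>();
        for (CallEdge e : edges) {
            totals.merge(e.caller, e.count, Long::sum);
        }
        return totals;
    }

    /** Total number of calls each method receives (incoming edges). */
    static Map<String, Long> callsReceived(List<CallEdge> edges) {
        Map<String, Long> totals = new HashMap<>();
        for (CallEdge e : edges) {
            totals.merge(e.callee, e.count, Long::sum);
        }
        return totals;
    }

    /** Names whose total exceeds the average of the given totals by more than offsetPercent. */
    static TreeSet<String> aboveAverage(Map<String, Long> totals, double offsetPercent) {
        double average = totals.values().stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(0.0);
        double boundary = average + average * offsetPercent / 100.0;
        TreeSet<String> result = new TreeSet<>();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            if (e.getValue() > boundary) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up call tree edges, for illustration only.
        List<CallEdge> edges = Arrays.asList(
                new CallEdge("Controller.run", "A.get", 10),
                new CallEdge("Controller.run", "B.get", 10),
                new CallEdge("Controller.run", "Store.read", 60),
                new CallEdge("A.get", "Store.read", 20),
                new CallEdge("B.get", "Store.read", 20));
        // Calls performed: Controller.run=80, A.get=20, B.get=20 (average 40, boundary 42).
        System.out.println("Controller candidates: " + aboveAverage(callsPerformed(edges), 5.0));
        // Calls received: A.get=10, B.get=10, Store.read=100 (average 40, boundary 42).
        System.out.println("Data container candidates: " + aboveAverage(callsReceived(edges), 5.0));
    }
}
```

In this sketch a method can appear in both candidate sets, which is consistent with the remark that a class may act as a controller towards some classes and as a data container towards others; whether JPAD reports both is not stated in the text.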
Tower of Babel (ToB) occurs when some data is translated into an exchange format, such as XML, by the sending process. This data is later parsed and translated into an internal format by the receiving process. This means that the system may spend most of its time processing the text when the translation and parsing of data formats are excessive. As key properties, we check all the methods showing a large execution time (call-tree-all-threads view of YourKit), and then we inspect the names of such methods (cpu-hotspots view of YourKit). We check if there is a match with some specific keywords (i.e., "converse", "parse", and "translate"), thus to associate the performance overhead to the processing of exchange format data.
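The combination of the two ToB properties (large execution time and a keyword match on the method name) can be sketched in Java as follows. This is only a sketch under stated assumptions: the class and method names and the sample timings are hypothetical, "large execution time" is interpreted through the offset rule against the average, and the keyword match is assumed to be a case-insensitive substring test, which the text does not specify.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative sketch only; all names are hypothetical and not taken from JPAD. */
final class TowerOfBabelSketch {

    /** Keywords quoted in the text. */
    static final List<String> KEYWORDS = Arrays.asList("converse", "parse", "translate");

    /** Assumption: a case-insensitive substring match on the method name. */
    static boolean nameMatches(String methodName) {
        String lower = methodName.toLowerCase(Locale.ROOT);
        return KEYWORDS.stream().anyMatch(lower::contains);
    }

    /** Methods that are slower than average (plus offset) and whose name matches a keyword. */
    static List<String> candidates(Map<String, Double> execTimeMs, double offsetPercent) {
        double average = execTimeMs.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        double boundary = average + average * offsetPercent / 100.0;
        return execTimeMs.entrySet().stream()
                .filter(e -> e.getValue() > boundary)
                .filter(e -> nameMatches(e.getKey()))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up execution times in milliseconds, for illustration only.
        Map<String, Double> times = new LinkedHashMap<>();
        times.put("XmlCodec.parseDocument", 900.0);
        times.put("Billing.compute", 950.0);
        times.put("Util.parseFlag", 5.0);
        times.put("Dto.getId", 1.0);
        // The average is 464 ms and the boundary is 487.2 ms. Billing.compute is slow
        // but has no keyword; Util.parseFlag has a keyword but is fast.
        System.out.println(candidates(times, 5.0));
    }
}
```

A pure name-based match is cheap but can produce false positives (a method named after a keyword that does no format conversion) and false negatives (a converter with an unrelated name), which is one reason the text pairs it with the execution-time condition.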
Empty Semi Trucks (EST) occurs when an excessive number of requests is required to perform a task. As key properties, similarly to the Blob-Controller, we consider the number of calls performed by the methods (call-tree-all-threads view of YourKit), and we verify if the execution time of such calls shows a small coefficient of variation, i.e., the execution time follows a deterministic distribution (call-tree-all-threads view of YourKit). This way, we aim to capture the peculiarity of this antipattern when inefficiently using the bandwidth and/or interfaces.
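The coefficient of variation used for EST is the ratio between the standard deviation and the mean of the observed execution times; it is 0 for a deterministic distribution and 1 for an exponential one. A minimal Java sketch of this computation follows. The class name, the method names, and the maxCv cut-off are hypothetical, since the text does not state which value counts as "small", and the population (rather than sample) standard deviation is an assumption.

```java
import java.util.Arrays;

/** Illustrative sketch only; all names are hypothetical and not taken from JPAD. */
final class CoefficientOfVariationSketch {

    /** Coefficient of variation: population standard deviation divided by the mean. */
    static double coefficientOfVariation(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        double mean = Arrays.stream(samples).average().getAsDouble();
        if (mean == 0.0) {
            throw new IllegalArgumentException("the mean must be non-zero");
        }
        double variance = Arrays.stream(samples)
                .map(x -> (x - mean) * (x - mean))
                .sum() / samples.length;
        return Math.sqrt(variance) / mean;
    }

    /** Assumption: maxCv is a user-chosen cut-off for a "small" coefficient of variation. */
    static boolean isNearlyDeterministic(double[] samples, double maxCv) {
        return coefficientOfVariation(samples) <= maxCv;
    }

    public static void main(String[] args) {
        // Made-up execution times, for illustration only.
        double[] constant = {10.0, 10.0, 10.0, 10.0};
        double[] spread = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        // Identical samples have zero standard deviation, hence CV = 0.
        System.out.println(coefficientOfVariation(constant));
        // Mean 5, population standard deviation 2, hence CV = 0.4.
        System.out.println(coefficientOfVariation(spread));
        System.out.println(isNearlyDeterministic(constant, 0.1));
    }
}
```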
Excessive Dynamic Allocation (EDA) occurs when an application unnecessarily creates and destroys objects. As key properties, we monitor the objects collected by the Java garbage collector (method-list-garbage view of YourKit), and we want to recognize those situations leading to a performance overhead, i.e., the average memory utilization (chart-heap-memory-usage view of YourKit) is high.
It is worth remarking that our detection algorithms may be orthogonal, i.e., they may return the simultaneous occurrence of multiple antipatterns since they share the verification of some performance indices, e.g., the CPU utilization. This is not a drawback of the approach since a performance issue may be caused by the simultaneous presence of multiple bad practices [60].

EXPERIMENTATION
This section is organized as follows. We first present the research questions (Section 4.1), followed by the description of the analyzed real-world applications (Section 4.2). We describe the experiment design (Section 4.3), and we discuss the obtained experimental results (Section 4.4).

Research Questions
The purpose of our experimental evaluation is threefold: (i) it shows that JPAD is efficient and accurate when applied to real medium/large-sized systems; (ii) it compares JPAD with state-of-the-art approaches on the detection of real-life performance issues; (iii) it provides empirical evidence on the usefulness of detecting antipatterns. In particular, we aim to answer three research questions:
RQ 1 Efficiency and accuracy of the antipatterns' detection rules: Are the proposed detection rules efficient and accurate? Does JPAD efficiently detect antipattern instances in real-world case studies? Motivation. We want to provide developers with an efficient and accurate framework that points out performance issues, if any. By evaluating the efficiency of JPAD, we can understand whether developers are motivated to apply our approach in practice.
RQ 2 Comparison with state-of-the-art approaches: Are the proposed detection rules comparable to other techniques in the literature? Does JPAD capture complex performance problems that are not recognized by available tools? Motivation. We want to compare our framework with state-of-the-art approaches that deal with the detection of performance issues. By comparing JPAD to other tools, we can study whether developers are motivated to use our framework.
RQ 3 Implication of applying antipattern-based refactorings: What happens when an antipattern is solved? How does the number of detected antipattern instances for the refactored system change? What is the effect on the overall system performance? Motivation. The goal of our approach is to spot performance problems, so that developers are aware of possible shortcomings in some portions of the code. In this research question, we aim to establish to what extent our approach benefits developers interested in knowing whether code fixes improve the system performance.
To answer the first two research questions, we analyze real-world Java applications under different loads, i.e., varying the number of clients and the execution time. For each combination of (i) number of clients and (ii) duration (in minutes), the application is profiled, data is collected, and JPAD is used to detect the software performance antipatterns. Moreover, to further stress the benefit of the proposed framework and to answer the third research question, we provide empirical evidence of its impact by solving one instance of detected antipatterns (for one of the analyzed systems), and showing variations in the number of detected instances and the system performance.

Analyzed Systems
We select five Java applications that are highly concerned with their performance and have been studied in prior research [46], [61], [62]. These applications show a different complexity (in terms of number of classes) and belong to different domains, see Table 4 for their main characteristics. The rationale for selecting these five subject systems is that they provide evidence of all the seven performance antipatterns, as later described in the experimental results (see Section 4.4). Specifically, systems from [46] do not include ToB and EST antipattern instances. We select one system from [61] to provide evidence on the ToB antipattern. To strengthen the analysis of the EST antipattern, we borrow the subject system used in [62]. Hereinafter, a brief description of the analyzed systems is provided.
CloudStore [63] is a free software synchronizing files between multiple devices. Its primary focus is on preventing data loss and unauthorized access. TeaStore [64] emulates a basic web store for automatically generated tea and tea supplies. It has been published in [65], and later largely used as a microservice reference system and test application. WebGoat [66] is a deliberately insecure web application maintained by OWASP and designed to teach web application security lessons. TrainTicket [67] is the largest benchmark for microservice architectures in the literature. It provides train ticket booking functionalities and is used for fault analysis and error prediction [62], [68]. OpenMRS [69] is a free medical record system for health care providers. It is a modular open-source web application used by over 40 countries to improve health care delivery in resource-constrained environments. Moreover, to compare JPAD with state-of-the-art approaches [23], [45], [53], we consider four real-life performance issues (see details in Table 5) of two other systems: Hadoop [70] is a framework which performs data processing in a reliable, efficient, high fault tolerance, low cost, and scalable manner. Cassandra [71] is a distributed NoSQL database management system; fault-tolerance on commodity hardware makes it suitable for mission-critical data. Table 5 reports the code commits (inducing and fixing the four real-life performance issues) that we analyze, according to the study on performance regressions presented in [23]. The motivation of selecting these specific four performance issues is the following. Hadoop commits are evaluated in [23] by means of multiple performance metrics (i.e., response time, CPU and memory utilization, I/O operations) and we focus on those issues that have been highlighted as particularly complex, since such issues are not predicted by any of the considered metrics. 
As for Cassandra, there is only one issue that is not predicted by PerfJIT in [23], and this is why we concentrate our effort on investigating that specific issue. Summarizing, our investigation includes those specific four issues (triggering the analysis of nine code commits) since they are the most relevant ones for conducting a comparison.

Experimental Setup
In the following, we discuss the design choices taken to run experiments and avoid biases in results.
System Workload Specification. To avoid biases when profiling applications, we test several workloads acting on the considered systems for different durations. The choice of the number of clients and the duration of load testing is not trivial; however, our assumption is that this is decided by software developers who are aware of load conditions and the system's dynamics. Subject systems in Table 4 run for 3, 6, and 12 minutes with 25, 50, 75, and 100 clients. A larger variation in the number of clients is considered for the issues listed in Table 5 and related to Hadoop and Cassandra; these systems are tested with 1, 10, 100, 500, and 1k clients running for 3, 6, and 12 minutes. Each combination of number of clients (C) and duration of the testing (D) represents an input we use in our detection. Such a combination leads to a system configuration that is labeled in the following as C-D; for instance, the system configuration "1000-6" means 1000 clients with a 6-minute testing duration. Overall, 300 different system configurations are analyzed to explore a varied set of system characteristics.
Load Test Definition. To avoid biases in the load testing, we explore benchmarks stressing different aspects (e.g., input/output operations, end-users services) of the considered applications. For instance, we make use of available benchmarks for Hadoop and Cassandra (i.e., TestDFSIO [72] and Cassandra Stress [73]) stressing write and read operations. Locust [74] (i.e., a Python-based load testing tool) is adopted for the systems reported in Table 4 since ready-to-use benchmarks are not available. We stress end-users services identified as crucial for the application, e.g., in TrainTicket we generate requests for monitoring the security service invoked when users make a reservation. 
As anticipated in Section 3.1, to support software engineers in the selection of system functionalities to be monitored, as a rule of thumb, we consider the CPU utilization of system software components. This analysis supports the component choosing stage of the approach, which decides which component(s) to test; the selected components are given as input to our detection. As an example, let us consider Fig. 2: ResourceManager and NameNode show a CPU utilization lower than 8%, whereas the CPU used by SecondaryNameNode is even less than 2%. This denotes that it is not relevant to profile such components.
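The rule of thumb above can be sketched as a simple filter over per-component CPU readings. This is only an illustration: the function name `select_components`, the cutoff `min_cpu_pct`, and the readings for NodeManager and DataNode are hypothetical; the text only states upper bounds for ResourceManager, NameNode, and SecondaryNameNode.

```python
def select_components(cpu_utilization, min_cpu_pct=8.0):
    """Return the components whose CPU utilization (in percent) reaches the cutoff.

    The cutoff is a hypothetical parameter, not a value prescribed by JPAD.
    """
    return sorted(name for name, pct in cpu_utilization.items() if pct >= min_cpu_pct)


# Illustrative readings only: the text states that ResourceManager and NameNode
# stay below 8% and SecondaryNameNode below 2%; the other two values are made up.
readings = {
    "ResourceManager": 7.0,
    "NameNode": 6.5,
    "SecondaryNameNode": 1.5,
    "NodeManager": 35.0,
    "DataNode": 22.0,
}

selected = select_components(readings)
print(selected)
```

With such readings, only the components that actually consume CPU under load would be passed on to profiling; the cutoff itself remains a choice of the engineer.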
Load Test Execution. To avoid biases in the obtained results, all experiments are run on a single-node cluster deployed on a dedicated machine (with a 2.80 GHz quad-core CPU and 16 GB memory) to easily monitor used resources and avoid misleading performance results due to background activities. Readings extracted from CPU, memory, and garbage collector snapshots taken for each configuration are fed to JPAD, which automatically detects performance antipatterns.
Antipattern Detection and Thresholds Setup. We recall that most thresholds are calculated through heuristics that consider average values and offsets, see Table 3. This means that each system configuration includes thresholds that vary when changing the number of clients and the duration. In the following, as illustrative examples, we report the threshold values used with the lowest (25 clients, 3 minutes) and the largest (1k clients, 12 minutes) boundary values adopted in the proposed experimentation. Table 6 reports the threshold numerical values used to analyze the systems when setting the number of clients to 25 and the duration of the run to 3 minutes. In the first row of Table 6 we can notice that the countTh threshold is calculated on average values, and it shows a large variation among the analyzed systems (e.g., 108 for OpenMRS, and 1198 for TrainTicket). There are other thresholds that vary less even if they are calculated with average values, e.g., msgTh varies between 1.40 and 1.58. Some other thresholds are instead fixed to values that are usually extracted from system requirements, e.g., memTh is fixed to 10% for all the systems. Table 7 reports the thresholds used for all commits of Hadoop (NodeManager and DataNode) and Cassandra, when systems are loaded with 1k clients for 12 minutes. Note that some commits do not provide the data needed to calculate the thresholds, e.g., 738d1a2 and ff8378e lack information for deriving callesTh related to the Blob antipattern, which disables its detection. Besides, the same threshold may vary differently based on the considered applications. For example, callersTh related to the Blob antipattern changes significantly when considering the two Cassandra commits (i.e., 82.35 for 88d2ac4 and 44.28 for f93e6e3) and shows slight variations when observing each Hadoop component separately. 
Instead, relevant variations of this threshold are observed also for Hadoop when the two components are compared, i.e., it varies in the range 9.56-11.95 for the NodeManager and in the range 2.04-2.91 for the DataNode. Threshold and offset values are given as input to JPAD, and these values can be easily modified in case system stakeholders express their own performance requirements.
Summarizing, our experimental setup includes the following inputs: (i) system workload specification, i.e., number of clients and duration of the testing, (ii) load test definition, i.e., which component(s) to test, and (iii) threshold and offset setup, i.e., numerical values for antipatterns' thresholds and offsets. Our experimentation investigates the variation of these inputs and how they affect the detection accuracy and results, see more details in Tables 8, 9, 10, 11, and 12, and Appendix C, available in the online supplemental material.
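The workload part of these inputs can be enumerated programmatically. The sketch below builds the C-D labels from the client counts and durations given above and shows one decomposition that reproduces the reported total of 300 configurations. The decomposition (five Table 4 systems, seven Hadoop commits profiled on two components, two Cassandra commits) is our reading of the setup and is an assumption; the helper name `configurations` is hypothetical.

```python
from itertools import product


def configurations(clients, durations):
    """Build the C-D label for every (clients, duration) combination."""
    return [f"{c}-{d}" for c, d in product(clients, durations)]


# Workloads stated in the text.
table4_configs = configurations([25, 50, 75, 100], [3, 6, 12])
table5_configs = configurations([1, 10, 100, 500, 1000], [3, 6, 12])

# Assumed decomposition of the analyzed runs (not stated explicitly in the text):
# 5 systems from Table 4; 7 Hadoop commits, each profiled on 2 components;
# 2 Cassandra commits.
total = (
    5 * len(table4_configs)
    + 7 * 2 * len(table5_configs)
    + 2 * len(table5_configs)
)
print(len(table4_configs), len(table5_configs), total)
```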

Experimental Results
This section presents experimental results answering our three research questions (see Section 4.1).

Effectiveness of the Antipatterns' Detection Rules
To answer RQ1, Table 8 shows the software performance antipattern instances detected by JPAD for each system, when varying the number of clients and the monitoring duration. The last column of Table 8 shows the detection time (in seconds), i.e., the time required by JPAD to complete the analysis. This way, we aim to analyze the scalability of the tool when exposed to medium- and large-sized systems.
We observe that the number of detected antipatterns may increase with the number of clients. For example, for the WebGoat case study, the number of detected antipatterns with 25 clients (i.e., 3, 6, 0, 5, 4, 0, 3) is smaller than (or equal to) the case with 100 clients (i.e., 4, 7, 2, 5, 4, 0, 4). The number of detected instances can also decrease for a larger number of clients, e.g., OpenMRS shows 2 instances of the CTH antipattern with 25 clients and 0 with 100 clients. However, we notice that this may generate further instances of different antipattern types; in fact, with 100 clients we get 3 Blob instances instead of 2 (observed with 25 clients). Similarly, CloudStore shows 7 instances of the EP antipattern and 8 instances of the Blob antipattern with 25 clients. With 100 clients instead we get 6 instances of the EP antipattern and 7 instances of the Blob antipattern, at the cost of 2 instances of the EDA antipattern (not observed with 25 clients). This may be due to performance issues showing up in (previously not critical) parts of the system, i.e., affecting different system elements only when the number of clients increases. As a result, a larger workload may produce more instances of other antipattern types.
A longer experiment duration might reduce the number of detected instances due to the performance problems flattening/elevating in different parts of the system depending on the application peculiarities. In our experiments, OpenMRS benefits from a longer run time, whereas all other systems (especially TeaStore) generally show a constant or higher number of antipattern instances. The number of antipattern instances detected when OpenMRS is run for 3 minutes with 25 clients (i.e., 11) decreases when the same application is observed for 12 minutes (i.e., 3). For all other systems, it is unlikely that the number of detected instances decreases when the experiment duration increases. In a few cases, the opposite trend is observed, e.g., there are 21 antipattern instances when WebGoat is run with 25 clients for 3 minutes and 25 instances when the same application is run for 12 minutes. A further example is represented by the CloudStore system, which shows 17 instances with 100 clients for 3 minutes and 19 instances when running for 12 minutes. From the collected results, no straightforward relationship between the number of clients and the number of detected antipatterns is observed. Similar observations are drawn when considering the experiment duration, since such duration is not linearly related to the number of detected antipattern instances, which confirms the complexity of the performance analysis of Java applications. Table 8 also reports the time required by JPAD to detect antipatterns, and we can notice that it varies between 0.4 and 282.5 seconds (i.e., 4.7 minutes). The average detection time observed over all systems and loads is 45.9 seconds (i.e., less than a minute), with OpenMRS and TeaStore taking the shortest and longest average detection time (i.e., 0.6 and 194.4 seconds, the latter corresponding to 3.2 minutes), respectively.
Since TeaStore and WebGoat show the longest time to complete the detection, we further inspect these systems. For TeaStore we notice that Blob, ToB, and EST take longer than other antipatterns to be analyzed. This is due to the large size of the call tree (i.e., a key property analyzed to detect these antipatterns, see Table 1) of the TeaStore system. When Blob, ToB, and EST are excluded from the analysis, JPAD takes only 1.2 seconds on average to analyze TeaStore. For WebGoat, we observe that the detection of the EST antipattern takes longer than the others, i.e., the detection takes 3.5 seconds on average when EST is excluded. We think this is due to the large number of methods that are invoked (namely the callees) in this application. In fact, these methods must be analyzed by the EST detection algorithm to check their coefficient of variation, i.e., whether the execution time follows a deterministic distribution, so as to spot inefficiencies in the usage of resources.
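The coefficient-of-variation check mentioned above can be illustrated as follows. This is a minimal sketch and not the EST detection algorithm itself: the function names, the cutoff `cv_threshold`, and the sample execution times are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Ratio of the (population) standard deviation to the mean."""
    avg = mean(samples)
    if avg == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(samples) / avg


def is_near_deterministic(samples, cv_threshold=0.1):
    """True when execution times barely vary; cv_threshold is a hypothetical cutoff."""
    return coefficient_of_variation(samples) <= cv_threshold


# Hypothetical execution times (in ms) of two callees.
steady = [10.0, 10.0, 10.0, 10.0]
bursty = [2.0, 30.0, 4.0, 44.0]
print(is_near_deterministic(steady), is_near_deterministic(bursty))
```

Since such a check has to be repeated for every callee, its cost grows with the number of invoked methods, which is consistent with the longer detection time observed for WebGoat.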
Inspired by [46], we build a ground truth to investigate the accuracy (along with precision, recall, and F1 score) of our detection rules. The ground truth is constituted of all the hotspot methods (i.e., an over-approximation of the detection outcome) since JPAD analyzes their design and performance characteristics to detect antipatterns. We exploit the variation in the system workload specification inputs to decide if a hotspot method is correctly detected as a source of performance problems. We determine that a hotspot method has performance issues if it is reported as a violation of antipatterns while analyzing different system configurations. If a hotspot method is detected as an antipattern instance in at least half (i.e., 6 out of 12) of the analyzed configurations, then such a method is considered a positive instance of that antipattern; otherwise it is classified as a negative instance. Let us consider as an example executeQuery(), which is a hotspot method in OpenMRS. It is detected as an EP antipattern in 10 (out of 12) analyzed configurations, hence such a method is considered a positive instance of EP in OpenMRS. Another hotspot method of OpenMRS is includeFragment(String, String), but it is detected as an EP antipattern in only 2 (out of 12) analyzed configurations. Consequently, this method is considered a negative instance.
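The labeling rule for the ground truth can be sketched as follows, using the two OpenMRS counts reported above. The function name `label_hotspots` and its parameters are hypothetical and only serve the illustration.

```python
def label_hotspots(detections, n_configs=12, min_share=0.5):
    """Label each hotspot method as a positive (True) or negative (False) instance.

    detections maps a method name to the number of configurations in which the
    method was reported as an instance of one given antipattern.
    """
    needed = n_configs * min_share
    return {method: count >= needed for method, count in detections.items()}


# Counts taken from the OpenMRS example in the text (EP antipattern).
ep_counts = {"executeQuery()": 10, "includeFragment(String, String)": 2}
labels = label_hotspots(ep_counts)
print(labels)
```

A method detected in exactly 6 of the 12 configurations falls on the positive side, in line with the "at least half" wording.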
The confusion matrix is built for each system configuration as follows. True positives (TP) are the methods for which an antipattern instance is detected by both the ground truth and JPAD. True negatives (TN) are those hotspot methods that are detected as specific antipattern instances by neither the ground truth nor JPAD. False positives (FP) are hotspot methods that are detected as antipatterns by JPAD, but are not in the ground truth. False negatives (FN) are methods that are not detected by JPAD but are in the ground truth. The confusion matrix for the analyzed system configurations in all considered applications (when detecting the Blob antipattern) is reported in Table 9. Confusion matrices derived for other antipatterns are omitted for the sake of space, but they are reported as part of the replication data [28]. Accuracy, precision, and recall are defined for the detection of each antipattern and calculated for the analyzed system configurations as follows: $A_{conf} = \frac{TP + TN}{TP + TN + FP + FN}$, $P_{conf} = \frac{TP}{TP + FP}$, and $R_{conf} = \frac{TP}{TP + FN}$, respectively. $P_{conf}$ and $R_{conf}$ are undefined when their denominators are equal to zero. Accuracy, precision, and recall (i.e., $A$, $P$, and $R$, respectively) are also derived for the considered applications by averaging the results calculated for each configuration ($M_{conf}$), i.e., $M = \frac{1}{\#conf} \sum_{conf} M_{conf}$, where $M \in \{A, P, R\}$ and $\#conf = 12$, i.e., the number of analyzed system configurations in our experimentation. The F1 score is defined as the harmonic mean of precision and recall, i.e., $F1 = \frac{2 \cdot P \cdot R}{P + R}$, and it is not computed in case $P$ or $R$ are undefined. Table 10 reports accuracy (A), precision (P), recall (R), and F1 score (F1) of the proposed detection rules. 
It is worth remarking that our calculations leverage the variation in system workload specification inputs since the ground truth (by construction) is an over-approximation and TP, TN, FP, and FN values take into account how these inputs affect the detection. The average accuracy (across the five systems) is well above 90% for all antipatterns. The average precision is also above 90%; the lowest value (i.e., 79%) is observed in OpenMRS for the Blob antipattern. The average recall is mostly larger than 90% except for WCS that shows a lower value (62%); the lowest recall (58%) is observed in TrainTicket for WCS. The F1 score shows the lowest value (i.e., 88%) for Blob in OpenMRS as a reflection of the previous result on precision, even if average values are larger than 90% for all antipatterns. In Table 10, the symbol "-" means that the metric is undefined.
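The metric definitions above can be sketched in a few lines. The confusion-matrix counts below are hypothetical and are not taken from Table 9. The text does not say how undefined per-configuration values enter the average; skipping them, as done here, is an assumption.

```python
def metrics_for_configuration(tp, tn, fp, fn):
    """Accuracy, precision, and recall for one configuration; None marks 'undefined'."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else None
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


def average(values):
    """Mean over the configurations where the metric is defined (assumed policy)."""
    defined = [v for v in values if v is not None]
    return sum(defined) / len(defined) if defined else None


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; None when it cannot be computed."""
    if precision is None or recall is None or precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


# Hypothetical (TP, TN, FP, FN) counts for two configurations of one system.
per_config = [metrics_for_configuration(3, 5, 1, 1),
              metrics_for_configuration(0, 10, 0, 0)]
A = average([m[0] for m in per_config])
P = average([m[1] for m in per_config])
R = average([m[2] for m in per_config])
F1 = f1_score(P, R)
print(A, P, R, F1)
```

In the second hypothetical configuration no positive instance exists and none is reported, so precision and recall are undefined there while accuracy is still well defined.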

RQ1: efficiency and accuracy
The proposed detection rules efficiently and accurately capture performance issues of medium- and large-sized systems. CTH, EP, and Blob are the antipattern types that occur in all the analyzed systems. EP shows the largest number of instances across all the system configurations. WCS is the antipattern type with the smallest number of instances. JPAD efficiently detects the instances of the presented software performance antipatterns; in fact, the system configurations are analyzed, on average, in less than a minute. TeaStore shows a longer detection time, but in the worst case it is less than 5 minutes. The F1 score, derived from precision and recall metrics, is larger than 85% in all the considered cases, denoting accurate detection rules.

Comparison With State-of-the-Art Approaches
The goal of this section is to investigate if JPAD is able to detect a variation in the number of antipattern instances when comparing the code commits that are known from [23] to induce and fix real-life performance issues, respectively. Table 11 reports the performance antipattern instances that have been found in Hadoop across 7 different code commits. JPAD takes 0.31 seconds on average to analyze these configurations, and it detects EP and Blob antipatterns only; other antipatterns are not captured. Table 11a focuses on the issue identified by YARN-4307 that is not predicted by state-of-the-art approaches [23], [45]. Interestingly, we can notice that the selected software component(s) show a different number of antipattern instances. The column reporting the fixing of the issue (i.e., 308d63f) always shows an equal or lower number of antipattern instances for both the considered NodeManager and DataNode components. For the NodeManager we can notice that there are some configurations (e.g., 100-3 and 100-6) where EP instances are not detected in the commit fixing the issue, at the cost of emerging Blob antipattern instances. There are some further configurations (e.g., 500-3 and 500-6) for the NodeManager where we notice no variation for EP, i.e., one instance is detected in all commits. Overall, commits inducing the issue (e914220 and 7af5d6b) show a total of 13 and 12 EP instances across all the analyzed configurations, respectively. The commit fixing the issue manifests fewer antipattern instances, i.e., 4 EP and 2 Blob instances. The DataNode component instead has a different number of EP and Blob instances. Specifically, JPAD detects a total of 14 EP instances summing up all configurations independently of the commit. 9 and 8 Blob instances are detected for commits inducing the issue (i.e., e914220 and 7af5d6b, respectively), and 4 Blob instances are found in the commit solving the issue. 
Table 11b reports the results for the YARN-7102 issue. The commit fixing the issue (i.e., ff8378e) shows a behavior similar to the previous case, i.e., there are some configurations (i.e., 10-12, 100-6, 1000-3) of NodeManager for which EP is solved at the cost of a new Blob instance. EP instances in the DataNode rarely change (i.e., only for 1-3 and 10-3), whereas Blob instances are reduced in eight configurations (e.g., 10-3 and 10-6). Similarly to the previous issue, the DataNode component shows more variations for Blob instances than for EP instances. Summing up all the analyzed configurations there are 13 EP instances and 9 Blob instances for the code commit inducing the issue (528b809), against 11 EP instances and 1 Blob instance for the code commit fixing the issue (ff8378e). Table 11c presents the results for the HDFS-12754 issue, and also here the number of antipattern instances is equal or lower when considering the commit fixing the issue, i.e., 738d1a2. Differently from previous cases, if EP instances are not detected then Blob instances do not arise in the NodeManager component. Overall, 11 and 4 EP instances across all configurations are observed for decf8a6 and 738d1a2, respectively. For the DataNode component, we get 1 Blob instance and 11 EP instances in the commit solving the issue, whereas in the commit inducing the issue we find 12 EP instances and 9 Blob instances. Summarizing, JPAD detects the variation across different commits when calculating the total number of detected antipattern instances. Overall, we can notice that detected instances significantly decrease when comparing code commits which induce and fix issues. Besides, the software component influences this trend: the NodeManager shows, on average, fewer EP and Blob instances when issues are fixed. Instead, for the DataNode, only Blob instances are observed to decrease after fixing the issue, whereas the number of EP instances varies only slightly. 
This is due to the nature of the analyzed issues; in fact, both YARN-7102 and HDFS-12754 are indicated in [23] as complicated performance issues (like deadlock), and EP captures that there is a large number of blocked threads (i.e., a symptom of a deadlock) leading to long execution time (see Table 1). Table 12 reports the performance antipattern instances that are found in Cassandra across two different code commits. JPAD takes 1 second on average to analyze these configurations, and it detects CTH, WCS, Blob, and EDA antipatterns; other antipatterns are not reported since no instances are detected. Between the two commits we can notice that all detected antipatterns show some decrease in their numbers when considering the commit fixing the issue, i.e., f93e6e3. Blob is the antipattern showing the largest number of instances; in fact, commit 88d2ac4 shows 42 instances summing up all configurations, whereas commit f93e6e3 includes 36 instances. About WCS, we can notice that there is one configuration only (i.e., 1000-3) showing a decrease of antipattern instances; no major variation is observed for this specific antipattern. Both CTH and EDA show a considerable variation; looking at the total number of detected instances across all analyzed configurations of 88d2ac4 and f93e6e3 commits, we get 29 and 21 (17 and 11) CTH (EDA) instances, respectively. Hence, JPAD effectively detects a remarkable difference across the analyzed code commits (inducing and fixing real-life performance issues).
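To put the Cassandra totals quoted above side by side, the relative decrease between the inducing commit (88d2ac4) and the fixing commit (f93e6e3) can be computed as below. The helper `relative_decrease` is hypothetical; the instance counts are the ones reported in the text, and the percentages are derived here rather than taken from the paper.

```python
def relative_decrease(before, after):
    """Percentage by which a count shrinks from 'before' to 'after'."""
    return (before - after) / before * 100


# Total instances across all configurations: (88d2ac4, f93e6e3).
totals = {"Blob": (42, 36), "CTH": (29, 21), "EDA": (17, 11)}
decreases = {name: relative_decrease(b, a) for name, (b, a) in totals.items()}
for name, pct in decreases.items():
    print(f"{name}: {pct:.1f}% fewer instances")
```

Although Blob has the largest absolute counts, CTH and EDA shrink more in relative terms, which matches the remark that they show a considerable variation.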
To answer RQ2, Table 13 summarizes the results of comparing JPAD with state-of-the-art approaches. The last four columns of this table indicate if the specified tool can detect the considered performance issue. The column named PADprof refers to the framework presented in [53], which we test by providing problematic snapshots (i.e., commits inducing the issue) and comparison snapshots (i.e., commits fixing the issue). All the analyzed snapshots show that no antipatterns are detected. Results for PerfJIT [23] and Perphecy [45] are instead extracted from [23] when investigating the detection of real-life performance issues.

RQ2: comparison with state-of-the-art
JPAD overcomes state-of-the-art approaches [23], [45], [53] in the detection of some real-life performance issues. The proposed detection rules effectively capture complex performance problems that are not recognized by available tools. This consolidates the adoption of our framework as an alternative approach to support software engineers in understanding performance issues in Java applications.

Implication of Applying Antipattern-Based Refactorings
To answer RQ3, we refactor OpenMRS (i.e., the largest application among those considered in Table 4) to understand if the detection information provided by JPAD can support software engineers in solving performance issues. The selection of OpenMRS as target system for investigating the refactoring is also motivated by a recent paper [46] that analyzes the same system to locate performance regression root causes. Specifically, we focus on the OwaFilter method that JPAD detects as a Blob-Controller instance. Such a method is responsible for filtering the requests directed to protected endpoints, i.e., access is granted for authenticated requests only. Due to the modular nature of the application, requests come from different modules and the filter must check the URL of all incoming requests before granting access to authenticated users and forwarding their requests. Listing 1 reports a code excerpt of the OwaFilter. We can notice that there are several requests to be managed, for example: getRequestURL() (see line 9), getServletPath() (see line 12), and getAdministrationService() (see line 14), to mention a few. There is indeed a match with the textual description of the antipattern (see Table 1), indicating that the Blob-Controller occurs in case of a single class performing all the work of an application.
As specified in the literature [26], when solving a Blob-Controller antipattern, the refactoring consists of moving computation from the affected instance to a different one.
We delegate the verification of URLs to a centralized authentication system that forwards requests to the correct endpoint after the authentication process is completed. This way, the OwaFilter method must only check that users are authenticated. After refactoring OpenMRS, we evaluate its performance under all loads and compare the obtained results with those observed from the original OpenMRS version. To quantify performance improvements, we consider these metrics of interest: (i) the number of detected software performance antipatterns, (ii) the CPU utilization, and (iii) the system response time. It is worth remarking that our focus is on showing empirical evidence of the benefit of solving antipatterns, i.e., possible performance improvements that can be derived by detecting and removing antipatterns, and this is why we do not investigate further refactoring types or solutions. Table 14 reports the antipattern instances detected for the refactored system. Overall, compared to the original system, the number of instances mostly decreases; see the Diff. column, where negative numbers indicate that the number of instances has decreased after the refactoring. For example, with 75 clients and 3 minutes of load tests running, JPAD detects 1 CTH, 2 EP, 1 WCS, 3 Blob, and 1 EDA instances in the original OpenMRS system. When applying the antipattern-based refactoring, we remove 1 instance of CTH, EP, WCS, and EDA, and 2 Blob instances. A similar improvement (see -2 entries in Table 14) is observed for CTH and WCS when there are 25 clients and the load test runs for 3 minutes. However, it is worth noting that some configurations (i.e., 25-6, 25-12, and 100-3) show more antipattern instances in the refactored system. For instance, for the configuration 25-3, the number of Blob instances increases in the refactored case. 
After further investigation, we find that, interestingly, this is due to the introduction of a Blob-DataContainer instance while solving the Blob-Controller antipattern. However, this is the only case for which Blob instances increase. Generally, the number of Blob instances is constant, and it decreases with 75 or 100 clients and 3 minutes of load tests running. The increment of antipattern instances (see +1 entries in Table 14) is observed for EP in two configurations and for CTH, WCS, and Blob in only one configuration. The number of EDA instances does not increase in any of the considered configurations.
Listing 1. Code excerpt of the OwaFilter method detected by JPAD as Blob-Controller antipattern instance.
The impact of the antipattern-based refactoring is also observed on performance indices of interest, i.e., CPU utilization and system response time of the OpenMRS Java application; results are shown in Table 15. The two performance indices are reported for both the original and the refactored OpenMRS system. The Variation column shows the observed performance change and is computed as: $Variation = \frac{Original - Refactored}{Original} \cdot 100$. When Variation > 0, the considered index is smaller for the refactored system than for the original one, meaning that the system performance has improved. Table 15 highlights with bold entries all those for which Variation is larger than 10%. The relevance of antipattern-based refactoring is shown by the general enhancement of the OpenMRS application performance (up to 50.47%, observed for the system response time). The configuration with 25 clients and 3 minutes of load tests running is the only exception to this observation. In this case, the response time of the refactored system is 25% longer than the one of the original system even if the CPU utilization is lower for the refactored system. This may be due to performance issues that are generated in different (and previously not critical) parts of the system. We already observed that the 25-3 configuration introduces a Blob-DataContainer instance when solving the Blob-Controller antipattern. This might be the reason for the longer response time. As future work, we plan to further investigate the solution of antipatterns and possible implications in generating new instances.
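The sign convention of the Variation index can be checked with a small sketch. The input values below are hypothetical response times and are not taken from Table 15; the function name `variation` is ours.

```python
def variation(original, refactored):
    """Percentage change of an index; positive means the refactored value is smaller."""
    if original == 0:
        raise ValueError("Variation is undefined when the original value is zero")
    return (original - refactored) / original * 100


# Hypothetical response times in milliseconds.
improved = variation(200.0, 150.0)   # refactored system is faster
worsened = variation(200.0, 250.0)   # refactored system is slower
print(improved, worsened)
```

A refactored response time that is 25% longer than the original one, as in the 25-3 configuration, therefore corresponds to a negative Variation.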

RQ3: antipattern-based refactoring
Antipattern-based refactoring guarantees in advance neither a reduction of the total number of detected instances nor an improvement in the system performance. However, our experimentation shows that usually fewer antipattern instances are detected, and most performance indicators of interest improve. By refactoring OpenMRS, we find empirical evidence of the benefit of solving one antipattern: 12 fewer antipattern instances are detected; on average (across all the analyzed configurations), the CPU utilization is 18.90% lower and the system response time is 23.81% shorter. Maximum improvements for CPU utilization and system response time are 47.25% and 50.47%, respectively.

THREATS TO VALIDITY
Besides inheriting all the limitations related to the performance evaluation of Java-based applications [2], our approach exhibits the following main (construct, conclusion, internal, and external) threats to validity [75].
Construct Threats. This type of threat is observed when metrics deviate from the focus of the investigation. To smooth it, we provide a quantitative evaluation of the approach motivated by the research questions. We show that (i) detection rules work on real-world case studies, (ii) real-life performance issues are captured, and (iii) solving one antipattern instance improves the system performance.
Conclusion Threats. A threat of this type is related to the reliability of collected measures. To smooth these threats, we run all experiments on the same machine. Moreover, the profiling of the Java applications under analysis is delegated to the YourKit Java Profiler, a well-assessed and widely used tool for this scope [76].
Internal Threats. We thoroughly test JPAD to spot errors in its implementation. For each experiment (whose setup can be easily changed by users), when an antipattern instance is detected, we verify if thresholds are violated. We recall that JPAD is publicly available [28] for inspection and to replicate experiments of this paper.
External Threats. We are aware that findings from our experiments may not transfer to different Java applications. To increase the external validity, we select software systems from different domains whose class number ranges from 68 to more than 1k. We also inspect code commits related to four real-life performance issues that are considered rather complex to be predicted by state-of-the-art approaches [23]. This way, we evaluate our approach against diverse applications so that our results may generalize to other case studies.

DISCUSSION
In this section, we discuss limitations of our approach that we consider as open issues paving the way for future research investigations.
Soundness and Completeness. Our approach currently detects seven software performance antipatterns, evaluated on five Java applications belonging to different domains and on nine specific commits of two further subject systems used in prior research to extract performance data [77]. Even if we demonstrate that our approach is able to recognize some performance issues that are not detected by other approaches in the literature (see Table 13), soundness and completeness are not guaranteed. To partially address this issue, a preliminary investigation is conducted by experimenting with (i) a set of four commits known to fix performance issues (see Tables 11 and 12) and (ii) an antipattern-based refactoring along with the consequent performance variation on utilization and response time indices (see Table 15). As future work we plan to strengthen this investigation by involving practitioners in the evaluation of JPAD.
Antipattern Specification. Detection algorithms reflect our interpretation of the textual description of software performance antipatterns provided in [26]. We are aware that further interpretations can be provided by different stakeholders (e.g., practitioners), and we leave as part of our future work the possibility of customizing detection rules and providing a flexible framework that reflects multiple interpretations. More in general, we plan to introduce a domain-specific language for software performance antipatterns as support for users who may define their own detection rules. This way, we aim to strengthen the specification of antipatterns and to collect the experience of different stakeholders, possibly even discovering new antipatterns.
Profiling Overhead. The performance monitoring of Java applications is known to generate overhead [54]; a comparison of different profiling tools and their overhead is presented in [78], [79]. In this paper, we use YourKit since both academia [53], [57] and industrial partners, such as Apple and Google, employ it to evaluate the performance of industrial and real-world applications. To partially cope with the overhead introduced by YourKit, all our detection algorithms include at least one threshold derived from offsets and average values. Offsets are independent of the absolute value of the considered metrics, and they allow specifying thresholds based on values that already include the profiling overhead. This way, JPAD compares performance metrics and thresholds that are both affected by the profiling overhead. We leave as future work the investigation of other monitoring tools to compare (and possibly smooth) the profiling overhead.
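The reasoning above can be illustrated with a minimal Java sketch. It is not JPAD's implementation: the class name OffsetThreshold, the formula threshold = average x (1 + offset), and the assumption that profiling overhead inflates all measurements by a common factor are hypothetical choices made only for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch (not JPAD code): a threshold derived from the average of
// the profiled values and a relative offset.
final class OffsetThreshold {

    // Assumed formula: threshold = average * (1 + offset).
    static double threshold(List<Double> profiledValues, double offset) {
        double average = profiledValues.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        return average * (1.0 + offset);
    }

    // A value is flagged when it exceeds the threshold computed from the same profiled data.
    static boolean exceeds(double value, List<Double> profiledValues, double offset) {
        return value > threshold(profiledValues, offset);
    }

    public static void main(String[] args) {
        // Illustrative execution times in milliseconds (made-up numbers).
        List<Double> measured = List.of(10.0, 20.0, 30.0, 100.0);
        // Assumption: the profiler inflates every measurement by the same factor.
        double overheadFactor = 1.3;
        List<Double> inflated = measured.stream()
                .map(v -> v * overheadFactor)
                .collect(Collectors.toList());

        double offset = 0.5; // flag values more than 50% above the average
        for (int i = 0; i < measured.size(); i++) {
            boolean flaggedRaw = exceeds(measured.get(i), measured, offset);
            boolean flaggedInflated = exceeds(inflated.get(i), inflated, offset);
            System.out.println(measured.get(i) + " ms: flagged=" + flaggedRaw
                    + ", flagged with overhead=" + flaggedInflated);
        }
    }
}
```

Under the stated assumption of a uniform overhead factor, both the metric and the threshold scale together, so the same values are flagged with and without the overhead. Real profiling overhead is rarely uniform across methods, which is why the text speaks of only partially coping with it.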
Antipattern Thresholds. As argued in [80], thresholds must be set in software performance antipatterns to express performance requirements (when available), or to establish boundaries that represent the perception of different system stakeholders. In fact, users may judge the importance of performance requirements differently, e.g., the hardware utilization may be associated with monetary costs and be more relevant for system administrators, whereas the execution time of a service is taken into account mainly by software developers. Therefore, JPAD provides the possibility to specify such thresholds, and this task is intentionally transferred to users, who can decide which numerical values are more suitable for their purposes. We leave as future work the exploration of further strategies, possibly synthesizing the needs of different stakeholders.
Software Performance Testing. Test cases are often very important for an effective dynamic analysis [81]. Our approach delegates the test design to software engineers, who may focus on general requirements and miss the ones that are relevant from a performance-based perspective. Our experimentation highlights the importance of designing test suites (see Table 11), and demonstrates that such a selection can be guided by a preliminary analysis of the CPU utilization of software components. However, as future work we plan to investigate whether approaches in the literature dealing with an efficient design of performance tests [82], [83] can be integrated into JPAD.
Antipattern-Based Refactorings. Refactoring based on antipatterns is a very complex activity, and it is not guaranteed that the number of detected antipatterns decreases or the system performance improves. Our experimentation shows that solving one antipattern may generate other antipattern instances; more generally, antipattern instances can increase and the system response time can worsen (see Tables 14 and 15). Besides, the complexity is exacerbated by the possibly large number of detected antipattern instances, each matching multiple code refactorings, and it is very difficult to understand which changes should be prioritized. In our previous work [33] we proposed a ranking methodology for the evaluation of architectural alternatives. We leave as future work the experimentation of ranking strategies on code refactorings to better investigate this aspect. Large systems may show the additional difficulty of being more sensitive to the impact of code changes, probably due to dependencies (among components) that need to be propagated when implementing refactorings. Antipattern-based solutions might be enriched with information about their effect (e.g., the involvement of dependent components) to identify which subsystems are involved in the refactoring process and may trigger new antipatterns. To automatically fix the detected antipatterns, it is necessary that code refactorings undergo a verification process that guarantees their functional correctness.
TABLE 15 Performance Variation (%) Obtained by Applying the Antipattern-Based Refactoring to OpenMRS
Guidelines for Developers. When adopting JPAD in practice, we encourage developers to consider two dimensions of the results they get as output. First, one can determine that a hotspot method shows performance issues if it is reported as a violation of some antipatterns. Second, when considering the different system configurations, the presence of the very same hotspot method across many system configurations contributes to the decision that such a method is indeed relevant for the performance issues under analysis. Both cases may indicate that such a hotspot method includes several design flaws and indeed contributes to poor system performance.
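The second dimension can be sketched in a few lines of Java. The sketch assumes that the detection output of each system configuration has already been reduced to a set of flagged method names; the class HotspotRecurrence, its methods, and the example method names are hypothetical and do not correspond to JPAD's report format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch (not JPAD code): count in how many system configurations
// each hotspot method is reported as violating at least one antipattern.
final class HotspotRecurrence {

    // One set of flagged method names per analyzed configuration (assumed input shape).
    static Map<String, Integer> countConfigurations(List<Set<String>> flaggedPerConfiguration) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> flagged : flaggedPerConfiguration) {
            for (String method : flagged) {
                counts.merge(method, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Methods flagged in at least minConfigurations configurations, in alphabetical order.
    static List<String> recurring(Map<String, Integer> counts, int minConfigurations) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minConfigurations) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up method names and configurations, for illustration only.
        List<Set<String>> flagged = List.of(
                Set.of("OrderService.load", "Cache.refresh"),
                Set.of("OrderService.load"),
                Set.of("OrderService.load", "Report.render"));
        Map<String, Integer> counts = countConfigurations(flagged);
        System.out.println(counts);
        System.out.println(recurring(counts, 2));
    }
}
```

The cut-off passed to recurring is a judgment call left to the developer, in the same spirit as the user-defined antipattern thresholds discussed above.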

CONCLUSION AND FUTURE WORK
In this paper we present JPAD, a tool-based approach to automatically detect software performance antipatterns in Java applications. The experimentation is performed on real-world Java applications from different domains, and JPAD captures four real-life performance issues that are not predicted by state-of-the-art approaches [23], [45], [53]. Results show the efficiency and accuracy of the proposed approach. The antipattern detection is executed on 300 configurations, and we exploit such extensive experimentation to build a ground truth and quantify JPAD accuracy. Overall, the accuracy is larger than 95% and the F1 score, derived from precision and recall metrics, is larger than 85% in the considered cases, which indicates that the detection rules are accurate. As for efficiency, system configurations are analyzed, on average, in less than a minute; some configurations require more time, but JPAD always takes less than 5 minutes to complete the detection of antipatterns. Besides, the number of detected antipattern instances varies substantially across code commits known to induce and fix real-life performance issues. Antipattern-based refactoring turns out to be beneficial: the system performance improves by up to 47% and 50% when measuring two specific metrics of interest, i.e., CPU utilization and system response time, respectively. JPAD points out system characteristics (e.g., the number of times a method is invoked) that lead to performance issues, and its report includes quantitative information. This way, we aim to support software engineers in deciding which methods require more attention than others from a performance-based perspective.
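For reference, the accuracy and F1 figures mentioned above follow the standard definitions over true/false positives and negatives. The Java sketch below restates those definitions; the class DetectionMetrics and the counts used in main are illustrative assumptions and are not taken from the experimentation.

```java
// Standard classification metrics computed from a confusion matrix.
// The class name and the example counts are hypothetical.
final class DetectionMetrics {

    // Fraction of all decisions (flagged or not flagged) that match the ground truth.
    static double accuracy(int tp, int fp, int tn, int fn) {
        int total = tp + fp + tn + fn;
        return total == 0 ? 0.0 : (double) (tp + tn) / total;
    }

    // Fraction of reported antipattern instances that are true instances.
    static double precision(int tp, int fp) {
        return (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Fraction of true antipattern instances that are reported.
    static double recall(int tp, int fn) {
        return (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Made-up counts: 8 true positives, 2 false positives,
        // 88 true negatives, 2 false negatives.
        int tp = 8, fp = 2, tn = 88, fn = 2;
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.println("accuracy  = " + accuracy(tp, fp, tn, fn));
        System.out.println("precision = " + p);
        System.out.println("recall    = " + r);
        System.out.println("F1        = " + f1(p, r));
    }
}
```

With these made-up counts, true negatives dominate, so accuracy is higher than F1. This is a common pattern when most candidate methods are not antipattern instances, and it is one reason to report both metrics.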
Several directions have been identified for future research. First, we want to extend the specification of antipatterns and make them flexible, i.e., users can add and modify detection rules to provide their own interpretation of antipatterns, possibly by introducing a domain-specific language. Second, we plan to extend the set of analyzed systems, possibly including case studies from the industrial domain to further assess both efficiency and accuracy. Third, we plan to extend JPAD to point out possible directions for antipattern-based refactorings, while the actual implementation of code fixes is delegated to software engineers, who can ensure that the business logic of applications is preserved. Moreover, the solution process is complex because the number of detected antipatterns may be large, as demonstrated in this paper, and it is difficult to select which antipattern to solve first. Hence, we want to investigate the concurrent (or prioritized) resolution of multiple antipatterns; this may lead to inconsistencies due to conflicting solutions, for which ad-hoc methodologies need to be defined.