Machine Learning for Software Engineering: A Systematic Mapping

Context: The software development industry is rapidly adopting machine learning for transforming modern-day software systems into highly intelligent and self-learning systems. However, the full potential of machine learning for improving the software engineering life cycle itself is yet to be explored, i.e., to what extent machine learning can help reduce the effort and complexity of software engineering and improve the quality of the resulting software systems. To date, no comprehensive study exists that explores the current state-of-the-art on the adoption of machine learning across software engineering life cycle stages. Objective: This article addresses the aforementioned problem and aims to present the state-of-the-art on the growing use of machine learning in software engineering. Method: We conduct a systematic mapping study on applications of machine learning to software engineering, following the standard guidelines and principles of empirical software engineering. Results: This study introduces a machine learning for software engineering (MLSE) taxonomy that classifies state-of-the-art machine learning techniques according to their applicability to various software engineering life cycle stages. Overall, 227 articles were rigorously selected and analyzed as a result of this study. Conclusion: From the selected articles, we explore a variety of aspects that should help academics and practitioners alike understand the potential of adopting machine learning techniques during software engineering projects.


Introduction
The software engineering (SE) industry is always looking for better and more efficient ways of building higher-quality software systems. However, in practice, the strong emphasis on time to market tends to push aside many well-known SE practices. That is, practitioners often focus more on programming than on requirements gathering, planning, specification, architecture, design and documentation [128], all of which are known to greatly benefit the cost effectiveness and quality of software systems. Lack of human resources is often cited as the main reason for doing so. Herein lies the great potential of machine learning (ML), since its algorithms have proven to be well suited to problem domains that aim to replicate human behavior. Hence, it stands to reason that human-centric SE activities should also benefit from ML [78].
The growing demand for agility and the ability to solve complex problems in SE has already led researchers to explore the potential of ML in this field. To date, ML has many demonstrated benefits in SE. Applications of ML for SE range from resolving ambiguous requirements to predicting software defects [235]. For example, Sultanov et al. [203] used reinforcement learning (a type of ML) to understand the relationships among software requirements at different levels of abstraction. Their approach shows how ML can automatically generate traceability links between high-level and low-level requirements. However, ML is not a single technique but rather an assortment of techniques. The challenge of using ML for SE is thus not only finding the right way of modeling the problem but also comparing various ML techniques and their potential. For example, several researchers have explored software project prediction in order to better estimate the time to market of their software products. For this purpose, various ML techniques were used and compared, e.g., artificial neural networks (ANN), rule induction (RI), case-based reasoning, support vector machines (SVM), and regression-based trees [31, 50, 193].
In many areas of science and engineering, such as image recognition or autonomous driving, ML has already revolutionized development. The application of ML to SE is also increasing in significance, which is evident from the exponential growth in the number of articles on ML for SE published every year. Consequently, it is of interest to understand which SE life cycle stages benefit the most from this trend, or even which ML techniques are most suitable for which SE life cycle stage(s). This is the motivation for conducting this systematic mapping study. The study provides a bird's-eye view of the current state-of-the-art of the field and suggests open areas of research where more primary studies are needed. It also provides a classification scheme in the form of an MLSE (machine learning for software engineering) taxonomy highlighting the key areas of SE where ML has proven to be promising. In terms of scope, it is important to note that we are not interested in the application of ML in software projects in general (this is already well established) but in the application of ML in support of SE life cycle stages, e.g., requirements engineering, specification, analysis, design, testing, or maintenance. While this article presents a first, comprehensive study on the general use of ML for SE, it should be noted that some specialized studies already exist, e.g., ML for automated software testing [58].
The rest of the article is organized as follows. Section 2 explains the research methodology and protocol followed in the study. Results of the study are discussed in Section 3. Finally, the threats to validity and the conclusion are addressed in Sections 4 and 5, respectively.

Research Methodology
This section describes how we obtained articles for our study, the key research questions, and how we systematically addressed them. We obtained the most relevant articles by employing an appropriate search strategy, formulating insightful goals and research questions, and devising a strong data extraction process. For this purpose, we have followed the research methodology described below, which is based on the updated guidelines provided by Petersen et al. [158] for the research protocol and the creation of the classification scheme. The guidelines represent the basic principles of conducting systematic mapping studies in the domain of SE. We used Mendeley as our primary article management tool in this study. The timeline of this study runs from the start of 1991 (the oldest relevant article we could find in the search was from 1991) to the end of 2019 (we started writing this article at the start of 2020).

Goals, Questions and Metrics
It stands to reason that a systematic study is always directed and kept on track by following a strict research protocol in order to improve the quality and impact of the study. To achieve this, we followed the Goal, Question and Metric (GQM) paradigm suggested by Basili et al. [24]. The aim was to guide the study by specifying its goals, formulating its research questions and identifying potential metrics in order to have a systematic data extraction process. The metrics are later used as attributes (keywords) in the data extraction process (described in Step 6 of Section 2.2). In the following, we summarize the goals, research questions and metrics (underlined) of the study. The first three goals lead to the research questions discussed in the following subsection. Due to the descriptive and elaborative nature of the fourth goal, we decided to thoroughly discuss it in Section 3.

Research Protocol
A research protocol is essential to conduct an independent, objective study. It regulates the flow of research and maximizes the meaningful outcomes from the study. For this purpose, we have designed a research protocol that describes the elements of the study and is illustrated in Fig. 1. Following are the main steps of the research protocol.
1. Search query formulation: Our search query uses a two-element PICO search as advised by Petersen et al. [157]. The two elements of the PICO framework are the following: Problem 'P': (requirement, specification, design, model, analysis, architecture, implementation, code, test, verification, validation, maintenance) and Intervention 'I': (ML, deep learning). We have not considered Comparison 'C' and Outcome 'O' in order to limit the search spectrum, given the broad scope of the study.
The search query was formulated in an iterative fashion in order to ensure the highest evidence-based retrieval of articles. The query was applied to titles and abstracts of articles in five well-known digital repositories: IEEEXplore, ACM Digital Library, ScienceDirect, Springer and Web of Science. The search yielded a total of 406 articles. The search string used in all repositories was: ("machine learning" OR "deep learning") AND software AND requirement* OR specification* OR design* OR model* OR analysis OR architecture OR implementation OR code OR test* OR verification OR validation OR maintenance. All repositories, except Springer, returned the number of articles shown in Fig. 1 corresponding to the search query applied only to titles. Springer initially yielded 4502 articles as a result of the query; however, most of these articles were quite irrelevant to the scope of our study, even after applying filters such as "Computer Science" as discipline and "SE" and "Artificial Intelligence" as sub-disciplines to reduce the search space. The first author then went through the titles of the articles (and the abstracts, if the goal of an article was unclear from its title) and stopped the search process when the first page consisting entirely of irrelevant articles was reached. This resulted in 44 articles.
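The title-and-abstract screening implied by this query can be sketched as a simple boolean filter. The snippet below is an illustrative sketch only, not part of the study's tooling: the helper names and the sample titles are our own, and we interpret the string's operator precedence according to the PICO framing above (an intervention term AND "software" AND at least one problem term).

```python
import re

# Intervention terms (PICO 'I') and problem terms (PICO 'P') from the query.
INTERVENTION = ("machine learning", "deep learning")
# A trailing '*' marks a wildcard, e.g. requirement* matches "requirements".
PROBLEM = ("requirement*", "specification*", "design*", "model*", "analysis",
           "architecture", "implementation", "code", "test*", "verification",
           "validation", "maintenance")

def term_to_regex(term):
    """Translate a search term with an optional trailing wildcard to a regex."""
    if term.endswith("*"):
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)

def matches_query(text):
    """("machine learning" OR "deep learning") AND software AND (P1 OR P2 OR ...)"""
    has_intervention = any(term_to_regex(t).search(text) for t in INTERVENTION)
    has_software = term_to_regex("software").search(text) is not None
    has_problem = any(term_to_regex(t).search(text) for t in PROBLEM)
    return has_intervention and has_software and has_problem

# Hypothetical title for illustration only.
print(matches_query("Deep learning for software defect prediction models"))  # True
```

A title lacking an intervention term (e.g. a survey of agile practices) would be rejected by the same filter, which mirrors why Comparison and Outcome terms were omitted: every retained hit must pair an ML term with an SE life cycle term.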

2. Removal of duplicates: In Step 2, we removed the duplicate articles from the database. After the removal of 60 duplicates, the remaining pool was left with a tally of 346 articles.

3. Quality assessment process (QAP): In Step 3, the articles underwent the defined quality assessment process in order to maximize the overall authenticity and quality of the study. The quality assessment process consists of a multi-filtration method based on the guidelines provided by Kitchenham et al. [101]. In this method, random, equal-sized sets of articles are distributed among the participants of a study in order to mitigate any bias. The method comprises a four-question checklist, where each question is answered using a defined scale as described in Table 1. The sum of the scores for all questions can vary from 3 to 10, with 10 being the highest quality. All participants of this study evaluated their particular set of articles by rating each article against the questions of the checklist. The resultant scores were then accumulated and used in the following exclusion/inclusion criteria.

4. Applying exclusion/inclusion criteria: In Step 4, we applied exclusion and inclusion criteria to the pool of articles in order to further refine its quality. This process yielded 222 articles.

Exclusion Criteria: Articles were excluded based on the following criteria.
1. Articles that were not relevant to the scope of the study, i.e., not addressing the context of applications of ML for SE (negating Q1 in the QAP checklist), were excluded
2. Articles that were not available in full text format were excluded
3. Articles demonstrating poor empirical soundness, i.e., a score lower than 5 (refer to the QAP in Step 3), were excluded
Inclusion Criteria: Articles were then selected based on the following inclusion criteria.
1. Articles of more than a single page were included
2. Articles assigned a minimum score of 5 out of 10 in the QAP were included
3. Articles that were peer reviewed were included
4. Articles that were entirely written in English were included
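For illustration, the combined QAP threshold and exclusion/inclusion screening can be expressed as a single predicate. This is a sketch under our own assumptions: the record fields (`qap_score`, `in_scope`, etc.) and the helper are hypothetical, as the study's actual bookkeeping was done in spreadsheets.

```python
def passes_selection(article):
    """Apply the Step 3-4 exclusion/inclusion criteria to one screened article.

    `article` is a dict with (hypothetical) fields:
      qap_score  - sum of the four QAP checklist answers (3..10)
      in_scope   - True if the article addresses ML for SE (QAP Q1)
      full_text  - True if the full text is available
      pages      - page count
      peer_reviewed, in_english - booleans
    """
    if not article["in_scope"]:      # exclusion 1: out of scope
        return False
    if not article["full_text"]:     # exclusion 2: no full text
        return False
    if article["qap_score"] < 5:     # exclusion 3 / inclusion 2: QAP threshold
        return False
    if article["pages"] <= 1:        # inclusion 1: more than a single page
        return False
    return article["peer_reviewed"] and article["in_english"]  # inclusions 3-4

candidates = [
    {"in_scope": True, "full_text": True, "qap_score": 7, "pages": 12,
     "peer_reviewed": True, "in_english": True},
    {"in_scope": True, "full_text": True, "qap_score": 4, "pages": 8,
     "peer_reviewed": True, "in_english": True},   # fails the QAP threshold
]
selected = [a for a in candidates if passes_selection(a)]
print(len(selected))  # 1
```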

5. Backward snowballing process: In Step 5, we applied backward snowballing [224] (further searches based on the references in the existing articles of the pool) in order to ensure a broad spectrum of articles relevant to the scope. The process yielded five additional articles, suggesting that the initial search and the exclusion/inclusion criteria covered the scope of our study well. The tally now stands at 227 articles.
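Backward snowballing can be sketched as a breadth-first traversal of the citation graph, with the manual relevance screening stood in for by a predicate. The citation graph and the predicate below are hypothetical and for illustration only; they are not the study's actual data.

```python
from collections import deque

def backward_snowball(seed_ids, references, is_relevant):
    """Backward snowballing: follow the reference lists of already-selected
    articles and add any newly discovered relevant ones to the pool.

    references  - dict: article id -> list of ids it cites (its bibliography)
    is_relevant - predicate standing in for the manual screening step
    """
    pool = set(seed_ids)
    queue = deque(seed_ids)
    while queue:
        article = queue.popleft()
        for cited in references.get(article, []):
            if cited not in pool and is_relevant(cited):
                pool.add(cited)
                queue.append(cited)   # its references are examined in turn
    return pool

# Hypothetical citation graph: A cites B and C, B cites D.
refs = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
result = backward_snowball({"A"}, refs, is_relevant=lambda a: a != "C")
print(sorted(result))  # ['A', 'B', 'D']
```

Because newly added articles are themselves queued, the process naturally terminates once a pass yields no further relevant references, which is how the small yield of five articles signals saturation of the search.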

6. Attribute extraction: In Step 6, the first author of this study went through the abstracts and derived the main attributes from each article. If the discussion in an abstract was not conclusive, the author investigated the conclusion or even the full text of the article. Once the attributes were extracted, the authors established an initial set of categories, which was refined iteratively and then generalized in order to broadly cover the research area. The generalized attributes, along with article references, are maintained in MS Excel sheets referred to as the collection in this study.

7. Classification scheme: In Step 7, we defined a classification scheme to ensure accurate assessment of attributes. The generalized attributes were then sorted by the participants of the study based on the knowledge areas provided in SWEBOK [198]. During the article sorting process, certain articles were found to be equivocal. In such cases, we associated with those articles the attributes that received majority votes from the participants of this study. Please note that the knowledge areas mentioned in SWEBOK were not strictly used in the categorization but merely employed as a defining factor to provide a high-level abstraction of the attributes representing the set of articles. For a better understanding, a graphical representation of the workflow from the attribute extraction process to the classification scheme is shown in Fig. 2.
8. Systematic map: The construction of the map comprised a series of discussions among the participants of this study, which led to the careful association of the facets with the high-level attributes of the articles. For a better understanding of the systematic map, we describe its main facets in the following.

SE Stage Facet:
The SE Stage Facet comprises attributes on a higher level of abstraction, showing the partial relevance between the knowledge areas of SWEBOK [198] and the extracted attributes.

Contribution Facet:
The Contribution types, such as tools, approaches, or algorithms, are derived from the articles in a fashion similar to the one described in [157, 22] and supplemented by our own perspective on the obtained set of articles.

Research Facet:
The Research types, such as evaluations and solutions, are derived from the work of Wieringa et al. [223], where the knowledge type refers to articles expressing the experiences and opinions of the researchers. Fig. 3 shows the resultant systematic map.

Map Evaluation
This section evaluates the systematic map by addressing the research questions discussed in Section 2.1.2. For a better understanding, the questions are answered in turn.
Q1.1 SE life cycle stages: This question relates to our classification scheme, which is partially based on the knowledge areas involved in traditional SE as mentioned in SWEBOK [198].
The SE stages and the articles that fall into each stage are shown in Fig. 4. 119 out of 227 (52%) articles belong to quality assurance and analytics. 39 out of 227 (17%) articles focus on architecture and design. 21 out of 227 (9%) articles each address the implementation and requirements engineering stages. 9 (4%) articles focus on the maintenance phase. The rest of the articles do not focus on any particular stage but are generally applicable to SE.
Q1.2 Applications of ML for SE: To address this question, we developed a taxonomy based on the identified applications of ML for SE in order to characterize the obtained articles into appropriate categories. We named the taxonomy MLSE (machine learning for software engineering). The taxonomy was devised following the principles mentioned in [51, 212]. As aforementioned, we consulted the knowledge areas in SE from SWEBOK [198] and envisioned a hierarchical classification structure for the taxonomy. Each participant of the study analyzed the applications in their assigned set of articles and aggregated them based on similarities, as described in Step 7 of Section 2.2. Subsequently, we organized the applications of ML for SE as sub-branches belonging to five life cycle stages of SE (knowledge areas). The applications of ML for SE that come under the corresponding SE life cycle stages, along with the number of articles, are briefly explained below. Table 2 shows the corresponding articles with respect to the classification proposed as the MLSE taxonomy shown in Fig. 5. Following is a brief description of the elements of the MLSE taxonomy: The Requirements stage comprises three categories.
• Requirements Modeling and Analysis (9 (4%) articles): Requirements Modeling and Analysis contains articles that focus on distinguishing ambiguous requirements, resolving incompleteness, correctness of requirements, etc.
• Requirements Traceability (6 (3%) articles): Requirements Traceability contains articles that refer to ML approaches that assist in linking requirements to code or other artifacts.
The Architecture and Design stage consists of three categories.
• Design Modeling (15 (7%) articles): Design Modeling comprises articles in which software process/service recommendation models have been proposed in order to help project managers select the most suitable process model for their projects. Apart from this, model smells and re-factoring techniques for object-oriented structures using ML have also been proposed in these articles.
• Design Pattern Prediction (4 (2%) articles): Design Pattern Prediction comprises articles that primarily focus on recognizing design patterns in software through source code or user interface layout using ML techniques.
• Development Effort Estimation (20 (9%) articles): Development Effort Estimation refers to the effort estimation of software projects using ML techniques.
The Implementation stage has four categories.
• Code Clone/Localization/Re-factoring/Labelling (8 (3%) articles): This category comprises articles that aim at finding code clones, locating specific pieces of code in software, re-factoring code, or labelling code with the help of ML.
• Code/Bad Smell Detection (3 (1%) articles): Code/Bad Smell Detection contains articles that focus on applying ML in order to detect code and bad smells in software source code.
• Code Inspection/Analysis (5 (2%) articles): Code Inspection/Analysis contains articles in which a ML technique is employed for the purpose of code reviews.
• Code/Program Similarity (5 (2%) articles): The Code/Program Similarity category refers to articles that identify specific pieces of code that are similar between two or more software projects. Additionally, such articles distinguish between original and pirated/cracked software.
The Quality Assurance and Analytics stage has nine categories.
• Software Analysis (10 (4%) articles), Technique Assessment (5 (2%) articles), Software Process Assessment (3 (1%) articles): These categories contain articles that fall under the assessment and analysis of software, techniques, and software processes using existing ML techniques.
• Verification and Validation (16 (7%) articles): The Verification and Validation category holds articles that specifically address the prediction and verification of software reliability through ML.
• Testing Effort Estimation (4 (2%) articles): Testing Effort Estimation comprises articles that estimate the amount of testing effort required to test a software system using ML techniques.
The Maintenance stage has three categories.
• Software Maintainability Prediction (3 (1%) articles): Software Maintainability Prediction holds articles that employ ML techniques to assist in predicting maintainability metrics appropriate for specific software projects.
• Software Aging Detection (5 (2%) articles): Software Aging Detection comprises articles that use ML to detect software maturity and aging in terms of resource depletion over time, such as memory leaks and high CPU usage.
• Maintenance Effort Estimation (1 (0.4%) article): Maintenance Effort Estimation contains articles that estimate effort required for the maintenance of a software system using ML.

Q1.3 ML type and techniques:
The purpose of this question is to understand which types of ML are being employed in the selected articles, as shown in Fig. 6. As shown in Fig. 8, ML techniques were mostly employed to solve problems related to the Quality Assurance and Analytics stage. Decision Trees were again the most commonly used technique here (23 articles), followed by Support Vector Machines (19 articles). Random Forest and Naive Bayes were next in line with 17 articles apiece. Artificial Neural Networks, used in 12 articles in the Quality Assurance stage, were also a subject of interest for researchers working in the Architecture and Design stage (8 articles). Although all ML techniques have certain pros and cons, the selection of the most suitable technique depends on the type of dataset being constructed or employed. In general, Decision Trees appeared to be the most heavily employed technique among the articles due to their simplicity and strong classification and regression capabilities [9, 65, 16].
Q2.1 Contribution facet of the articles: The contribution facet addresses the novel propositions of the articles. This represents the current state-of-the-art and enables researchers and industrial practitioners to get an overview of the existing tools and techniques in the literature. As shown in Fig. 9, 97 out of 227 (43%) articles focused on approaches/methods, followed by 54 (24%) articles proposing models/frameworks, 23 (10%) articles focusing on comparative analysis of existing techniques, 12 (5%) articles focusing on tools and 6 (3%) articles focusing on algorithms/processes. The rest of the articles, 35 out of 227 (15%), reported no new propositions. These articles were either investigating existing approaches, performing comparative studies, discussing opinions, or reporting experiences. Table 3 shows the names of the propositions along with the contribution facet and the references of the articles. Interestingly, only 23 out of 227 (10%) articles explicitly named their propositions.
Q2.2 Research facet of the articles: The Research facet describes the nature of the articles in terms of the purpose of conducting the research. Fig. 10 shows the articles by the research facet. The evaluation facet represents the type of evaluation performed in the articles in order to assess the propositions. The articles by the evaluation facet are shown in Fig. 11. Controlled experiments were performed in 130 out of 227 (57%) articles, followed by case studies in 46 out of 227 (20%) articles and surveys in 14 out of 227 (6%) articles. 2 out of 227 (1%) articles employed both a controlled experiment and a case study for an empirical evaluation, whereas the rest of the articles, 35 out of 227 (15%), did not use any empirical method for evaluation purposes. Moreover, we found no article employing ethnography or action research as an empirical method for evaluation. Among the articles that performed controlled experiments, 63 proposed approaches/techniques/methods and 36 proposed models/frameworks.
Q2.3 Datasets: This question refers to the datasets that have been used in the articles in order to evaluate their proposed approaches or comparative studies. Evidently, a wide spread of articles employed JAVA applications, followed by repositories made publicly available by NASA. StackOverflow, Github and Promise repositories have also been addressed in various studies. Fig. 12 shows a word cloud of the datasets most commonly used in the articles. The size of a term indicates its frequency: the greater the size, the more occurrences in the articles.
Q3.1 Trends in terms of year: This refers to the trends in terms of the publication years of the articles, showing the evolution of the adoption of ML for SE. As shown in Fig. 13, the use of ML for SE is consistently growing. One can also observe an exponential growth in this trend from 2016 to 2018, with 2018 being the highest publication year with 63 (28%) publications. In 2019, we recorded relatively fewer publications: 45 out of 227 (20%). There could be two plausible reasons for this. Either some articles are still in press (as this study was conducted at the start of 2020) or, like any hype cycle, the peak of inflated expectations regarding ML for SE was reached in 2018 and the trend is now slowly moving towards the trough of disillusionment.

Discussion
This section relates to the fourth goal of this study (G4) and deals with implications and analysis of the aforementioned articles. Here, we elaborate the challenges, limitations and future directions in this field.
Quality Assurance and Analytics (52%), being the SE stage with the highest number of ML-related articles, shows that software quality is of prime focus for researchers, while the Architecture and Design, Implementation, and Requirements stages are the next most targeted stages. Quality Assurance, Design, and Requirements are indeed human-centric stages of the SE life cycle, and the high number of articles highlights the fact that ML is able to address the problems in these areas. To get a better understanding of the distribution of articles, we classified them as an MLSE taxonomy. The proposed taxonomy helps in understanding the general categories that encapsulate the applications of ML specifically aimed at facilitating SE stages in the literature. It also shows which stages are covered the most and which (might) need more exploration. As can be seen in Table 2, Fault/Bug/Defect Prediction has been the major focus, as most articles emphasized it. From Table 2, one can also see that the Maintenance stage has been the least interesting area for researchers. We encourage researchers to investigate how ML can be used to automate certain tasks in this area. We further encourage researchers to adopt combinations of ML techniques and to use diverse datasets from different sources to train the ML models, so that the applicability of the techniques can be generalized, as also observed in [116, 131, 190, 197].
We found that only 4 out of 227 (2%) articles used reinforcement learning, as shown in Fig. 6. This implies little interest among researchers in the applications of reinforcement learning to SE. Reinforcement learning has proven beneficial in solving complex problems, especially in healthcare, business and robotics [67]. Thus, we believe it would be an interesting area to explore in terms of facilitating SE. Our findings also show that simple neural networks (39 out of 227 (17%)) and shallow neural networks (containing one or more hidden layers) (35 out of 227 (15%)) are the most widely used ML techniques in SE in general. Moreover, Boosting, Naive Bayes (NB) and Case-Based Ranking (CBRank) techniques were particularly popular in requirements engineering.
192 out of 227 (85%) articles suggest that evidence-based research is a focus of researchers in this domain. Moreover, the high number of controlled experiments (130 out of 227 articles (57%)) implies that the propositions are being compared against benchmarks and that, overall, the research is progressing on an evidence basis. The demographics also suggest that the interest of researchers in this area is rapidly growing.
Addressing the fourth goal mentioned in Section 2.1.1, many researchers reported the uncertain and stochastic nature of their approaches and differences in the captured data and results, e.g., differences in the output values of a deep learning model when executing it multiple times over the same input data [35, 59]. Researchers also found that the availability of sufficiently labeled and structured datasets is quite a challenge [106, 107, 170]. Moreover, imbalanced sizes of software projects and datasets were pointed out as major obstacles in evaluating the techniques empirically [70, 207]. Lack of generalizability and overfitting appeared to be the biggest limitations in the articles, as the ML models performed worse when applied to diverse cross-project datasets [122, 144]. Future directions include improving precision while maintaining recall in ML models [70]. Researchers also emphasized improving the prediction accuracy of ML models by conducting more experiments using larger numbers of datasets and software applications [116, 131, 190, 197]. Furthermore, the evaluation of similar studies with alternative ML techniques is suggested by researchers, which can further strengthen the knowledge base in terms of prediction capability [11, 48, 73, 186].

Threats to Validity
Like other secondary studies, this study is prone to some validity threats. The threats and their mitigation strategies are described in this section.

Internal Validity
The extraction of articles and the choice of repositories constitute a threat to internal validity. Moreover, the screening of articles and the risk of our own bias also make the study prone to this type of validity threat. To mitigate this threat, we ensured that our search strategy yielded relevant articles through iterative refinement of the query. Each article was reviewed by the first author of this study, which may threaten the reliability of the results; this threat was reduced by double-checking of the articles by the second author. In order to prevent the risk of bias, the articles underwent our defined QAP in a randomly distributed fashion. The few additional articles found through snowballing suggest that we succeeded in devising a robust query.

External Validity
We believe that the wide scope of our query formulation and the stringent exclusion/inclusion criteria have yielded a wide variety of articles that represent a significant and sufficient part of the research area, thus mitigating the generalizability threat to a significant extent.

Construct Validity
The adopted research methodology and protocol, and the data extraction process followed in this study, are entirely based on established secondary study guidelines, such as [101, 157, 158], which reduces the threat to construct validity.

Conclusion
The conclusions of this study are manifold. We have provided an overview of the state-of-the-art in the area of machine learning for software engineering by evaluating carefully selected studies. We also proposed a classification scheme in the form of the MLSE (machine learning for software engineering) taxonomy that highlights the overall applications of machine learning for software engineering in terms of SE life cycle stages. The taxonomy shows the primary focus of researchers on specific stages. This observation is one of the major contributions of this study. The study also reveals that the primary studies in the domain of ML and SE are evidence-based, with the techniques being empirically evaluated by the researchers. Although this research area still shows an upward trend in terms of the number of publications, further primary studies need to be conducted that emphasize other, less explored SE life cycle stages such as requirements engineering, maintenance and cost estimation.
The challenges faced by the researchers and reported in the articles should motivate and further guide researchers. These challenges also indicate the presence of known and unknown obstacles that researchers have come across or have not been able to solve while conducting their research, implying a potential for ML in SE alongside obstacles in terms of usefulness. The limitations pointed out by the articles show an inclination towards not having enough resources or not being able to overcome certain aspects of the domain, considering that the domain is still in its infancy. We also believe that this study provides the necessary impetus and further motivation to explore areas that have received less attention to date.