Introduction
Today, software is an indispensable component of most systems and is integrated into the daily life of society. With advances in technologies such as open systems and highly automated or networked devices, software systems are becoming very complex [1]. Additionally, a software project usually requires people from several areas of expertise, which further increases its complexity. Since software is developed by human beings, mistakes are common; thus, errors occur in every commercial software product [2], and error rates rise as complexity increases [3]. Therefore, errors need to be detected and removed as early as possible in the development process. To improve software quality, various activities are performed under the heading of software testing. This process is an economically and technically vital component of a high-quality software product [2] and an integral part of the software development life cycle, intended to produce more reliable and higher-quality software products [4]. For systems with zero tolerance for error, such as medical systems and space missions that directly affect human safety, as well as banking systems, ensuring a high-quality software development process becomes even more critical.
In the literature, a very large number of studies have examined software testing from different perspectives. However, only a few early systematic reviews have analyzed research studies to show the trends, developmental stages, and topics related to software testing. This information is critical for creating a big picture of software testing research, which can guide decision makers, practitioners, and educators in the field to improve their current systems and thus significantly improve the quality of the software product [5]. These earlier systematic reviews were also conducted from a limited perspective. Currently, no study has aimed to analyze all major research articles in the domain of software testing. Accordingly, this study aims to fill this gap by analyzing the articles addressing software testing to create a big picture of the domain. Considering this background, the methodology of the study was designed to investigate the following research questions (RQ):
RQ 1:
What are the bibliometric characteristics of software testing studies?
RQ 2:
What are the software testing strategies and themes?
RQ 3:
How do the trends of software testing strategies and themes change over time?
Background of the Study
Testing is defined as “an activity in which a system is executed under specified conditions, the results are observed or recorded, and an evaluation is made of some aspect of the system” (ISO/IEC 24765, 2006) [6]. In parallel to this definition of testing, a major task of the software development process, software testing is defined as the process of observing and demonstrating the behavior of a software system for compliance with its specifications [7]. As it requires several strategies and techniques with the involvement of several tools and resources, software testing is also considered as a complex task [8]. The background for this study is given below, summarizing the important role of software testing in the software development life cycle, potential impact of software testing strategies, and review studies conducted on software testing.
A. Importance of Software Testing in Software Development
Software testing covers several activities of the software development processes starting from the validation of initial requirements through to the acceptance of the end product by the customer [9]. Starting from the requirement specifications, the software testing tasks need to be planned and implemented in different stages of the software development process. Furthermore, software testing needs to be performed during different stages of the software development process for different purposes, such as the testing of the software product lines [10] and the graphical user interface [11].
Software testing is usually conducted in the three stages of creating, executing, and evaluating the test cases [12], [13]; thus, the creation of appropriate test cases is critical [14], [15]. In other words, the fit between test cases and software features, such as the technology used, the domain in which the software will be used, and the end-user skills, is a critical factor in a successful testing process. Matalonga et al. defined the following seven elements to compose a test case: item (product/functionality under test), input (input variables that will stimulate the test item), output (response returned by the test item after receiving a test input), oracle (expected result, predicted behavior under specified conditions based on its specification or another source), result (comparison between the test output and the test oracle), environment (facilities, hardware, software, firmware, procedures, and documentation intended for or used to perform the software testing), and script (procedure specification for manual or automated testing) [16].
An analysis of the whole software development process reveals that the testing stage has the longest duration and is the most expensive phase [17] involving labor-intensive tasks [18]. As the software testing process is usually performed with limited resources under time constraints, currently, several research studies are being conducted to improve software testing techniques in order to obtain higher-quality and more reliable software products [19].
B. Potential Impact of Software Testing Strategies
Different software technologies require various testing methodologies and strategies. For instance, testing approaches on context-aware software systems [16], semantic web-enabled software testing [19], testing embedded software systems [20], mobile systems [21], or testing in service oriented architectures [22], [23] may require different strategies. Accordingly, several research studies have been conducted to improve the software testing methods and approaches specific to the technologies being used.
There is also a need for a better estimation of testing effort which may be related to the software technology and is important for completing its processes appropriately. For instance, as a result of a systematic literature review study, Kaur and Kaur reported that it was possible to improve the existing testing effort estimation techniques of mobile applications by weighting the specific characteristics and considering suggestions from experienced developers and testers [24].
Other studies and strategies are also needed to improve the testing process itself. Examples include test case prioritization approaches in regression testing [25], [26], designing the software testing processes [27], reducing regression testing costs [28], and using genetic algorithms rather than pure random testing [1], [2]. Deciding on the appropriate testing strategy for the testing process is another challenge [29].
To summarize, some heterogeneity and ambiguity exist among the different concepts dealing with testing methods and processes; therefore, Tebes et al. analyzed software testing ontologies to conceptualize software testing concepts, concluding that there was a lack of addressing non-functional software requirements and static testing terminological coverage where none of the ontologies directly linked functional and non-functional software requirements [30]. Accordingly, understanding the trends and development stages of software testing is critical for the development of conceptual models for software testing methods and processes. The literature on software testing provides a large number of studies, regarding both general and specific issues; however, among these studies, there are only a few reviews evaluating the trends and topics related to software testing, which are summarized in the next section.
C. Review Studies on Software Testing
Barmi et al. conducted a systematic review to better understand the connections between the specifications and testing requirements and reported that “Model-based testing” was the most commonly studied topic (26%), followed by “Formal Approaches” (24%) and “Traceability” (18%), and concluded that there was a significant gap between the specification and testing requirements [31]. Their study considered the relationship between the specifications and testing requirements rather than the whole process of software testing. In this context, Garousi and Mäntylä reported that over 101 secondary research studies had been published in the area of software testing since 1994, with model-based software testing being the most popular method, web services the most popular system, and regression testing the most popular testing phase [32]. Since this was a tertiary study (a study of secondary studies), it has limitations in showing the whole picture of the software testing studies. Zein et al. performed a systematic mapping study to reveal testing techniques for mobile applications and mapped 79 empirical studies to a taxonomy [33].
There are also several topic modeling studies conducted in the field of software engineering [34], [35], [36]. However, to the best of the authors’ knowledge, no topic modeling study has applied text-mining analysis to the software testing area while considering the whole software testing process through its literature from its early years to today. Using the methodology described below, this study aims to fill this gap in the literature and provide a larger picture of the software testing field.
Research Methodology
In this study, a semi-automated methodology was developed in order to analyze the empirical corpus consisting of the software testing articles. The methodology of the study was based on the implementation of Latent Dirichlet Allocation (LDA) [37], a probabilistic topic modeling algorithm used to discover hidden semantic patterns on the software testing corpus created in two consecutive stages.
In this context, the research methodology designed in accordance with the purpose of the study consisted of the following stages (see Figure 1). Initially, the experimental corpus of this study was prepared. Afterwards, the data preprocessing was applied to the corpus, which was followed by LDA implementation. Finally, interpretation and visualization procedures were conducted. This methodology is described in detail below.
A. Creation of Software Testing Corpus
Testing is a comprehensive concept related to the development of each system. In the software engineering discipline, testing is a crucial task of the software development life cycle. In contrast, software testing in any field other than software engineering can be considered as an end-user testing focused on the suitability of a software developed for a specific purpose in this field. For this reason, the multidisciplinary use of software testing makes it difficult to create a specific corpus of software testing studies in the scope of software engineering. In this context, to create a specific corpus of software testing within the scope of software engineering, a methodology including two sequential stages was followed for corpus creation, which included identifying core publication sources for the software engineering field and extracting articles specific to software testing.
From this perspective, we first identified the publication sources (core conferences and journals) within the scope of the software engineering field. As a result, the 44 core publication sources (28 conferences and 16 journals) specific to the software engineering field, which we identified in our previous study [38], were used as the data source for creating the software research corpus of this study.
After the software research corpus was created, articles specific to software testing were extracted from it. To this end, keywords related to software testing were selected using an iterative process, which identified the terms “test*”, “fault*”, “bug*”, “debug*”, and “defect*”.
During this iterative process, we initially searched the software research corpus for articles containing the term “test*” and filtered them. We then examined the keywords in the filtered articles and found that the term “fault*” appeared frequently in them, so we added “fault*” as a second keyword to the search string. Next, we searched for articles with the terms “test*” or “fault*”, re-examined the keywords that appeared frequently in these articles, and added “bug*”, another high-frequency term, to the search string. Finally, we added the terms “debug*” and “defect*” to the search string, repeating these sequential steps each time.
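The wildcard matching used in this filtering step can be illustrated with a minimal Python sketch. It is not the SCOPUS query itself: the article dictionaries, their field names (title, abstract, keywords), and the helper functions are assumptions made for illustration only.

```python
import re

# The five wildcard terms reported in the text; "*" stands for any word ending.
SEARCH_TERMS = ["test*", "fault*", "bug*", "debug*", "defect*"]


def compile_terms(terms):
    """Turn wildcard terms such as "test*" into one case-insensitive regex."""
    parts = []
    for term in terms:
        if term.endswith("*"):
            parts.append(re.escape(term[:-1]) + r"\w*")
        else:
            parts.append(re.escape(term))
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)


def matches(article, pattern):
    """True if the title, abstract, or any author keyword contains a term."""
    fields = [article.get("title", ""), article.get("abstract", "")]
    fields.extend(article.get("keywords", []))
    return any(pattern.search(field) for field in fields)


def filter_corpus(articles, terms=SEARCH_TERMS):
    """Keep only the articles that match at least one search term."""
    pattern = compile_terms(terms)
    return [a for a in articles if matches(a, pattern)]


# Hypothetical records, used only to show the behavior of the filter.
articles = [
    {"title": "Regression testing at scale", "abstract": "", "keywords": []},
    {"title": "Code review habits", "abstract": "We study reviewers.",
     "keywords": ["Debugging"]},
    {"title": "Requirements elicitation", "abstract": "Interviews with analysts.",
     "keywords": ["survey"]},
]
selected = filter_corpus(articles)
```

Calling filter_corpus with only the first term and then with progressively longer term lists mirrors the iterative expansion of the search string described above.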
Then, the articles containing these five keywords (“test*”, “fault*”, “bug*”, “debug*”, “defect*”) in the title, abstract, and author keywords were searched. In conclusion, the search string was finalized by adding the time period (1980-2019) and language (only English) criteria of the articles. As a result, the final version of the search query was created as follows:
((EXACTSRCTITLE (“Empirical Software Engineering” OR “Information and Software Technology” OR “Journal of Systems and Software” OR “IEEE Transactions on Software Engineering” OR
This search string created with these criteria was employed on SCOPUS, a bibliometric database that indexes all journals and conferences selected for the corpus [38], [39], [40]. As a result of this search carried out on July 28, 2020, a software testing corpus was created containing 14,684 articles (9,205 conference proceedings, 5,349 research articles, and 130 review articles) published in English over the last 40 years. This empirical corpus contains only the title, abstract, and author keywords of each article because these sections best describe the characteristics of an article, such as purpose, method, conclusion, and scope [35]. The distribution of the numbers of these articles in the corpus by publication sources and years is given in the results section.
B. Data Preprocessing
Data preprocessing is a critical step for the success of analyses based on text mining and natural language processing [41]. It transforms textual data into a form that can be analyzed more effectively so that machine learning algorithms can perform better [42]. In this regard, to prepare the software testing corpus for probabilistic topic modeling, a series of text processing steps was applied to the corpus in sequence. As a first step, word tokenization was performed to separate the texts in the corpus into single tokens (words). This was followed by converting all text to lowercase. Subsequently, publication source titles, links, misleading words, special characters, and punctuation were removed. The stop words (is, and, a, an, the, of, for, etc.), which have a high frequency in English and carry little meaning on their own, were also deleted [39]. The Snowball stemming algorithm [43] was applied to the remaining words to reduce different variations of words derived from the same root to a single root form. Moreover, to investigate the high-frequency phrases in the software testing corpus, the N-gram based text categorization approach was applied at the word level, and high-frequency phrases were thus identified as unigrams, bigrams, and trigrams [41]. Subsequently, each article in the empirical corpus was represented as a word vector, providing a numerical representation of the texts in the corpus. Finally, a document-term matrix, which is the numerical matrix form required for topic modeling, was created by combining these vectors [44].
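The preprocessing steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the pipeline used in the study: the stop-word list is deliberately tiny, and toy_stem is a crude suffix stripper standing in for the Snowball stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a full English list.
STOP_WORDS = {"is", "and", "a", "an", "the", "of", "for", "in", "to", "on"}


def toy_stem(word):
    """Crude suffix stripper standing in for the Snowball stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Word-level n-grams, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, max_n=1):
    """Return (vocabulary, rows) where rows[d][j] counts term j in document d."""
    counted = []
    for text in docs:
        tokens = preprocess(text)
        terms = [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]
        counted.append(Counter(terms))
    vocab = sorted(set().union(*counted))
    rows = [[c[term] for term in vocab] for c in counted]
    return vocab, rows


vocab, rows = document_term_matrix(["Testing the software tests", "Software defects"])
```

Setting max_n to 3 adds bigrams and trigrams to the vocabulary, which corresponds to the phrase-level analysis described in the text.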
C. LDA Implementation
Topic modeling is an approach that provides for semantic analysis and understanding of the themes in large collections that contain unstructured textual content [37], [44]. In this way, it offers perspectives for the analysis, modeling, understanding, and summarizing of huge collections, which include a large number of text documents. LDA is one of the widely used topic modeling algorithms and an unsupervised method for probabilistic topic modeling to discover groups of words called “topics” in a text document [37], [44]. In the LDA model, each document is assumed to consist of a collection of topics and each word in the document corresponds to one of these topics. These topics can be defined as a set of words that are frequently used together and often reveal a common theme. The topics discovered by LDA, represented by predefined word sets, are considered as a tool to best describe the entire document semantically [37], [44], [45]. In this study, the fitting and implementation of the LDA [37] topic modeling technique with Gibbs sampling [46] to the empirical corpus of software testing was achieved using the tmtoolkit package [47], an effective toolkit developed in Python that includes a wide spectrum of tools for text mining and topic modeling approaches. In order to fit the LDA model to the software testing corpus, the values of the prior parameters (
D. Interpretation and Visualization
The scope and consistency of the 42 topics and their temporal trends discovered by LDA were evaluated and interpreted at this stage, taking into account the background and dimensions of software testing, which was the context of the study. Each of these 42 topics contained top 15 descriptive keywords reflecting the characterization of the topics. Taking into consideration the first five of these keywords with the highest frequency, the topic labelling process was performed manually for each topic [35], [39]. Furthermore, the distribution percentage of each topic per document and the distribution percentage of the topics in the entire corpus were calculated, and the annual changes of these percentages were interpreted and visualized for each topic, and a taxonomy that reflects the evolution of software testing from past to present from a panoramic perspective was proposed.
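As a rough illustration of this stage, the sketch below extracts the top keywords of each topic from a topic-word matrix and computes the share of each topic in the corpus. The matrices and vocabulary are hypothetical, and assigning each document to its dominant topic is an assumption; averaging the per-document topic proportions would be an equally plausible reading of the text.

```python
def top_keywords(topic_word, vocab, k=15):
    """Return the k highest-weight words for each topic."""
    result = []
    for weights in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda j: weights[j], reverse=True)
        result.append([vocab[j] for j in ranked[:k]])
    return result


def corpus_share(doc_topic):
    """Percentage of documents whose dominant topic is each topic (assumption)."""
    n_topics = len(doc_topic[0])
    counts = [0] * n_topics
    for row in doc_topic:
        dominant = max(range(n_topics), key=lambda t: row[t])
        counts[dominant] += 1
    return [100.0 * c / len(doc_topic) for c in counts]


# Hypothetical model output: 2 topics, 4 vocabulary terms, 4 documents.
vocab = ["bug", "mutat", "test", "report"]
topic_word = [[0.1, 0.3, 0.5, 0.1], [0.6, 0.0, 0.1, 0.3]]
doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]]

keywords = top_keywords(topic_word, vocab, k=2)
shares = corpus_share(doc_topic)
```

The keyword lists produced this way are what a human labeller would inspect when naming each topic manually.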
Results
First, the results of the study are given descriptively to provide the general figure, followed by the topic modeling analysis, and temporal analysis.
A. Descriptive Analysis (RQ1)
Table 1 shows a total of 14,684 articles related to software testing that were analyzed in this study. The volume of articles published in each five-year period can be seen to continually increase.
Table 2 lists the publication sources [38], their type as conference (C) or journal (J), the number of articles selected from each source (N), and their percentage of the total number of articles in the software testing corpus.
The data created using the procedures described in the research methodology section was analyzed first to understand the keywords’ unigram, bigram and trigram distributions. As seen from Table 3, the keyword “test” had the highest unigram ratio for all studied articles (67.64%) whereas “software develop” had the highest bigram ratio (12.12%) and “open source project” had the highest trigram ratio (2.65%).
B. Topic Modeling Analysis (RQ2)
The LDA-based topic modeling analysis identified 42 topics describing the software testing strategies. The top 15 keywords of each topic and their ratio in the corpus are given in Table 4. The topic names are given by considering the first four keywords classified under each topic. The topics in Table 4 also illustrate software testing strategies, so the terms “topic” and “software testing strategies” are used interchangeably throughout this paper. These topics are listed in Table 4 according to their ratio in the whole corpus, based on the number of articles published under each topic: “Test Generation” had the highest ratio (5.85%), and “Security Vulnerability” had the lowest (1.21%).
C. Temporal Trends of the Topics (RQ3)
In order to better understand the temporal trends of the discovered topics and their temporal developmental ages, the percentage of each topic was analyzed in five-year periods. As a result, the percentage of the topics in the corpus (C%), and percentage of the topics in the same yearly period (Y%) are given in the table in Appendix-A. In addition, the average acceleration (AC) value for each year was calculated by subtracting the Y% of the previous year (percentage of the topics in the same year period) from that of the current year. Considering these yearly AC values, the five-year average acceleration values for each topic were then calculated and presented in Appendix-A. Furthermore, the overall AC values of each topic over the last 40 years are given in the last column of the table in Appendix A.
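The acceleration calculation described above can be written out as a short Python sketch. The function names and the article counts are hypothetical; the sketch also assumes that every year in the range has at least one article, so that consecutive dictionary keys are consecutive years.

```python
def yearly_share(topic_counts, total_counts):
    """Y%: the topic's share of all articles published in each year."""
    return {year: 100.0 * topic_counts.get(year, 0) / total_counts[year]
            for year in sorted(total_counts)}


def acceleration(y_pct):
    """AC for each year: Y% of that year minus Y% of the previous year."""
    years = sorted(y_pct)
    return {cur: y_pct[cur] - y_pct[prev] for prev, cur in zip(years, years[1:])}


def period_average(ac, start, end):
    """Mean AC over the years start..end (inclusive) that have a value."""
    values = [v for year, v in ac.items() if start <= year <= end]
    return sum(values) / len(values) if values else 0.0


# Hypothetical counts for one topic and for the whole corpus.
totals = {2015: 100, 2016: 200, 2017: 200}
topic = {2015: 2, 2016: 6, 2017: 10}

y_pct = yearly_share(topic, totals)
ac = acceleration(y_pct)
avg = period_average(ac, 2015, 2019)
```

With these counts the topic's yearly share rises from 2% to 3% to 5%, so the yearly AC values are 1 and 2 percentage points, and their period average is 1.5. Calling period_average over the full 1980 to 2019 range gives the overall value in the same way.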
Considering these overall AC values (see Appendix A – AVG), we identified whether the acceleration values of the topics were positive (increasing) or negative (decreasing) from 1980 to 2019. Next, we illustrated the top ten topics with positive AC values in Figure 2 and the top ten topics with negative AC values in Figure 3. As seen in Figure 2, the topic “Prediction” had the highest acceleration (0.11), followed by “Empirical Evaluation” (0.10), “Source Code” (0.09), and “Bug Reporting” (0.09). On the other hand, Figure 3 revealed that “Programming Tools” (−0.21), “Language Specification” (−0.19), “Graph Algorithms” (−0.18), and “Database” (−0.18) were the top topics with negative AC values.
In order to visualize our findings given in Appendix-A and to provide a better understanding of the temporal changes in the trends of the topics, we presented acceleration graphs of the top ten topics with positive accelerations (see Figure 4) and the top ten topics with negative accelerations (see Figure 5). In Figures 4 and 5, the blue lines show the acceleration (AC) values calculated for each five-year period for that topic and the red line shows the linear trend-line that enables predictions of the near-future trend of that topic. Taking into account Figures 4 and 5, a number of implications can be drawn as to which software testing strategies will dominate and which will withdraw in the near-future. Temporal changes in volume and acceleration of other topics can be seen in Appendix-A.
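A linear trend-line of this kind is typically obtained by an ordinary least-squares fit. The sketch below shows one way to compute it in plain Python; the period indices and AC values are hypothetical, and the text does not state which tool was used to draw the trend-lines.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(xs, ys, x_next):
    """Value of the fitted trend-line at a future position x_next."""
    slope, intercept = linear_trend(xs, ys)
    return slope * x_next + intercept


# Hypothetical AC values of one topic for four consecutive five-year periods.
periods = [0, 1, 2, 3]
ac_values = [1.0, 3.0, 5.0, 7.0]

slope, intercept = linear_trend(periods, ac_values)
next_value = extrapolate(periods, ac_values, 4)
```

A positive slope suggests that the topic is still gaining ground, while a negative slope suggests that it is receding; the extrapolated value is only a rough indication of the near-future trend.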
D. The Latest Trends in the Topics (RQ3)
Due to the rapid paradigmatic transformations in software technologies, we specifically analyzed the trends in software testing strategies in the last five years, from 2015 to 2019. The recent acceleration values of the topics during this period were calculated and are presented in Figures 6 and 7. Specifically, Figure 6 shows the top ten topics with positive acceleration values from 2015 to 2019, and Figure 7 shows the top ten topics with negative acceleration values over the same period. As seen in Figure 6, interestingly, the topic “Prediction” had a significantly higher recent acceleration (0.75) than the other topics. Here, it should be noted that although the average volume of the topic “Security Vulnerability” was the lowest (see Appendix-A, 0.15), its recent acceleration was one of the highest (see Appendix-A, 2015-2019, 0.30). A similar trend was observed for the topics “Open Source” and “Mobile Applications”, whose average volumes were lower than those of the other topics (see Appendix-A, 0.20 and 0.16, respectively) but whose recent accelerations were among the highest (see Appendix-A, 2015-2019, 0.29 and 0.24, respectively). On the other hand, as emphasized in Figure 7, the topics “Web Applications” (−0.23), “Fault Detection” (−0.22), and “Project Management” (−0.22) had the lowest recent accelerations.
E. Developmental Ages of Software Testing (RQ3)
To provide a better understanding of the developmental ages of software testing strategies over the last 40 years from 1980 to 2019, the top ten topics of each five-year period were identified and presented in Table 5. The topics newly included in the top ten in each period are highlighted in bold (see Table 5). To demonstrate the changes in software testing over this timeline more clearly and to define its developmental ages, we visualized the top five topics in each five-year period and presented them in Figure 8. As shown in Figure 8, from 1985 to 1995, topics such as “Model Reliability”, “Programming Tools”, and “Fault Detection” were in the top five list of all topics. The testing processes during this period can be considered more programming environment-oriented and fault detection-based. Accordingly, this period was labelled the “Detection Age”, which can be considered programming-oriented. After 1995, topics such as “Empirical Evaluation” and “Test Generation” became dominant; thus, the period from 1995 to 2005 was referred to as the “Generation Age” of software testing. “Testing Practices” then became one of the dominating topics, with the period from 2005 to 2015 being called the “Evaluation Age”. Interestingly, after 2015, the topic “Prediction” became one of the dominating topics, indicating a change in the field, and this period was named the “Prediction Age”.
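The ranking behind such a table can be sketched as follows. The period labels and topic shares are hypothetical, and treating a topic as newly included when it was absent from the top list of the immediately preceding period is an assumption about how the bold entries were determined.

```python
def top_topics(period_share, k=10):
    """Topic names ranked by share within one period, highest first."""
    ranked = sorted(period_share, key=lambda name: period_share[name], reverse=True)
    return ranked[:k]


def newly_included(shares_by_period, k=10):
    """For each period, the top-k topics that were not in the previous top-k."""
    result = {}
    previous = None
    for period in sorted(shares_by_period):
        current = top_topics(shares_by_period[period], k)
        if previous is None:
            # Assumption: nothing counts as new in the very first period.
            result[period] = []
        else:
            result[period] = [t for t in current if t not in previous]
        previous = current
    return result


# Hypothetical topic shares (%) for two five-year periods.
shares = {
    "2010-2014": {"Test Generation": 6.0, "Testing Practices": 5.0, "Prediction": 2.0},
    "2015-2019": {"Test Generation": 5.5, "Testing Practices": 3.0, "Prediction": 6.5},
}
new_entries = newly_included(shares, k=2)
```

In this toy input, “Prediction” enters the top two in the later period, which is the kind of shift used in the text to name a developmental age.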
Discussion
This study analyzed the last 40 years of software testing studies and provided several contributions to the software engineering field. These contributions are summarized below under the following headings: systematic methodology for corpus-based topic modeling, the wide spectrum of software testing topics, insights into the methods and strategies, developmental ages of software testing, and the future outlook for software testing. Finally, the limitations and suggestions of the study are presented.
A. Systematic Methodology for Corpus-Based Topic Modeling
The first contribution of this study is the proposed two-stage corpus creation methodology, which was used due to the challenges in creating an appropriate search term for selecting articles related to the software testing domain. This two-stage corpus creation approach has not been reported in earlier studies, and accordingly this is a significant contribution to the corpus creation method of mapping studies. For certain specific domains such as software testing, the proposed methodology improves the existing approaches. As the corpus is very important for mapping studies, our corpus creation methodology is expected to improve future studies significantly.
B. Insights into the Methods and Strategies
In the software testing stages, the aim is to develop software-oriented products and services in a systematic and efficient manner, using a wide range of tasks, methods, and strategies. Depending on the type, scope, and context of the software designed and developed, the methods and strategies chosen during the software testing stages vary considerably. The findings of this study offer a wide-ranging insight into not only the themes and trends in focus but also the tools, tasks, methods, and strategies specific to software testing. Specifically, the discovered topics reveal that the tasks receiving the most attention in software testing are specification, transformation, detection, localization, generation, evaluation, optimization, verification, and prediction. The importance of these core tasks for software testing has also been addressed by previous studies [49]. Likewise, the findings draw clear attention to methods and strategies, such as “Test Generation”, “Empirical Evaluation”, “Fault Localization”, “Regression Testing”, “Mutation Testing”, “Program Analysis”, “Bug Reporting”, “Algorithm Optimization”, “Event Tracing”, and “Product Line Inspection”, which are revealed as discrete topics. Among these topics, which also emphasize methods and strategies, attention is drawn to “Test Generation”, “Empirical Evaluation”, and “Testing Practices” as the top three topics with the highest percentages. Hence, the results of these earlier studies and of the current study validate each other.
C. Five Developmental Ages from Specification to Prediction
The results indicate that the formation of the topics in software testing started in 1980, which roughly coincides with IBM’s release of its first mass-market personal computer (1981). From that time, as many users started to use software applications, testing became more critical. In the current study, starting from 1980, the developmental periods of software testing were classified under five developmental ages, namely specification, detection, generation, evaluation, and prediction. As indicated by Boehm, during the early 1980s, the testing process mainly focused on fixing bugs in the code [50]. This is also confirmed by our results, and we named the period between 1980 and 1985 the specification age of software testing. After 1985, defect detection and understanding the distribution of defects, as well as building connections between defects and requirements, were the concepts that influenced testing [51], [52]. Researchers started to develop methods to classify and mathematically model defects [53]. This is also supported by our results, which indicate the impact of fault detection on testing from 1985 to 1995, a period we called the detection age of software testing. Starting from 1995, test generation concepts began to be introduced by researchers such as [54], and this was also supported by our results, indicating the generation age between 1995 and 2005. As also reported by van Dam, the steps for software testing were taken after 1998, indicating the beginning of the software development age in the software testing process [55]. However, the systematic implementation of testing procedures started after 1995. Then, after 2008, testing started to be considered as part of the software development processes [55], which may support the current study’s findings on the evaluation age. Currently, since time is critical for software projects, automation in software testing procedures is reported as very important, and prediction is an inevitable part of it. Hence, studies on software testing indicate a trend toward better defect prediction and removal of bugs, and thus toward improved software quality [55].
D. Future Outlooks for Software Testing
In this study, considering the volume percentages of the topics, “Prediction”, a younger topic that started to be studied more often only after 1995 (see Appendix-A), was nevertheless one of the top five topics from 2015 to 2020 (see Figure 8). The average acceleration of “Prediction” (see Figure 2, 0.11) was also the highest among all topics, and its recent acceleration value was significantly higher (see Figure 6, 0.75) than those of the other topics, while its trend-line also indicates a steady increase (see Appendix-A). Additionally, although their average volumes were lower (see Appendix-A, 0.15, 0.20, and 0.16, respectively), the topics “Security Vulnerability”, “Open Source”, and “Mobile Application” also showed higher recent accelerations (see Figure 6) with an increasing trend-line (see Figure 4). These results suggest that in the next decade, the topics “Prediction”, “Security Vulnerability”, “Open Source”, and “Mobile Application” are likely to dominate testing studies. With the improvements in artificial intelligence studies, prediction in the testing process and improvements through the automation of the testing process can be expected to take place in the next decade [52], [56]. This trend is also supported by van Dam [55], who refers to the possible impact of artificial intelligence on testing automation studies. However, this does not mean that there will be no need for software test engineers or software developers; rather, it indicates that their roles and the way they work in a software life cycle are changing.
E. Limitations and Suggestions
This corpus-based topic modeling study, revealing the emerging themes and trends in the field of software testing from a panoramic perspective, provides a starting point and a methodological understanding for more in-depth research into the software testing phenomenon. In addition to the findings that reveal this background, this study has some limitations. The empirical corpus created for this study contained only articles published in core publication sources. In this respect, further research is recommended to expand the findings of this study using a comprehensive content analysis that includes a wider range of publication sources. In addition, the proposed semi-automated methodology can be applied to different sub-contexts of software testing processes, such as parallel computing, mobile applications, web applications, and open-source software systems, and more specific inferences can be obtained. As a result, deeper studies, which include specific content analysis on software testing and other sub-contexts of the software engineering field based on automated text mining and topic modeling, should be encouraged. The methodology of this study, which focuses on corpus-based LDA topic modeling, can be supported by approaches with different backgrounds, such as Non-Negative Matrix Factorization, Probabilistic Latent Semantic Analysis, and Hierarchical Dirichlet Process.
Conclusion
The results of this study offer insights into software testing through the analysis of a rich corpus. This methodology can be applied regularly to analyze the trends and developments in the field of software engineering. Providing regular feedback to decision makers, educators, researchers, and industry is critical for future-proofing software testing and for making appropriate strategic decisions. The results of this study show a current trend toward prediction, which signals a change in testing procedures and in the roles of software test engineers. Additionally, the findings indicate an increasing trend for the topics “Security Vulnerability”, “Open Source”, and “Mobile Application”. The results may provide valuable insights for the industry and software communities, helping them to be better prepared for possible changes in software testing procedures that use prediction-based approaches. From this perspective, new research can be conducted to better understand this change, so that educators can better prepare future test engineers with the necessary skills, the industry can adapt and develop its testing strategies in light of these signals of change, and decision makers can take this information into account in their future decisions.