Verifying Agile Black-Box Test Case Quality Measurements: Expert Review

Tests on software are performed to ensure it has no defects. Software testing companies and organizations claimed that the testing problem increased their time and budget for testing by over 50%. As a result, the product’s release is delayed, which attracts customer complaints. Testing involves test cases, which are a vital part of the process. Therefore, quality test cases likely to identify defects and meet users’ requirements must be chosen. This paper aims to identify and verify the characteristics, sub-characteristics, and metrics that construct the proposed model for producing high-quality black-box test cases in an Agile Software Development environment. An expert review approach was used to verify the proposed model. Ten academic and six industry experts contributed to this verification. The results showed that six characteristics, 22 sub-characteristics, and 56 metrics are accepted to be included in the proposed model. Furthermore, findings revealed that the proposed model is comprehensive, consistent, relevant, and well-organized for Agile projects.


I. INTRODUCTION
With the growth of technology, companies have become increasingly dependent on software solutions for their business purposes and operations [1]. As a result, companies are constantly optimizing their processes to increase the efficiency of software development and testing to fulfill stringent business requirements [2], [3]. Software quality has become the most important criterion in selecting the software to be adapted or adopted in any organization, and it is one of the success factors of IT organizations [4].
During software development, it is essential to ensure the quality of the software and add business value to organizations by performing software testing [5], [6]. Software testing is applied to ensure that software is released without defects [5], [7], [8], [9]. Previously, Tricentis [10] reported that software testing companies revealed more than $1.7 trillion in financial losses in 2017 caused by software failures.
The associate editor coordinating the review of this manuscript and approving it for publication was Mahmoud Elish.
Testing problems such as long testing time, high costs, software failures, and errors that cause extra expenses during maintenance can lead to poor-quality software [7], [11], [12], [13], [14], [15], [16]. Testers and organizations claimed that the testing problem might cause more than a 50% increase in their time and budget for the testing process [7], [17]. Furthermore, test case (TC) design is time-consuming and takes up between 40% and 70% of the overall time spent on testing [18]. Therefore, the number of TCs needs to be minimized by choosing the test suites with the greatest likelihood of uncovering bugs, to save time and resources [19]. But reducing the TCs may not solve the testing problems; even when effort is invested in testing to deliver exemplary programs, bugs are still prevalent in many projects [16], [20], [21], [22]. The root cause of these testing problems has been traced to the poor quality of TCs [13], [15], [16], [20], [21], [23]. Hence, it is essential to obtain good-quality TCs that have a high chance of exposing defects, give more accurate results, can increase the system's performance, lower costs, and meet the user requirements [18], [24], [25], [26], [27], [28], [29].
As already established in the literature from the Agile manifesto, fast delivery is one of the Agile values. Besides that, customers' satisfaction with the working software and response to their changes lead to critical and significant challenges in Agile testing [35], [36], [37], [38], [39], [40]. Accordingly, poor quality of TCs leads to undetected defects, wasted time, wasted resources, insufficient quality assessment, and bad-quality testing, which produces a poor product [7], [14], [15], [16], [17], [19], [21], [41], [42], [43], [44]. Therefore, additional effort is required to solve the problems; as a result, extra time is needed. This can delay the product's release, which attracts customer complaints. Consequently, the test case quality (TCQ) issue has attracted researchers, as demand for good-quality working software with rapid delivery has become more competitive for organizations due to the volatile nature of software testing [41], [45], [46], [47].
Numerous studies have been conducted to address the challenges associated with TCQ [15], [21], [24], [31], [48], [49], [50]. Despite the extensive efforts invested in testing, including significant time and resources, prevalent issues such as the proliferation of flawed TCs continue to impact the cost and duration of software projects [21]. However, current TCQ solutions are still insufficient in Agile testing practices, especially when there is a demand for global quality software delivered on time [34], [51]. This situation warrants further research into writing high-quality TCs by identifying the aspects that lead to poor quality and defining the characteristics that improve the quality of TCs.
This study intends to improve and enhance Lai's proposed TCQM model [15] by constructing the Goal-Oriented Measurement of Test Case Quality Assessment (GOTCA) Model. This model should also serve as a guide to practitioners and experts in understanding and executing TCs in accordance with the requirements of Agile Software Development (ASD). The authors conducted a previous study to investigate the initial characteristics and sub-characteristics of TCQ. This paper identifies and verifies the characteristics, sub-characteristics, and metrics via an expert review approach.
This paper explains the expert review of TCQ in the ASD environment. Section II presents the related work on TC quality. Section III introduces the proposed GOTCA model. Section IV details the process of the expert review approach. Section V describes the study results, and the conclusion is given at the end of the paper.

II. RELATED WORK
Testing problems such as time, costs, failed programs, and errors that cause extra expenses during maintenance [14], [15], [16] can lead to poor-quality software [30], [31], [32], [33], [34]. Therefore, there is a need to focus on the cornerstone of the testing process, which is the TC. A TC consists of predetermined conditions, inputs, and expected outcomes designed to ascertain the accurate implementation of the specified portion of the test item [52], [53]. TCs are prepared to give complete thought to the operationality of the application [41]. The TC should be good enough to produce a good testing process. According to Burnstein [54], there are many reasons to develop quality and practical TCs, such as a high probability of detecting defects, more efficient use of organizational resources, a greater chance of test reuse, closer adherence to testing and project schedules and budgets, and higher quality of software products. TCQ is very important for assuring the quality of software-intensive products [31], [55]. TCQ refers to the suitability of a TC to check the system quality, which is difficult to analyze [56]. A quality TC can be defined as a TC that (1) has a high chance of exposing defects with minimum effort, (2) has more accurate results, (3) can increase the performance of the system, (4) can decrease the cost, and (5) has a high chance of discovering unknown defects [25], [26], [27], [29], [57]. In addition, quality TCs produce simple, well-designed, continually refactored, and maintainable tests [58]. A few studies are concerned with TCQ. The following are some related studies.

A. TEST CASE QUALITY MODELS
Zeiss et al. [59] present a test specification quality model. They refer to a test specification as a set of TCs with similar pre- and post-conditions, which differs from the definition in ISO 29119 [60], i.e., the documentation of one or more TCs. Zeiss et al. [59] designed their model based on ISO 9126. They argue that using a specific quality model for test specifications is more convenient since they changed the names of some characteristics and interpreted them differently. Hence, they adapted their proposed quality model from the ISO 9126 standard. They instantiated their quality model for test specifications written in TTCN-3, which is used as the standardized language for the specification and execution of large test suites [61].
Their proposed model encloses characteristics, sub-characteristics, and metrics adapted to TTCN-3-based test specifications. Therefore, most defined metrics refer to aspects of the TTCN-3 language. Besides, Zeiss et al. [59] give no interpretation of the values, i.e., no threshold values or values to aspire to are defined that indicate good quality. Furthermore, this model does not consider ASD and is strongly based on the ISO 9126 standard. Thus, this model is unsuitable for assessing TCQ in the ASD environment.
Another similar study related to the quality model of TC specifications was conducted by Juhnke et al. [48]. They identified seven quality indicators of informal TC specifications in the automotive domain from a case study of 816 TC specifications specified by an OEM and suppliers. They conducted another study [62], identifying nine TC specification challenge groups. Juhnke et al. improved their model based on the ISO/IEC 25010 structure, and the terms of some characteristics also refer to this standard. Each characteristic has sub-characteristics, and metrics to measure them are defined using the GQM approach. However, the model focuses on automotive-specific TC specifications; thus, the model may not be suitable for other domains. Furthermore, the study did not clearly focus on the ASD environment.
Jovanovikj et al. [63] extended the model in [59] to evaluate and monitor TCQ. However, Jovanovikj et al. relied on the ISO/IEC 25010 quality standard instead of ISO 9126, which was used in [59]. They presented four quality characteristics, ten sub-characteristics, and six attributes as answers to the questions derived using the GQM approach. To verify their proposed approach and tool practically, they conducted interviews with some testers and quality managers, as well as a case study in the context of natural language tests. Nevertheless, their model needs to be improved. As they reported, many characteristics must be included in the model, and many quality attributes need to be defined to evaluate TCQ well. Moreover, the completeness of their model needs to be considered.
Adlemo et al. [32] used some attributes from Kaner [64] and from Atmadja and Shuai [65] in their project OSTAG (Ontology-based Software Test cAse Generation) for identifying and ranking the TCQ characteristics. Adlemo et al. [32] recognized and organized about 15 characteristics of good TCs based on testing experts. Unfortunately, these characteristics are not associated with related measures for measuring TCQ. Adlemo et al. [32] only identified and ranked the characteristics without describing how to measure them.
In addition, Bowes et al. [49] tried to answer the question "How good are my tests?" by presenting a list of 15 testing principles that capture the essence of testing goals and quantify them as test quality indicators. Bowes et al. stress the importance of correct, maintainable, and simple tests in their conclusions. They argue that TCs should be simple and meaningful to improve readability, especially in Agile. Nevertheless, they focused on test smells in unit testing and reported that some of their criteria were not implemented. Furthermore, these principles were identified for white-box TCs.
Kochhar et al. [21] provide 29 attributes of good TCs from practitioners' reviews. They grouped them into six characteristics: contents, size and complexity, coverage, maintainability, bug detection, and others. Although this study provides a large list of TCQ attributes, it did not describe how to measure them, and it is not clearly defined for an Agile environment.
Similarly, in [31], Tran et al. studied TCQ from the perspectives of professional developers and testers in Sweden. They identified 11 characteristics of the quality of manual TCs. In the same way, although this study was conducted very well, it did not measure these characteristics. Tran et al. tried to solve this drawback in a recent study [24].
They developed a comprehensive test artifact quality model. They used the ISO/IEC 25010 standard and an SLR to identify features and measurements of test artifact quality, where they identified 49 relevant studies. They categorized these features into nine characteristics. The model has 30 sub-characteristics in different aspects and 18 measurements; however, it is not specific to black-box testing and is not explicitly intended for the ASD environment. This model also has drawbacks, as they reported: (1) nineteen attributes have no information, (2) 12 out of 18 measurements have no descriptions, (3) the model has not been evaluated yet, and (4) the 20 attributes from ISO still need to be investigated.
The quality models mentioned above did not focus on TCQ in Agile projects. The following paragraphs present the quality models related to the TCs in the ASD environment.
Kayes et al. [41] proposed a metric called Product Backlog Rating (PBR) to measure and monitor the testing process in Scrum, but this metric needs further evaluation. It is focused on the testing process, not on TCQ. In another study on the Scrum method, Aamir and Khan [66] proposed an improved quality Scrum model by performing state-of-the-art testing activities. They account for a testing backlog to sustain TCs and deliver quality work. They surveyed experienced IT professionals working in companies that used Scrum. The survey highlighted the importance of adaptation and management of quality control activities in Scrum, such as TC planning, documentation, and execution. However, the study focused on the product backlog quality to enhance the product quality without referring to the TCQ, which is used to catch defects. Unudulmaz and Kalıpsız [67] and Harichandan et al. [68] proposed models to improve the Scrum process; unfortunately, they did not focus on TCQ.
Although Fischbach et al. [23] identified 16 quality characteristics for six Agile test artifacts, they focused on the Agile test artifacts in general. They proposed only one quality characteristic of unit TCs, namely code coverage, which is insufficient to measure TCQ.
These studies did not clearly address the problems of the quality of TCs in the ASD environment, in which testing is continuous. On the other hand, one study that focused on TCQ in continuous testing is that of Lai [15]. Lai proposed a test case quality measurement (TCQM) model based on four characteristics: qualified document, manageable quality, maintainable quality, and reusable quality. In line with the principles of the TCQM model, the procedure for managing the TCQ is designed for timely identification and control of defects of TCs and for efficiently enhancing the effectiveness of continuous testing in Iterative and Incremental Development (IID). However, the TCQM model did not define the characteristics based on practitioners' perspectives, although such perspectives are very important for determining TCQ [31]. In addition, the TCQM model adopted the Linear Combination Model (LCM), which does not define the measurement goal. Defining the measurement goals is important to guide the practitioners in organizations to derive metrics for each characteristic [69].
Table 1 summarizes the existing studies related to TCQ. From Table 1, it is clear that most studies presented TCQ from a traditional approach. Four studies [15], [23], [41], [49] tried to improve the testing quality in the ASD method. Nevertheless, the majority did not define sub-characteristics; only one study, by Lai [15], proposed characteristics, sub-characteristics, and metrics. Hence, there is a lack of empirical studies of TCQ in the most widely used methodology of this era, i.e., Agile.
It is also noted that many studies depended on practitioners' opinions in defining TCQ and identifying the related characteristics. Based on Kochhar et al. [21] and Tran et al. [31], practitioners have deeper insight and experience in improving TCQ. However, the study that clearly defined a quality model of TCQ in ASD (i.e., [15]) did not identify the quality characteristics of TCs based on practitioners' perceptions. Thus, it is important to study TCQ from the Agile team's perceptions. The following section describes the characteristics used to assess TCQ in the literature.

B. TEST CASE QUALITY CHARACTERISTICS
The way to produce good software is to create quality TCs, leading to enhanced testing activities [15], [32]. For strengthening and measuring TCQ, the main quality characteristics and the proper metrics of the TC need to be properly defined [56], [70], [71]. However, choosing the right metrics is not always straightforward [72], [73]. The discussion above reveals a weakness in clearly defining the TCQ characteristics, as previous studies have shortcomings in defining measures of TCQ. Therefore, it is necessary to identify the quality characteristics of TCs and define the metrics for measuring them based on the goal of measurement, as Pfleeger [74] suggested. Table 2 shows the identified characteristics and sub-characteristics from previous TCQ models.
Despite the studies mentioned above, only one study has been identified that developed a TCQ model in ASD, namely that of Lai [15]. But this model has some limitations, as mentioned before, so our study will develop a TCQ model for Agile projects by enhancing the Lai model. Therefore, the critical characteristics/sub-characteristics identified from several previous studies are selected together with those used in the Lai model, as illustrated in Table 3.
Table 3 shows the characteristics and sub-characteristics of TCQ selected from the Lai model because this model is a benchmark for this study. Eleven characteristics/sub-characteristics are displayed in the third column of Table 3, identified from many previous related studies, and selected for inclusion in the proposed GOTCA model.

C. TEST CASE METRICS
In ASD, using metrics helps track the project progress, monitor the quality aspect of the project, help the team forecast, and manage the project better [79], [80], [81]. The metrics help clarify the characteristics and sub-characteristics [82], [83].
A few metrics and measurements are defined for ASD methods, and some are easy to use and useful [84]. Although few ASD measurement efforts exist, achieving a comprehensive measurement method in all activities and processes is difficult [84]. According to [84], only the Common Software Measurement International Consortium (COSMIC) published a measurement guideline for ASD methods, and in this guideline, only software size is estimated.
The existing studies discuss the use of metrics in ASD, such as understanding development performance and product quality [80], [85], planning and tracking software development, fixing problems of process, and motivating people [80], ASD process [79], identifying the rationale and the operational challenges associated with metrics [86], an automated TC of unit testing [87], measuring and monitoring the testing process in Scrum [41], measuring and predicting the developed quantitative planning software in Scrum [81], tracking the progress of Scrum [88], testability of software requirements [89], and requirements-associated metrics [90]. As shown, many studies have been conducted on metrics; however, studies conducted on TCQ metrics in ASD are limited.
Recent studies conducted on software testing, like Tran et al. [50], defined 78 metrics for TC specification using the GQM approach. Their metrics are defined for the automotive-specific test case specifications model; thus, they may not be suitable for TCs in other domains. Another study by Nidagundi and Novickis [91] identified 28 metrics from the testability dataset on static tests and code. Among these metrics, three are related to testing quality: line coverage, branch coverage, and mutation score. The study aimed to evaluate the testability of Java object-oriented classes. However, the study seems to focus on white-box testing. Moreover, both studies, [50] and [91], did not focus clearly on the ASD environment.
Table 4 summarizes the studies related to metrics in ASD. Despite prior efforts, many studies focus on measuring the flow of the ASD process. Even those that solely focus on measuring the testing process in the context of ASD limit their examination to project tracking and monitoring. Table 4 shows that very few studies focus on TCQ in ASD, such as [92]. However, Adlemo et al. [92] used only three metrics, which are insufficient to measure TCQ, and their study estimated the quality of white-box TCs, which focuses on the source code. Therefore, Table 5 summarizes a set of objective and subjective metrics identified from the literature. These metrics need to be verified to determine whether they are suitable for measuring black-box TCs in an ASD environment.
VOLUME 11, 2023 106991 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Table 5 shows the identified TCQ metrics. As shown, there are 30 objective and 26 subjective metrics. All these metrics need to be verified to determine if they are suitable for measuring the identified characteristics and sub-characteristics. The characteristics, sub-characteristics, and metrics are used to construct the proposed model.

III. PROPOSED GOTCA MODEL
The GOTCA model has been built to address the shortcomings of the prior TCQ models by integrating the ASD requirements needed to build high-quality software in today's business context as the reference standard. The GOTCA model includes six characteristics to measure the quality of black-box TCs in ASD. These characteristics are measured by twenty-two sub-characteristics and fifty-six metrics, of which thirty are objective and twenty-six are subjective. These elements are identified in the previous section. The elements of the proposed model need to be verified to examine their relevancy to each other and to black-box TCs in ASD. The following section illustrates how the proposed model is verified.

IV. EXPERT REVIEW PROCESS
Verification checks that the proposed model is good enough for its intended use [120]. The purpose of verification is to check that the characteristics, sub-characteristics, and metrics of TCQ in the GOTCA model are correct based on the determined verification criteria: comprehensiveness and being well-organized [121], [122], [123], [124], [125], relevancy [80], [125], and consistency [126], [127]. This verification was accomplished through expert review. The expert review method can be conducted in a short time and at low cost, and it is widely accepted as a significant way to detect defects [128], [129]. The expert review is an important way to identify the usefulness and applicability of the research [130]. Consequently, a series of activities was itemized to guide the verification stage. Figure 1 shows the flow of the verification process.
The activities conducted during this verification stage were:
1) The authors sent emails to the specified experts, including an overview of the study and a verification form of the GOTCA model.
2) The experts reviewed the GOTCA model and had the opportunity to ask questions for further clarification.
3) The experts gave feedback by filling out the verification form, which is an online Google Form.
4) The authors updated the model based on the experts' feedback, comments, and suggestions.

A. FIND RELEVANT EXPERTS (IDENTIFY EXPERTS)
A purposive sampling method was used to verify the proposed model, and two types of experts were selected to provide their insights. This method is based on the researcher's judgment in identifying a limited number of experts in academia and industry who are available and willing to provide the best and most specialized information to meet the study objectives [131]. The knowledge experts were identified through ResearchGate, LinkedIn, university websites, and Google Scholar, whereas industry experts were reached via social media such as Facebook and WhatsApp. The experts met the selection criteria and participated in verifying the model. Knowledge (academic) experts were determined based on the following criteria, which were suggested by [121], [130], [132], and [133]: (1) currently lecturing in the field of the study, (2) holding an advanced degree such as a PhD in software engineering or software testing, (3) being a faculty member at an accredited university, (4) having authored books or academic materials related to software testing or Agile testing, and (5) having at least five years of experience in software testing and ASD. Domain (industry) experts were chosen based on criteria adapted from [31]: (1) being Agile testing practitioners (QA team members, testers, and/or developers), and (2) having experience in software testing quality for at least one year.
Based on the expert criteria, ten experts from the academic domain and six from the industry were chosen. In total, 27 academic experts received email invitations to become experts in the study. The participants had two months to complete the verification form and provide feedback. Participants were reminded every week or two via email to provide their feedback. Thirteen accepted the offer and agreed to verify the GOTCA model, but only ten experts did so, which is a sufficient number [134], [135]. One of the ten experts met with the researcher via an online meeting to discuss some of the verification form's questions. Three people failed to complete the verification form. The remaining 14 of the 27 experts declined to participate, citing their hectic schedules.
On the other hand, the verification form was sent to many industry experts via WhatsApp, LinkedIn, and email, but only six accepted the invitation and verified the GOTCA model. Therefore, the total number of experts who verified the GOTCA model was 16 from academic and industry areas, as shown in Table 6.
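The recruitment figures reported above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are assumptions introduced here, and the completion rate is derived from the reported counts rather than stated in the study.

```python
# Counts as reported in the text; variable names are illustrative assumptions.
invited_academics = 27
declined_academics = 14
accepted_academics = 13
completed_academics = 10
incomplete_academics = 3
industry_experts = 6

# Internal consistency of the reported recruitment funnel.
assert declined_academics + accepted_academics == invited_academics
assert completed_academics + incomplete_academics == accepted_academics

# Total number of experts who verified the model.
total_experts = completed_academics + industry_experts

# Derived figure (not stated in the study): share of invited academics who completed the form.
academic_completion_rate = completed_academics / invited_academics

print(total_experts, round(academic_completion_rate, 2))
```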
Except for one expert, all of the experts in Table 6 have more than five years of academic experience. Furthermore, all have five or more years of software testing experience, except two participants with fewer than five years of software testing experience. Experts offered their perspectives on the characteristics, sub-characteristics, and metrics. The outcomes of their verification are detailed in the following sections.

B. VERIFICATION FORM
The verification form consisted of five main sections. The first section contained respondent information. A five-point Likert scale is used in the verification form to check the consistency and relevancy of the characteristics, sub-characteristics, metrics, and measures of the subjective metrics to the black-box TCs in ASD in Sections two, three, four, and five, respectively.
After developing the verification form, a draft of the questions was reviewed by four other experts to establish face validity, that is, to determine how well the questions fulfilled the purpose and to ensure that they did not contain any ambiguity.

C. SEND REQUEST PERMISSION AND VERIFICATION FORM
After developing the model and the verification form and identifying the experts, invitation emails were sent to the experts to seek their acceptance and arrange the review. All experts who accepted the invitation to be reviewers and agreed to give their feedback were sent an online verification form (Google Form) and details of the GOTCA model by email. According to Olson [135], the number of experts ranges from two or three to over 20; thus, a sample of sixteen experts was used.

D. RECEIVE AND ANALYZE REVIEWS
The data collected from experts were analyzed using descriptive statistics to describe the opinions of respondents about the characteristics, sub-characteristics, and metrics for measuring TCQ in the ASD environment. The experts examined them to determine whether they were consistent and relevant to each other and to the quality of black-box TCs in ASD. A five-point Likert scale was used for this purpose, ranging from (1) strongly disagree to (5) strongly agree. The mean value was used for analysis, computed with the Statistical Package for the Social Sciences (SPSS) tool, version 26. The results were calculated by obtaining the mean score for each item and selecting the appropriate interval representing the actual mean. An appropriate interval scale was required to represent all levels of achievement. Table 7 shows the mean interval presentation and the achievement level adapted from Tegeh and Kirna [136]. The ranges of the interval are calculated by using the following Equation (1) [121], [137]:
$$\text{Interval} = \frac{x - 1}{x} \tag{1}$$
where x is the highest value on the scale, which is 5; thus, the interval value is (5 - 1)/5 = 0.80, as depicted in Table 7.
We propose that a result falling in the lowest three achievement levels is rejected, while a result in the highest two achievement levels is accepted.
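The acceptance rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SPSS procedure used in the study: the bands are assumed to start at 1.00 and to be 0.80 wide, only the labels "Good" and "Excellent" come from the text, the labels of the three lowest levels are placeholders because Table 7 is not reproduced here, and the sample ratings are invented.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 5
# Interval width (x - 1) / x with x = 5, i.e. the 0.80 reported in the text.
INTERVAL = (SCALE_MAX - SCALE_MIN) / SCALE_MAX

# Only "Good" and "Excellent" are named in the text; the other labels are placeholders.
LEVELS = [
    "Level 1 (placeholder)",
    "Level 2 (placeholder)",
    "Level 3 (placeholder)",
    "Good",
    "Excellent",
]
ACCEPTED_LEVELS = {"Good", "Excellent"}


def achievement_level(score: float) -> str:
    """Map a mean Likert score to its achievement band (assumed band edges)."""
    if not SCALE_MIN <= score <= SCALE_MAX:
        raise ValueError("score must lie between 1 and 5")
    for k, label in enumerate(LEVELS, start=1):
        upper = SCALE_MIN + k * INTERVAL
        if score <= upper + 1e-9:  # small tolerance for floating-point band edges
            return label
    return LEVELS[-1]


def is_accepted(score: float) -> bool:
    """An item is accepted when it falls in one of the two highest levels."""
    return achievement_level(score) in ACCEPTED_LEVELS


# Hypothetical ratings from sixteen experts for one item (not data from the study).
ratings = [5] * 9 + [4] * 7
item_mean = mean(ratings)
print(round(item_mean, 2), achievement_level(item_mean), is_accepted(item_mean))
```

Under these assumed band edges, a mean of 3.81 falls in the Good band and a mean of 4.25 in the Excellent band, which agrees with how those values are described in the results.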

V. RESULTS AND DISCUSSION
To achieve this study's objective, which is to define the characteristics, sub-characteristics, and metrics of TCQ in the ASD environment, the experts examined the identified elements to know whether they are relevant to black-box TCQ in ASD.

A. RESULTS OF CHARACTERISTICS AND SUB-CHARACTERISTICS
The experts were asked to verify the identified quality characteristics and related sub-characteristics to determine whether they are consistent and relevant to each other and to the quality of black-box TCs in ASD. Table 8 shows the opinions of experts about the selected characteristics.
According to the means in Table 8, the experts accepted all these characteristics, with mean values ranging from 4.13 to 4.56. Hence, the six presented characteristics are relevant and consistent with black-box TCQ in Agile testing. The effectiveness of TCs is the most important (4.56) compared with the other characteristics, followed by documentation and reusability of TCs (4.5). The quality of TC management gained 4.44 from the experts, whereas efficiency ranked after management quality with a mean value of 4.31. All these five characteristics are Excellent based on the experts' perspectives. The last characteristic that affects TCQ is the TC overall satisfaction (4.13), but it is still accepted. As a result, the proposed GOTCA model includes all these characteristics.
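As an illustration of the ranking described above, the short Python sketch below sorts the six characteristics by their reported mean values and labels each one. The mean values are taken from the text; the two thresholds (above 4.20 for Excellent, above 3.40 for Good) are assumptions derived from the 0.80 interval width, and the function and variable names are hypothetical.

```python
# Mean values as reported in the text for the six characteristics.
reported_means = {
    "effectiveness": 4.56,
    "documentation": 4.50,
    "reusability": 4.50,
    "management quality": 4.44,
    "efficiency": 4.31,
    "overall satisfaction": 4.13,
}

# Assumed lower bounds of the two accepted bands (0.80-wide bands starting at 1.00).
EXCELLENT_FLOOR = 4.20
GOOD_FLOOR = 3.40


def level(score: float) -> str:
    """Label a mean score; anything at or below the Good floor is rejected."""
    if score > EXCELLENT_FLOOR:
        return "Excellent"
    if score > GOOD_FLOOR:
        return "Good"
    return "Rejected"


# Sort from highest to lowest mean; ties keep their original order.
ranked = sorted(reported_means.items(), key=lambda item: item[1], reverse=True)
labels = {name: level(score) for name, score in ranked}

for name, score in ranked:
    print(f"{name}: {score:.2f} ({labels[name]})")
```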
Measurements for these characteristics are presented using 22 sub-characteristics. The experts examined the proposed sub-characteristics to see if they were consistent and relevant to the linked characteristics. The experts' feedback is shown in Tables 9 to 14.
Table 9 shows that all the sub-characteristics of "documentation quality" are accepted to be included in the GOTCA model. Although the experts did not fully agree about "specific," it still reaches a Good achievement level and is acceptable because its mean value is 3.81. According to the mean values, the sub-characteristics can be ordered as correctness, readability, understandability, completeness (Excellent), editability, consistency, and specific or atomic (Good). This indicates that correctness is the most important sub-characteristic for the documentation quality of a TC.
Similarly, the proposed sub-characteristics of "management quality" were all acceptable. The mean values are more than 4.00, and all are at the Excellent level, which means all the experts agreed that they should be included in the GOTCA model. The results of these sub-characteristics are shown in Table 10.
The experts stressed that TC simplicity is the most important sub-characteristic; its mean value is 4.63.On the other hand, the lowest score was given to ''recoverability'' and ''traceability,'' which had a mean value of 4.31.However, 106996 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.they are shown at the highest achievement level (i.e., Excellent).
For reusability, the experts verified four proposed sub-characteristics. Their feedback is presented in Table 11.
The experts agreed most strongly on the independence of TCs compared with the other three sub-characteristics. Nevertheless, the mean values of those three sub-characteristics range from 3.63 to 3.88, so they are at the Good level and are included in the GOTCA model.
Another defined goal of TCQ is "efficiency," for which three sub-characteristics were defined. Table 12 presents the experts' opinions on them.
The three proposed sub-characteristics are consistent with and relevant to "efficiency." Time behavior and resource utilization each received a mean value of 4.06, and "bug detection" received an Excellent mean value of 4.25. Although the experts' opinions ranged between strongly agree (5) and strongly disagree (1), the results indicate that these sub-characteristics are included in the GOTCA model.
The sub-characteristics of "effectiveness" are also accepted by the experts, as shown in Table 13. The experts' opinions did not differ; both sub-characteristics (accuracy and repeatability) achieved an Excellent level, with mean values of 4.38 and 4.25, respectively. Consequently, these sub-characteristics are included in the proposed GOTCA model.
In addition, to assess satisfaction with all these characteristics, "overall satisfaction of test case quality" is used and verified through its proposed sub-characteristic, "gain satisfaction." The results show that the experts are satisfied with the proposed TC characteristics, as indicated by their feedback on "gain satisfaction" in Table 14. They agreed that "gain satisfaction" can be used to measure the "overall satisfaction of the quality of test cases," where the average of all experts' feedback is 4.
In conclusion, the results show that all six characteristics and 22 sub-characteristics are accepted (Good and Excellent) for inclusion in the proposed model. The next section presents the results of verifying the identified metrics.

B. RESULTS OF METRICS
Because all sub-characteristics of quality TCs are subjective measurements of their linked characteristics, metrics are required to measure them and complete the proposed GOTCA model. In total, 30 objective metrics and 26 subjective metrics were identified. The 16 experts verified the metrics to examine whether they are relevant and consistent for measuring the linked sub-characteristics. The expert feedback is summarized in Table 15.
As shown in Table 15, some metrics measure more than one sub-characteristic, but the number of distinct metrics is 56. The experts approved the metrics that quantify the sub-characteristics and verified that they are relevant and consistent for measuring black-box TCs in Agile projects. The mean values range from the Good to the Excellent achievement level (3.63 to 4.31), so the experts accepted all the defined metrics.
Overall, the verification of the proposed model is satisfactory. According to the experts, the characteristics, sub-characteristics, and metrics are comprehensive, relevant, consistent, and well-organized. Only one expert (Ex10) disagreed, stating that the proposed model is incomprehensible and poorly arranged. Meanwhile, Ex1 noted that the proposed model is well-liked, and Ex2 declared that the model is an innovation related to TCs. The overall verification of the proposed model according to the experts' opinions is presented in Figure 2.
All experts agreed that the proposed model elements are relevant, as indicated in Figure 2. Half of the experts (50%) strongly agreed that the developed model elements are relevant, and the other half rated them as Agree. Regarding the consistency of the model elements, 93.75% of the experts agreed that the characteristics, sub-characteristics, and metrics were consistent. Ratings for the other criteria (i.e., comprehensiveness, understandability, and organization) range from strongly agree to neutral or disagree, with only a few experts rating neutral or disagree. For example, the experts' opinions on the understandability of the model elements differ: 81.25% agreed that the model is understandable, whereas two experts (12.5%) were unsure and one (6.25%) disagreed.
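Because the panel has 16 members, each reported percentage corresponds to a whole number of experts. The short sketch below illustrates the conversion; the head counts of 13, 8, and 15 are inferred from the reported percentages rather than stated in the text, and the helper name `share` is illustrative.

```python
N_EXPERTS = 16  # ten academic and six industry experts


def share(count: int, n: int = N_EXPERTS) -> float:
    """Percentage of the panel represented by `count` experts."""
    return 100 * count / n


# Understandability: "two" and "one" are stated in the text;
# 13 is inferred from the reported 81.25%.
understandability = {"agree": 13, "neutral": 2, "disagree": 1}

# The three groups must account for the whole panel.
assert sum(understandability.values()) == N_EXPERTS

for label, count in understandability.items():
    print(f"{label}: {count}/{N_EXPERTS} = {share(count)}%")

# Inferred head counts for the other reported percentages.
print("relevance, strongly agree:", share(8))
print("consistency, agreement:", share(15))
```

With these counts, the sketch reproduces 81.25%, 12.5%, and 6.25% for understandability, 50% for strong agreement on relevance, and 93.75% for consistency.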
Based on the results in Figure 2, the proposed model is comprehensive, understandable, relevant, consistent, and well-organized.
The study contributes to software engineering, specifically software testing, in which the proposed model can be used as a reference to conduct research on ASD testing quality.From the theoretical perspective, the model can derive the characteristics, sub-characteristics, and metrics and offer empirical evidence on the significance of these elements on the TCQ level in the software testing industries, thus enriching the existing literature.
In addition, this study will improve ASD testing indirectly by designing effective TCs that can overcome testing problems from continuously changing requirements.The findings of this study can be used as guidance in future research related to quality testing in ASD.The body of knowledge in the ASD testing field in design and evaluation will benefit from the GOTCA model.

D. THE REVISED MODEL
The GOTCA model elements were verified based on comprehensiveness, relevancy, consistency, and organization. From the experts' perspectives, all elements are consistent with and relevant to black-box TCs in ASD. The characteristics and sub-characteristics of the GOTCA model did not change after the experts' feedback. The set of metrics did not change after verification either; only the type of some metrics changed. The revised model of TCQ in the ASD environment comprises six characteristics with 22 sub-characteristics measured by 56 metrics. Six objective metrics related to time measurements (M19, M22, M23, M25, M30, and M31) were changed to subjective metrics in the revised model because they are difficult to measure quantitatively. Therefore, the revised model comprises 24 objective metrics and 32 subjective metrics.
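The change in metric types can be checked with simple bookkeeping. The sketch below is illustrative only: the class `MetricCounts` and its method are hypothetical names, while the counts and metric identifiers are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCounts:
    """Number of objective and subjective metrics in a model version."""

    objective: int
    subjective: int

    @property
    def total(self) -> int:
        return self.objective + self.subjective

    def reclassify_to_subjective(self, metric_ids: list[str]) -> "MetricCounts":
        """Move the given objective metrics to the subjective group."""
        moved = len(set(metric_ids))
        if moved > self.objective:
            raise ValueError("cannot move more metrics than are objective")
        return MetricCounts(self.objective - moved, self.subjective + moved)


# Counts before expert verification, as reported in the metrics results.
proposed = MetricCounts(objective=30, subjective=26)

# Time-related metrics that became subjective in the revised model.
time_related = ["M19", "M22", "M23", "M25", "M30", "M31"]

revised = proposed.reclassify_to_subjective(time_related)
print(revised, "total:", revised.total)
```

Moving the six time-related metrics leaves the total unchanged at 56 while shifting the split from 30 objective and 26 subjective to 24 objective and 32 subjective, which matches the counts reported for the revised model.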
The values of these six transferred metrics would naturally be expressed as time. Because it is difficult to express a time value as a percentage, which is needed for consistency with the other metric values when determining the quality level of TCs, these metrics are instead measured subjectively using the related Agile testing practices. Consequently, their values will be scored based on the practitioner's perspective. In addition, Ex16 suggested adding one significant practice to measure understandability. Thus, the total number of practices that measure the subjective metrics is 106. Figure 3 shows the revised model.

VI. CONCLUSION AND FUTURE WORK
This study identified the characteristics, sub-characteristics, and metrics that could improve the TCQ and enhance software testing quality in an ASD environment. We conducted the verification stage using an expert review method involving sixteen knowledge and domain experts. The experts reviewed the proposed model elements and verified whether they are suitable for measuring the quality of black-box TCs in ASD. In addition, they verified the overall proposed model based on four criteria: comprehensiveness, consistency, relevancy, and organization.
However, the model still needs to be implemented in real-life settings and extended to cover white-box TCs.
Currently, the authors are working on in-depth validation of the proposed model to test its feasibility and applicability in real life.Future work will be extending the model to cover white-box TCs.

TABLE 7.
Representations of mean interval values with achievement levels.

TABLE 8.
Quality characteristics of TCQ.

TABLE 9.
Sub-characteristics of documentation quality.

TABLE 10.
Sub-characteristics of management quality.

TABLE 12.
Sub-characteristics of efficiency.

TABLE 14.
Sub-characteristics of overall satisfaction of test cases.

TABLE 15.
Results of metrics.