Development of Testability Prediction Models Considering Complexity Diversity for C Programs

Testability prediction can help developers identify software components that require significant effort to ensure software quality, plan test activities, and recognize the need for refactoring to reduce the test effort. Previous studies have predicted code coverage as a measure of testability based on software metrics. However, these studies have primarily used object-oriented software with simple code structures. Industrial software developed using C is often more complex than the object-oriented software used in these studies. Models trained primarily on low-complexity training data may be insufficiently trained to predict the testability of high-complexity software. In this study, we developed a testability prediction model for C programs by considering the complexity diversity. We analyzed the impact of the complexity of the training/test data on the testability prediction model for C programs. The results showed that the model with the best performance achieves an MAE of 7.436 and an $R^2$ of 0.813. Moreover, the results demonstrated that as the complexity diversity of the training data decreased, MAE increased from 5.203 to 6.361, and $R^2$ decreased from 0.809 to 0.725. Furthermore, the performance of the model trained with low complexity-diversity data deteriorated as the complexity level of the test data increased, with MAE increasing from 3.498 to 6.631, and $R^2$ decreasing from 0.841 to 0.687. Additionally, the difference in complexity between the training and test data was strongly correlated with model performance, with correlation coefficients of 0.898 for MAE and -0.848 for $R^2$.


I. INTRODUCTION
Testing is essential to ensuring the quality of software during the software development process. Testing is a costly activity in the software development industry, accounting for approximately 50% of total software development costs [1]. In other words, improving the effectiveness and efficiency of testing can lower software development costs. Testability is a measure of testing effectiveness and efficiency [2]. High testability means that the software can be tested effectively and efficiently.
Testability prediction can help developers identify software components that require significant effort to ensure software quality, plan test activities, and recognize the need for refactoring to reduce the test effort.
The associate editor coordinating the review of this manuscript and approving it for publication was Claudia Raibulet.
Some testability prediction studies have used testing information, such as test case metrics or testing time, as measures of testability in terms of test effort [3], [4], [5], [6], [7], [8], [9], [10]. However, accurately predicting testability based on testing information is challenging because testing depends on the capabilities of the test team or individual testers.
Recent testability prediction studies use code coverage as a measure of testability in terms of test effectiveness [11], [12], [13]. Code coverage is a representative indicator of test effectiveness and is a test criterion required by the ISO 26262 [14], IEC 60601 [15], EN 50129 [16], IEC 61508 [17], and DO-178C [18] industry standards.
However, the upper limits on complexity metrics in C/C++ industry standards such as MISRA [20], JPL [21], and JSF [22] indicate that industrial C/C++ software may contain a substantially larger proportion of high-complexity source code than the software used in previous studies. In GCC 13.1.0, a representative C compiler, 7.9% of the functions exceeded a cyclomatic complexity (CC) of eight or a number of structuring levels (SL) of four. Furthermore, in our analysis of large-scale C source code consisting of 6,873 functions from a real-world automotive industry domain, we found that 9.6% of the functions exceeded a CC of eight or an SL of four.
If the proportion of high-complexity training data is low, training for the testability of high-complexity software may be insufficient. In other words, to enhance the testability-prediction performance for high-complexity software, it is necessary to use high-complexity training data for model construction. Low-complexity functions, with fewer decision statements and lower nesting levels, have relatively simple conditions. This simplicity makes it easier to achieve high code coverage. However, the conditions in high-complexity functions are relatively complex, making it difficult to achieve high code coverage. Therefore, predictions for not only low-complexity functions but also high-complexity functions are important.
We developed a testability prediction model for C programs using training data with various complexities. The C language is widely used in embedded systems because it offers efficient memory management and hardware access. It is used on popular microcontroller platforms such as the AVR and ARM Cortex-M, as well as in automotive systems. Moreover, many legacy systems have been written in C, and these systems still require maintenance and testing. The testability prediction model for C programs can be useful to test engineers who want to develop testability models and to developers who aim to evaluate the quality of C programs.
We then analyzed the differences in prediction performance based on complexity diversity, which refers to the diversity in program complexity. First, we confirmed that a model with excellent prediction performance can be constructed using training data that cover various complexity conditions. In addition, we analyzed whether the prediction performance of the model was affected as the complexity of the training data became biased and diversity decreased. We also examined whether prediction performance decreased when high-complexity test data were input into a model trained with data that exhibited low complexity-diversity. Finally, we verified whether the difference in complexity levels between the test and training data affected the prediction performance of the model. Analysis of the impact of complexity diversity on testability prediction can be useful for test engineers who want to improve the performance of testability prediction models.
We developed a methodology and conducted experiments to address these research questions.
• RQ1: Does a model that considers complexity diversity for C programs exhibit high prediction performance? Which model exhibited the best performance?
• RQ2: Do training data with low complexity-diversity degrade testability prediction performance? Does a lower complexity diversity of the training data lower the prediction performance?
• RQ3: Does the model trained with low complexity-diversity data present lower prediction performance depending on the complexity level of the test data? Does a higher complexity level in the test data result in lower prediction performance?
• RQ4: Does a larger complexity-level difference between the training and test data result in a greater performance difference? Is there a correlation between the complexity level difference in the training and test data and the prediction performance?
To answer these questions, we developed a testability prediction model based on highly diverse training data. In addition, to analyze the impact of the complexity diversity of the training data on the performance of the model, we constructed models based on training data with limited complexity diversity.
To obtain high-complexity-diversity data, we determined the maximum/minimum value of the metric based on the maximum allowed complexity metrics of the MISRA [20], JPL [21], and JSF [22] C/C++ industry standards and automatically generated software under test (SUT) that satisfied various metric combinations. In addition, test data were generated and branch coverage was measured using search-based test data generation, which is a representative test data generation method that effectively generates test data.
A testability prediction model was constructed using a regression analysis based on the training data of various metric combinations. Metrics applicable to C programs affecting the search and solution spaces were selected and used as independent variables for training. When the structure of the code becomes complicated, the conditions required to achieve coverage also become complicated, thus reducing the solution space and making it difficult to find test data that improve the coverage. Branch coverage, a testability measure of test effectiveness, was used as the dependent variable.
The main contributions of this study are as follows:
• Unlike previous studies that have primarily focused on object-oriented software, we develop a high-performance testability prediction model specifically for C programs, taking into account complexity diversity.
• We empirically show that low complexity-diversity in training data can reduce testability prediction performance. This demonstrates the importance of complexity diversity in training data for testability prediction, a problem not thoroughly examined in previous studies.
• We explore the relationship between training and test data complexity, showing that testability prediction performance decreases with less complex training data and more complex test data. This finding highlights the need for high complexity-diversity training data to improve testability prediction performance for complex C programs.
The structure of the remainder of this paper is as follows: Section II describes the related work. Section III describes the methodology used to construct a testability prediction model with a high complexity-diversity for C programs. Section IV analyzes the experiments for each research question. Section V discusses the results and limitations. Finally, Section VI concludes the study.

II. RELATED WORK
Several standards have defined testability, none of which specifies concrete measurement methods, leading to the study of various approaches. According to Garousi et al. [23], there are more than 30 definitions of testability. Representative definitions of testability are as follows:
• ISO 9126 [24]: "attributes of software that bear on the effort needed to validate the software product."
• ISO 25010 [2]: "degree of effectiveness and efficiency with which test criteria can be established for a system, product or component and tests can be performed to determine whether those criteria have been met."
Existing studies on metric-based testability prediction have focused mainly on testability in terms of test effort, as defined by ISO 9126, and test effectiveness, as defined by ISO 25010. Studies have been conducted on testing efforts using test information [3], [4], [5], [6], [7], [8], [9], [10] and on testing effectiveness using code coverage [11], [12], [13].
Gupta et al. [3] proposed a testability index defined using a fuzzy approach to CC and object-oriented metrics for 25 Java classes. They divided the values into preferred, acceptable, and not acceptable for each metric and built a fuzzy model to define the testability index. In their analysis of the correlation between the testability index and unit testing time, they found that the testability index was strongly correlated with testability. However, generalizing the results is challenging because the testing time depends on the capabilities of the tester and SUT features. Bruntink and Deursen [4] analyzed the correlation between object-oriented metrics and testability using the lines of code for test classes and the number of test cases as measures of testability in five Java systems.
Singh and Saha [5] and Badri and Toure [6] constructed prediction models for object-oriented metrics and testability using the lines of code for test classes and the number of assertions. Singh and Saha [5] used linear regression and regression trees on three open-source Java systems, and Badri and Toure [6] used logistic regression analysis on three open-source Java systems.
Toure and Badri [7] built a testability prediction model based on LOC and object-oriented metrics using four regression algorithms on ten open-source Java systems. They classified a class as tested or untested as a measure of testability, assuming that the classes selected by the testers required test effort. They predicted whether a class needed to be tested on the basis of these metrics.
Albattah [8] built a testability prediction model using package-level cohesion metrics and logistic regression on five open-source Java systems. They measured testability by the lines of code for test classes.
Terragni et al. [9] conducted a correlation analysis between 28 object-oriented metrics and testability on open-source Java systems, normalizing test effort by test quality. Six test-case metrics were used as test effort, while statement coverage, branch coverage, and mutation score were used as test quality. They showed that normalizing test effort by test quality increases the correlation between object-oriented metrics and test effort.
Testing depends on the capabilities of the test team or the individual testers. Therefore, it is difficult to predict the test-case metrics or testing times using only these metrics. According to Bajeh [10], there is a significant relationship between the metrics and test-case metrics, but the magnitude of the relationship is low. This implies that the metrics alone do not accurately measure the task of developing test cases.
Wang et al. [25] proposed a state-based testability model for testing systems with multi-state characteristics. They used fault detection rate, fault isolation rate, and state detection rate as measures of testability. However, the performance of the model can vary depending on the domain knowledge and capabilities of the test engineer constructing the state-based testability model.
Recent studies have used code coverage as a measure of testability in terms of test effectiveness. Grano et al. [11] constructed a model that predicts branch coverage from package, object-oriented, CK, and Halstead metrics by using four regression algorithms on seven open-source Java systems. They measured branch coverage using search-based test data generation with GA and random approaches.
Zakeri-Nasrabadi and Parsa [12], [13] constructed a testability prediction model based on object-oriented metrics for 110 open-source Java systems. In [12], five regression algorithms were used to build a prediction model using the product of the average branch coverage, statement coverage, and minimum test case ratio to improve the coverage as a measure of testability. In [13], seven regression algorithms were used to build a prediction model using the average branch coverage and statement coverage divided by the average time to improve coverage as a measure of testability.
These studies were conducted on object-oriented systems with a high proportion of simple methods without analyzing the diversity of complexity in the training data. According to an analysis using Understand [26], a commercial metric analysis tool, the proportions of non-abstract methods (excluding the test suite) with a CC of eight or less and an SL of four or less in the systems used by Grano and Zakeri-Nasrabadi were both 97.1%. Furthermore, a methodology for constructing a testability prediction model using software metrics for C programs has not yet been established, and there has been no detailed analysis of the impact of complexity on testability prediction.

III. METHODOLOGY
This section describes the construction of a testability prediction model for C programs based on high-complexity-diversity training data. It describes a method for generating software under test (SUT) to obtain training data with high complexity-diversity, a method for generating test data to measure coverage, a metric collection used as an independent variable, and a method for predicting testability using regression analysis. Fig. 1 illustrates the procedure for constructing the testability prediction model.
We generated SUTs with high complexity-diversity implemented in C using the metric maximum/minimum range determined using the upper limit presented in C/C++ industry standards. We performed search-based test data generation on the generated SUTs for the branch coverage measurements. We analyzed the metrics of the SUTs and collected those to be used as independent variables in the construction of a testability prediction model. We built a testability prediction model using the measured branch coverage as a dependent variable and the collected metrics as independent variables, and predicted testability using the constructed model.

A. SUT GENERATION
SUTs were automatically generated to obtain training data with high complexity-diversity. Complexity and size metrics that can affect the performance of search-based test data generation, such as cyclomatic complexity (CC), number of structuring levels (SL), number of equality operators (NOEO), and number of parameters (NOP), were selected as the criteria for SUT generation. The higher the CC, the higher the number of branches that must be executed to increase branch coverage. As SL increases, the maximum nesting depth of the decision statement also increases, complicating the conditions that must be satisfied to execute the decision statement. The larger the NOEO, the more difficult it is to determine a solution that satisfies the branch conditions. This is because the proportion of solutions that satisfy the equality comparison in the domain space is limited. Therefore, searching for a solution to a condition that contains an equality operator is more difficult than searching for a solution to a condition that contains a relational operator. As NOP increases, the domain space that must be explored for test data generation becomes larger.
The range of each metric value is determined based on the upper limit presented in the MISRA [20], JPL [21], and JSF [22] C/C++ industry standards. Table 1 lists the ranges of the metric values used to generate the SUTs.
The standards provide only the upper limits of the metrics and not the minimum values. In this study, the minimum values were established by assuming at least one decision statement and two or more parameters. Because the nesting depth cannot exceed the number of decision statements, we used the smaller of CC - 1 and 6 (six being the upper limit in the standards) as the maximum value for SL. As the standards do not reference NOEO, we assumed that equality operators would be used less frequently than decision statements and used CC - 1 as its maximum value. Based on these constraints, 3,320 feasible metric combinations were identified.
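The constraints described above can be sketched as a small enumeration. This is only an illustration: the bounds `CC_MAX`, `NOP_MAX`, `SL_MIN`, and `NOEO_MIN` are hypothetical placeholders rather than the values in Table 1, so the resulting count does not reproduce the 3,320 combinations reported here.

```python
from itertools import product

# Hypothetical bounds for illustration only; Table 1 holds the actual ranges.
CC_MIN, CC_MAX = 2, 10     # CC >= 2 corresponds to at least one decision statement
NOP_MIN, NOP_MAX = 2, 5    # two or more parameters
SL_CAP = 6                 # upper limit on SL stated in the text
SL_MIN, NOEO_MIN = 1, 0    # assumed minimums, not given in the text


def feasible_combinations():
    """Yield (cc, sl, noeo, nop) tuples that satisfy the stated constraints."""
    for cc, nop in product(range(CC_MIN, CC_MAX + 1), range(NOP_MIN, NOP_MAX + 1)):
        sl_max = min(cc - 1, SL_CAP)  # nesting depth cannot exceed decision count
        noeo_max = cc - 1             # NOEO capped at the number of decisions
        for sl in range(SL_MIN, sl_max + 1):
            for noeo in range(NOEO_MIN, noeo_max + 1):
                yield cc, sl, noeo, nop


combos = list(feasible_combinations())
```

With the actual ranges from Table 1 substituted for the placeholders, the same loop structure would list the metric combinations for which SUTs are generated.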
We generated ten programs for each metric combination, for a total of 33,200 programs. Because automatically generated programs may contain infeasible branches, we used Joggie [27], a tool for detecting infeasible code, to remove these infeasible branches. Joggie assesses feasibility by using the Princess solver [28]. The Princess solver, a tool that checks whether conditions are satisfiable, serves as an essential component of infeasible code detection. The Princess solver is also used in JavaSMT [29], a unifying Java interface for SMT solvers, and Eldarica [30], a predicate abstraction-based model checker.

B. TEST DATA GENERATION
Search-based test data generation is an automatic method that uses search algorithms to explore test data that meet test goals. We generated test data using branch coverage as the test goal and the genetic algorithm (GA), hill climbing (HC), and random (RND) methods as search algorithms. The GA is a global search algorithm that finds values that improve branch coverage by changing values through crossover and mutation operations. Values that satisfy relational comparisons are found quickly; however, more searches are required to identify values that satisfy equality comparisons. HC is a local search method that improves branch coverage by moving from the current value to a neighboring value, that is, a value that differs from it by a relatively small amount. HC tends to find values that satisfy equality comparisons more quickly than the GA. Because RND generates arbitrary values without a separate search technique, its branch coverage is determined by the proportion of the solution space within the domain space. We used the value from [31] for the crossover probability and the values from [32] for the remaining search algorithm parameters. Table 2 lists the search algorithm parameters used to generate test data.
98472 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
For the range of each variable, we used $[-2^{31}, 2^{31} - 1]$, which is the range of values that can be expressed by a 4-byte signed integer variable. The results were derived by repeating test data generation for each program ten times to reduce the randomness of the search-based testing method. According to an existing study [33], search-based testing results require at least ten repetitions to achieve minimal statistical power.
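To illustrate why a guided local search handles equality comparisons better than random generation, the following sketch searches for an input that satisfies `x == target` over the 4-byte integer domain. It is not the implementation used in this study: the branch-distance function, the step-doubling neighborhood, and all function names are assumptions made for the example.

```python
import random

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def branch_distance_eq(x, target):
    """Distance to satisfying `x == target`; zero means the branch is taken."""
    return abs(x - target)


def hill_climb(start, target, max_evals=10_000):
    """Local search: move to a neighbor whenever it lowers the branch distance."""
    x, evals, step = start, 0, 1
    while branch_distance_eq(x, target) != 0 and evals < max_evals:
        improved = False
        for candidate in (x - step, x + step):
            if not INT_MIN <= candidate <= INT_MAX:
                continue
            evals += 1
            if branch_distance_eq(candidate, target) < branch_distance_eq(x, target):
                x, improved = candidate, True
                break
        # Assumed heuristic: double the step while improving, else reset to 1.
        step = step * 2 if improved else 1
    return x, evals


def random_search(target, max_evals=10_000, seed=0):
    """Draw uniform values; each draw succeeds with probability 1 / 2**32."""
    rng = random.Random(seed)
    for evals in range(1, max_evals + 1):
        if rng.randint(INT_MIN, INT_MAX) == target:
            return True, evals
    return False, max_evals


found_x, hc_evals = hill_climb(start=0, target=123_456_789)
```

The hill climber follows the decreasing branch distance and reaches the target within the evaluation budget, whereas a single random draw matches a specific value with probability 1 in 2^32, which is the "proportion of the solution space within the domain space" for an equality condition.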

C. METRIC COLLECTION
We selected complexity metrics that affect the search for test data that improve branch coverage and collected them as the independent variables. All the selected metrics are applicable to C programs. We used CC, SL, NOP, and NOEO, which were also used to generate the SUTs; the number of paths (NPath) [34], a metric that expresses software complexity; and the Halstead complexity metrics [35], namely program length, program vocabulary, volume, and difficulty.
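For reference, the four Halstead metrics named above can be computed from operator and operand counts as in the following sketch, which uses the standard textbook definitions. How a particular tool tokenizes C source into operators and operands is not described here, so the input counts are assumed to be given.

```python
import math


def halstead(n1, n2, N1, N2):
    """Compute Halstead metrics from token counts.

    n1, n2: number of distinct operators and distinct operands.
    N1, N2: total number of operator and operand occurrences.
    """
    length = N1 + N2                        # program length
    vocabulary = n1 + n2                    # program vocabulary
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    return {
        "length": length,
        "vocabulary": vocabulary,
        "volume": volume,
        "difficulty": difficulty,
    }


example = halstead(n1=4, n2=4, N1=10, N2=6)
```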
The CC, SL, NOP, and NPath metrics were measured using Understand [26], a commercial metric measurement tool, and the Halstead complexity metrics were measured using PC Lint Plus [36], another commercial metric measurement tool.NOEO was measured by developing a custom tool to count the number of == and != operators within the function.
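A counter of this kind could look like the following sketch. It is not the custom tool used in this study; it is a regex-based approximation that assumes the function body is available as a string, and the function name `count_noeo` is hypothetical. Comments and string literals are blanked out first so that operators inside them are not counted.

```python
import re

# Matches block comments, line comments, and double-quoted string literals.
_NON_CODE = re.compile(r'/\*.*?\*/|//[^\n]*|"(?:\\.|[^"\\])*"', re.DOTALL)
_EQUALITY_OPS = re.compile(r"==|!=")


def count_noeo(function_source: str) -> int:
    """Count == and != operators in the C source text of one function."""
    code_only = _NON_CODE.sub(" ", function_source)
    return len(_EQUALITY_OPS.findall(code_only))


sample = (
    "int f(int a, int b) { /* a == b */\n"
    "  if (a == b) return 1; // a != b\n"
    "  if (a != 0 && b <= 3) return 2;\n"
    '  puts("x == y");\n'
    "  return 0;\n"
    "}\n"
)
noeo = count_noeo(sample)
```

A production tool would more likely work on a token stream or syntax tree, since a regex cannot account for preprocessor macros that expand to equality operators.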
Histogram-based gradient boosting regression (HGBR) [37] is a representative ensemble machine-learning method. Gradient boosting trains a sequence of decision trees to minimize a global loss function. In HGBR, the training data are first separated into bins for each feature based on feature percentiles. This allows the algorithm to use histograms instead of sorted continuous values when building the trees.
Random forest regression (RFR) [38] is another representative ensemble machine-learning method. It combines multiple decision trees, each built using a subsample of the dataset, and averages their predictions. Averaging cancels out some of the errors that arise from the tendency of individual decision trees to overfit.
Decision tree regression (DTR) [39] predicts the value of a target variable by learning the decision rules inferred from data features in a tree structure. A decision tree is constructed by dividing the dataset into smaller subsets. DTR is easy to understand and interpret. However, it has disadvantages, such as overfitting and generating biased trees when some classes are dominant.
Multilayer perceptron regression (MLPR) [40] uses a fully connected class of feedforward artificial neural networks consisting of multiple layers of computational neurons. An MLPR consists of at least three layers of nodes: an input layer, an output layer, and one or more nonlinear hidden layers. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. Neural networks are difficult to tune in terms of hyperparameters but have the advantage of learning nonlinear interactions between features.
Linear regression (LR) [41] is a basic type of regression analysis that models the relationship between a target variable and one or more input variables. The model parameters are estimated by minimizing the error between the predicted and actual values. LR has the advantages of simplicity and ease of interpretation.
HR [42] is a linear regression method that is robust to outliers.It uses the Huber loss function, which is less sensitive to outliers in the data than the mean squared error.
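The difference in outlier sensitivity can be seen by comparing the textbook Huber loss with the squared loss on a single residual, as in the sketch below. The threshold name `delta` and its default value are assumptions for illustration; scikit-learn's HuberRegressor parameterizes its objective differently (with an `epsilon` threshold and a jointly estimated scale), so this is not its exact objective.

```python
def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear once |residual| exceeds delta."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def squared_loss(residual):
    """Half squared error, which grows quadratically with the residual."""
    return 0.5 * residual * residual


small = (huber_loss(0.5), squared_loss(0.5))     # identical inside the threshold
large = (huber_loss(10.0), squared_loss(10.0))   # Huber grows only linearly
```

For a residual of 10, the squared loss is more than five times the Huber loss, so a single outlier pulls a least-squares fit much harder than a Huber fit.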
SGDR [43] is a linear regression method that uses stochastic gradient descent as an optimization algorithm to fit the model.The model is particularly useful for large-scale sparse datasets.
Scikit-learn [44] was used to build the testability prediction models.Scikit-learn is a representative Python machine-learning library that provides various classification, regression, and clustering algorithms.We built testability prediction models using scikit-learn's HistGradientBoostingRegressor, RandomForestRegressor, DecisionTreeRegressor, MLPRegressor, LinearRegression, HuberRegressor, and SGDRegressor.For hyperparameter tuning of the regression model, GridSearchCV, a search method that investigates all parameter combinations, was used.
Hyperparameter tuning was performed for each regression algorithm to identify the optimal parameters. A hyperparameter is an adjustable parameter that controls the model training process. The model performance depends significantly on the hyperparameters.
For the parameters to tune and their search ranges, we used the values from [13] for HGBR, RFR, DTR, and SGDR, from [12] for MLPR, and from [11] for HR. However, for HR, we added the max_iter parameter because the regression analysis library issued a warning recommending that max_iter be increased. Table 3 presents the search scope for hyperparameter determination.
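The idea of an exhaustive search over all parameter combinations can be sketched in a few lines of plain Python. This is a simplified stand-in, not scikit-learn's GridSearchCV: it omits cross-validation, and the parameter names, candidate values, and toy scoring function below are hypothetical rather than taken from Table 3.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and scoring function, used only to exercise the search.
toy_grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4, 8]}


def toy_score(params):
    return -((params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 4) ** 2)


best_params, best_score = grid_search(toy_grid, toy_score)
```

In practice, `score_fn` would train a model with the given hyperparameters and return a cross-validated score, which is the part that a library routine adds on top of the enumeration.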
We constructed testability prediction models for the GA, HC, and RND datasets.Each dataset comprised 33,200 data points derived from test data generation on SUTs of all complexity combinations.
The testability prediction models were evaluated using performance metrics such as the coefficient of determination ($R^2$), mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). $R^2$ represents the extent to which the independent variables explain the dependent variable. The MAE is an easy-to-interpret error metric because it has the same units as the actual value. MSE is an error metric that squares the errors and is therefore sensitive to large errors. In other words, if an outlier exists, its value fluctuates significantly compared with the MAE. RMSE is an error metric that takes the square root of MSE, converting it back into the units of the actual values. For $R^2$, a larger value indicates better model performance, whereas for the error metrics MAE, MSE, and RMSE, a smaller value indicates better model performance.
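The four metrics can be written out directly from their definitions, as in the following sketch. The function name and the sample coverage values are illustrative only and are not results from this study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Illustrative branch coverage values in percent, with one large error.
metrics = regression_metrics([50, 60, 70, 80], [52, 58, 70, 90])
```

In this example the single error of 10 raises the MSE to 27 while the MAE stays at 3.5, which shows how the squared metrics react more strongly to outliers.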

IV. EXPERIMENTS
This section describes the experiments conducted to answer the research questions. First, to verify whether a high-performance testability prediction model can be constructed using C programs with complexity diversity, we built prediction models using high complexity-diversity training data and various regression algorithms. The performance of the models was analyzed using statistical methods. To determine whether training data with low complexity-diversity reduce testability prediction performance, we constructed various prediction models using training data that exhibited different levels of low complexity-diversity and compared their performances. Furthermore, we investigated whether the prediction performance of models trained with low complexity-diversity data degraded as the complexity level of the test data increased by sampling test data with various complexity levels and comparing the resulting performances. Finally, we analyzed the correlation between the complexity-level difference between the training and test data and the performance metrics to determine whether this difference leads to performance degradation.

A. RQ1: CONSTRUCTION OF PREDICTION MODELS CONSIDERING COMPLEXITY DIVERSITY FOR C PROGRAMS
Previous studies constructed testability prediction models using data with low complexity-diversity. In other words, both the training and test data had a high proportion of low-complexity data. However, low-complexity source code tends to have fewer branches and simpler conditions, making it easier to achieve high code coverage. We investigated whether we could build high-performance models for C programs using training data with high complexity-diversity by increasing the proportion of high-complexity data.
We built models using seven learning algorithms (HGBR, RFR, DTR, MLPR, LR, HR, and SGDR) with high complexity-diversity training data generated using a combination of metrics within the maximum and minimum ranges determined based on C/C++ industry standards. Because the coverage achievement varied for each test data generation (TDG) algorithm, the datasets were used separately for the three TDG algorithms: GA, HC, and RND. In other words, 21 models were constructed by combining the seven regression algorithms with the three TDG datasets.
We validated the performance of the models using k-fold cross-validation with randomly shuffled datasets comprising 33,200 data points for each TDG algorithm. K-fold cross-validation divides the dataset into k parts, using k-1 parts as training data and the remaining part as validation data. While there is no formal rule for the choice of k, we used five, a commonly selected value [45].
The experiment was repeated 30 times to reduce the data randomness caused by shuffling. The performance metrics used were MAE, MSE, RMSE, and $R^2$. Fig. 2 shows the experimental process of RQ1.
We calculated the mean MAE and $R^2$ to verify whether the models trained with high complexity-diversity data exhibited a high prediction performance. Fig. 3 shows the mean MAE and $R^2$ results for RQ1.
We observed a mean MAE between 4.502 and 7.734, and a mean $R^2$ between 0.579 and 0.866. Tree-based HGBR, RFR, and DTR showed better mean performances than neural-network-based MLPR and linear-based LR, HR, and SGDR. Considering the $R^2$ values of 0.525 [11], 0.680 [12], and 0.678 [13] in previous testability prediction studies, the results indicate that the model has moderately high performance. For all datasets, the mean MAE and $R^2$ ranked the models from best to worst in the order of HGBR, RFR, DTR, MLPR, LR, HR, and SGDR.
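The validation procedure described above (shuffle, split into five folds, repeat with different shuffles) can be sketched as follows. This is a plain-Python illustration, not the library routine used in the experiments, and the use of the repetition index as the random seed is an assumption.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices and yield (train, validation) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# One repetition over a dataset of 33,200 points; 30 repetitions would vary the seed.
folds = list(k_fold_indices(33_200, k=5, seed=1))
```

Each data point appears in exactly one validation fold per repetition, so every repetition evaluates the model on the full dataset while never validating on data it was trained on.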
We compared the performance of the models built using each regression algorithm to determine which models exhibited statistically significant differences. We performed an ANOVA to analyze the differences among the means. Prior to the ANOVA, we performed Levene's test [46] to assess the homogeneity of variances for the measured performance metrics. Because Levene's test indicated that the variances were not homogeneous, we used Welch's ANOVA, an analysis method applicable when the homogeneity of variances assumption is not met, and the Games-Howell post-hoc test [47] for the analysis.
Table 4 presents the experimental results for RQ1 separated by the dataset and regression algorithms.The experimental results across Tables 4 through 8 are presented as mean ± standard deviation (SD), and the letters to the right of the SD represent the groups with statistically significant differences according to the Games-Howell test results with a p-value below 0.05.Different letters indicate statistically significant differences between the groups.
In all cases, $M_{HGBR}$ exhibited the best performance, followed by $M_{RFR}$. In some cases, no statistically significant differences were observed between $M_{DTR}$, $M_{MLPR}$, and $M_{LR}$; however, the order of model performance remained consistent: $M_{DTR}$, $M_{MLPR}$, and $M_{LR}$. In all cases, $M_{HR}$ and $M_{SGDR}$ exhibited the lowest performances.

B. RQ2: PERFORMANCE ANALYSIS BASED ON THE COMPLEXITY DIVERSITY OF TRAINING DATA
To analyze whether the prediction performance decreased with lower complexity diversity in the training data, we compared the performances of the models trained using data with low complexity-diversity.We separated the training data into low-complexity and high-complexity data and generated training data with low complexity-diversity by combining them such that the proportions of low complexity-data were 90%, 95%, 98%, and 99%, respectively.As the criteria for distinguishing between low and high complexity, we used a CC of eight or lower and an SL of four or lower.This criterion was determined based on the observation that SF110, a representative SUT used in search-based test data generation [19] and employed in related studies [12], [13], had 97.1% nonabstract methods with a CC of eight or lower and an SL of four or lower.Fig. 4 illustrates the experimental process for RQ2.
We used 4,800 training data points and 1,200 test data points for RQ2. M_All was trained using 4,800 training data points sampled from the dataset. M_LC90 was trained using 4,320 low-complexity training data points (CC ≤ 8 and SL ≤ 4) and 480 high-complexity training data points (CC > 8 or SL > 4). In other words, 90% (4,320 out of 4,800) of its training data were low-complexity data, and 10% (480 out of 4,800) were high-complexity data. M_LC95, M_LC98, and M_LC99 were trained using the same method, with low-complexity data making up 95%, 98%, and 99% of the training data, respectively. The four models, M_LC90, M_LC95, M_LC98, and M_LC99, are collectively referred to as LC models.
We used 1,200 test data points sampled from the dataset, excluding the training data. To reduce the effect of randomness introduced by sampling the training and test data, the experiment was repeated 30 times.
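The composition of an LC training set can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the record layout (dictionaries with `cc` and `sl` keys) and the names `is_low_complexity` and `sample_training_set` are hypothetical.

```python
import random

CC_LIMIT = 8  # low complexity requires CC <= 8 ...
SL_LIMIT = 4  # ... and SL <= 4


def is_low_complexity(cc, sl):
    """Apply the low-complexity criterion (CC <= 8 and SL <= 4)."""
    return cc <= CC_LIMIT and sl <= SL_LIMIT


def sample_training_set(pool, n_total, low_ratio, rng):
    """Draw n_total records with the requested share of low-complexity data.

    pool: list of dicts with 'cc' and 'sl' keys (hypothetical layout).
    rng:  a random.Random instance, so each repetition can use its own seed.
    """
    low = [r for r in pool if is_low_complexity(r["cc"], r["sl"])]
    high = [r for r in pool if not is_low_complexity(r["cc"], r["sl"])]
    n_low = round(n_total * low_ratio)
    n_high = n_total - n_low
    return rng.sample(low, n_low) + rng.sample(high, n_high)


# Small synthetic pool: 150 low-complexity and 50 high-complexity records.
pool = [{"cc": 3, "sl": 2} for _ in range(150)]
pool += [{"cc": 12, "sl": 5} for _ in range(50)]
train = sample_training_set(pool, n_total=100, low_ratio=0.90, rng=random.Random(0))
```

With `n_total=4800` and `low_ratio=0.90`, the same function would draw 4,320 low-complexity and 480 high-complexity records, matching the composition described for M LC90.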
We used HGBR, which exhibited the best performance for RQ1, as the regression algorithm to build the prediction model. The experimental results were measured and analyzed using MAE, MSE, RMSE, and R², and an ANOVA test was performed. Fig. 5 shows the mean MAE and R² results for RQ2.
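For reference, the four performance metrics can be computed as in the following sketch, which uses the standard definitions and only the Python standard library. The function name `regression_metrics` is an assumption for this example; the study itself used scikit-learn for model construction.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R^2 compares the residual error with the variance around the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Example with hypothetical coverage values.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

Lower MAE, MSE, and RMSE and a higher R² all indicate better prediction performance.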
The mean MAE and R² of the models were consistently better in the order of M_All, M_LC90, M_LC95, M_LC98, and M_LC99 across all datasets. In other words, we confirmed that the mean prediction performance of the models decreased as the proportion of low-complexity data increased and the complexity diversity of the training data decreased. The mean MAE and R² of M_All differ from those of M_HGBR in RQ1 because RQ2 used sampled training data in order to limit the complexity diversity of the training data. Table 5 presents the experimental results for RQ2 by dataset and by the proportion of low-complexity data in the training data.
The mean performance was consistently better in the order of M_All, M_LC90, M_LC95, M_LC98, and M_LC99 across all results. The ANOVA results did not indicate a statistically significant difference for every pair of models. However, M_All always demonstrated significantly better performance than the LC models. Moreover, M_LC90 always exhibited significantly better performance than M_LC98 and M_LC99, and M_LC95 always exhibited significantly better performance than M_LC99. Therefore, the results indicate that the prediction performance tends to decrease as the complexity diversity of the training data decreases, following the order of M_All, M_LC90, M_LC95, M_LC98, and M_LC99.

C. RQ3: PERFORMANCE ANALYSIS BASED ON THE COMPLEXITY LEVEL OF TEST DATA
We investigated whether the prediction performance of models trained on data with low complexity-diversity decreases as the complexity level of the test data increases. If a higher complexity level in the test data reduces the prediction performance of LC models, it could be challenging to accurately predict the testability of complex source code. To distinguish the complexity levels of the test data, we sorted them by the sum of the CC and SL values. Even if the CC is high, a low SL indicates fewer nested conditions, which means that the conditions are relatively easy to cover. Conversely, even if the SL is high, a low CC implies that there are fewer conditions to cover. Therefore, we used the combination of CC and SL as the feature that influences code coverage, which is a measure of testability effectiveness. The sorted test data were divided into five parts, each representing a different level of complexity. The LC models were constructed as in RQ2. Fig. 6 illustrates the experimental process for RQ3.
We evaluated the models trained with low complexity-diversity, constructed using the same method as in RQ2, on test data divided by complexity level.
To identify test data with different levels of complexity, we sampled 6,000 data points, excluding the training data. The sampled data were then sorted by their CC+SL values and divided into five parts of 1,200 data points each. We labeled the parts CL1 through CL5, where CL1 has the lowest complexity level and CL5 the highest. The experiment was repeated 30 times to reduce the effect of randomness introduced by sampling the training and test data.
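The sort-and-split step can be sketched as follows. The record layout and the name `split_by_complexity` are hypothetical, and the handling of ties and of leftover records (dropped when the count is not divisible by the number of levels) is an assumption of this example rather than a detail given in the text.

```python
import random


def split_by_complexity(records, n_levels=5):
    """Sort records by CC + SL and cut them into equally sized levels.

    Returns a list [CL1, ..., CLn], lowest complexity first.
    """
    ordered = sorted(records, key=lambda r: r["cc"] + r["sl"])
    size = len(ordered) // n_levels
    return [ordered[i * size:(i + 1) * size] for i in range(n_levels)]


# Ten synthetic records with distinct CC + SL values, in shuffled order.
records = [{"cc": i, "sl": 0} for i in range(10)]
random.Random(1).shuffle(records)
levels = split_by_complexity(records)
```

Applied to 6,000 sampled data points, this yields five parts of 1,200 points each, corresponding to CL1 through CL5.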
Evaluating the five models on the five test sets yielded 25 performance results. The experimental results were analyzed using the MAE, MSE, RMSE, and R², and an ANOVA test was performed. The experiment was conducted separately for each dataset, and we subsequently analyzed whether there was a similar tendency across all the datasets. Fig. 7 presents the mean MAE and R² results for the GA dataset for RQ3.
The experimental results indicate that M_All, which was built using training data without complexity limitations, demonstrated a mean MAE of 4.579 or less and a mean R² of 0.866 or higher at all complexity levels of the test data. For the CL1 test data, all models exhibited a mean MAE of 3.916 or less and a mean R² of 0.871 or higher. On the CL1 test data, the LC models showed high performance similar to that of M_All. This was because the LC models were built using a high proportion of low-complexity training data, similar to the CL1 test data.
All LC models showed a decrease in prediction performance in terms of the mean MAE and R² as the complexity level of the test data increased from CL1 to CL4. However, when the complexity level of the test data increased from CL4 to CL5, the MAE decreased slightly for M_LC98, and R² increased slightly for M_LC90, M_LC95, and M_LC98. Table 6 presents the experimental results for RQ3 in the GA dataset, separated by the proportion of low-complexity data in the training data and the complexity level of the test data.
In all cases, M_All did not exhibit a consistent performance difference based on the complexity level of the test data. The mean performance consistently decreased from CL1 to CL4 across all LC models. In some LC model results, a reversal of mean performance was observed between CL4 and CL5.
However, in all LC model results, CL2-CL5 always showed statistically significantly lower performance than CL1, and CL4 always showed statistically significantly lower performance than CL2. Furthermore, except for the MSE and RMSE between CL4 and CL5 for M_LC90 and M_LC95, no statistically significant performance improvement was observed between CL4 and CL5. Therefore, despite some exceptions, the prediction performance tends to decrease as the complexity level of the test data increases in the GA dataset. Fig. 8 shows the mean MAE and R² results for the HC dataset for RQ3.
The experimental results indicate that M_All demonstrated a mean MAE of 5.229 or less and a mean R² of 0.736 or higher across all complexity levels of the test data. For the CL1 test data, all models exhibited a mean MAE of 4.458 or less and a mean R² of 0.702 or higher, with the LC models exhibiting slightly lower performance than M_All in terms of both MAE and R².
The mean MAE consistently increased from CL1 to CL5 across all LC models. The mean R² decreased in the order of CL1, CL2, CL4, and CL5 across all LC models. However, for M_LC90, M_LC95, and M_LC98, there were instances in which the R² for CL3 was slightly higher than that for CL2. Table 7 presents the experimental results for RQ3 for the HC dataset.
Similar to the GA dataset, in all cases M_All did not exhibit a consistent performance difference based on the complexity level of the test data. The mean performance consistently decreased in the order of CL1, CL2, CL4, and CL5 across all LC models. In some LC model results, a reversal of mean performance was observed between CL2 and CL3.
However, in all LC model results, CL2-CL5 always showed statistically significantly lower performance than CL1, and CL5 always showed statistically significantly lower performance than CL1-CL4. Furthermore, in all LC models, although there were instances where the performance of the model significantly decreased as the complexity level of the test data increased, no instances of a statistically significant increase were observed. Therefore, for the HC dataset, the prediction performance decreased as the complexity level of the test data increased. Fig. 9 presents the mean MAE and R² results for the RND dataset for RQ3.
The mean MAE consistently increased from CL1 to CL5 across all LC models. Similarly, the mean R² consistently decreased from CL1 to CL5 across all LC models. The RND dataset consistently demonstrated a decrease in mean performance in terms of MAE and R² as the complexity level of the test data increased. Table 8 presents the experimental results for RQ3 for the RND dataset.
In all cases, M_All showed statistically significant differences between CL1 and CL2-CL4. However, there was no consistent relationship among CL2-CL4. Across all LC models, the mean performance consistently decreased from CL1 to CL5, except between CL4 and CL5 in terms of MSE and RMSE for M_LC90 and M_LC98.
Furthermore, except for the MAE, MSE, and RMSE between CL4 and CL5 for all LC models, statistically significant performance decreases were observed in all cases as the complexity level of the test data increased. Therefore, for the RND dataset, the prediction performance decreases as the complexity level of the test data increases.

D. RQ4: PERFORMANCE ANALYSIS BASED ON THE DIFFERENCE IN COMPLEXITY LEVELS OF TRAINING AND TEST DATA
We investigated whether a larger difference in complexity levels between the training and test data is associated with a larger difference in the prediction performance of the model. If the prediction performance decreases linearly with the difference in complexity levels between the training and test data, it can be expected that the lower the complexity diversity of the training data, the greater the decrease in the prediction performance for high-complexity source code. To measure the difference in complexity levels between the training and test data, we employed the concept of effect size, which measures the difference between the means of two groups relative to the standard deviation of the data. Specifically, we used Cohen's d [48], a representative method for computing the effect size, which is calculated as the difference between the means divided by the pooled standard deviation of the data.
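A minimal sketch of this effect-size computation is shown below, using the common pooled-standard-deviation form of Cohen's d. The function name `cohens_d` and the sign convention (test mean minus training mean, so that more complex test data gives a positive value) are assumptions of this example.

```python
import math
from statistics import mean, variance


def cohens_d(train_values, test_values):
    """Cohen's d between two samples of CC + SL values.

    Positive when the test data are, on average, more complex than the
    training data (sign convention assumed for this sketch).
    """
    n1, n2 = len(train_values), len(test_values)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(train_values) + (n2 - 1) * variance(test_values)
    ) / (n1 + n2 - 2)
    return (mean(test_values) - mean(train_values)) / math.sqrt(pooled_var)


# Example with hypothetical CC + SL values.
d = cohens_d([1, 2, 3], [3, 4, 5])
```

In this toy example both samples have a standard deviation of 1 and their means differ by 2, so the difference is two pooled standard deviations.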
To calculate the difference in complexity levels, we used the models and test data from RQ3. The experiments were conducted using 25 combinations of the five models and the five CL test sets. Each combination was repeated 30 times to reduce random sampling effects. Thus, the experiment was conducted 750 times, and the difference in complexity levels between the training and test data was calculated each time. We used CC+SL as the complexity-level indicator, consistent with RQ3. Scatter plots were drawn to analyze the relationship between the complexity-level difference and R². Fig. 10 presents a scatter plot of the differences in complexity levels between the training and test data and R².
In all the datasets, R² tends to decrease as Cohen's d increases. To quantitatively measure the relationship between Cohen's d and the model performance, we conducted a Pearson correlation analysis. Table 9 presents the Pearson correlation coefficients between Cohen's d and the model performance. All correlation coefficients had p-values < 0.05.
The experimental results showed that the absolute value of the correlation coefficient in all cases was 0.804 or higher. Assuming a reasonably sized dataset, a correlation value of less than 0.1 is trivial, 0.1-0.3 is minor, 0.3-0.5 is moderate, 0.5-0.7 is large, 0.7-0.9 is very large, and 0.9-1 is almost perfect [49]. MAE, MSE, and RMSE are error metrics; the larger their values, the lower the prediction performance. Therefore, a positive correlation implies that the greater the difference in complexity, the lower the prediction performance. R² denotes the model's explanatory power; the smaller its value, the lower the prediction performance. Hence, a negative correlation indicates that the larger the difference in complexity, the lower the prediction performance. Based on the experimental results, we found a strong relationship between the prediction performance of the model and the difference in complexity levels between the training and test data.
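The correlation analysis and the magnitude scale above can be sketched as follows. The Pearson coefficient follows its standard definition; the names `pearson_r` and `describe_magnitude` are hypothetical, and assigning each boundary value (for example, exactly 0.7) to the higher category is an assumption of this example, since the text gives only the ranges.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def describe_magnitude(r):
    """Map |r| to the verbal scale quoted in the text."""
    scale = [
        (0.1, "trivial"),
        (0.3, "minor"),
        (0.5, "moderate"),
        (0.7, "large"),
        (0.9, "very large"),
    ]
    for upper, label in scale:
        if abs(r) < upper:
            return label
    return "almost perfect"


# Hypothetical data: error grows with the complexity difference.
r_error = pearson_r([0.1, 0.5, 1.0, 1.5], [3.0, 4.0, 5.5, 6.0])
label = describe_magnitude(r_error)
```

Under this scale, a coefficient whose absolute value is 0.804 or higher falls in the "very large" or "almost perfect" range.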

V. DISCUSSION AND LIMITATIONS
This section discusses the experimental results and the threats to validity. In the discussion, we summarize the experimental results for each RQ, analyze the implications of these results, and suggest possibilities for expanding research on testability prediction. In the threats to validity, we analyze the possible threats that might affect the validity of the results.

A. DISCUSSION
We summarize the experimental results by averaging the outcomes across all datasets. Fig. 11 presents the mean results across all datasets for each RQ. The RQ1 charts depict the average MAE and R² for each regression algorithm. For RQ2, the charts display the average MAE and R² based on the complexity diversity of the training data. The RQ3 charts show the average MAE and R² across all LC models, categorized by the complexity level of the test data. Lastly, the RQ4 chart shows the absolute values of the correlations between the difference in complexity levels of the training and test data and the MAE and R².
In RQ1, the regression algorithm with the highest mean performance was HGBR, which showed an MAE of 5.125 and an R² of 0.813. The tree-based HGBR, RFR, and DTR exhibited better mean performance than the neural-network-based MLPR and the linear LR, HR, and SGDR. HGBR showed an MAE that was 11.6% lower and an R² that was 4.3% higher than MLPR. Compared to the linear algorithms, HGBR showed an MAE that was 24.8% to 31.1% lower and an R² that was 12.6% to 26.9% higher.
In RQ2, the performance of the models consistently decreased as the complexity diversity of the training data decreased. Compared to All, the MAE of LC90-LC99 was respectively 5.8%, 10.0%, 16.3%, and 22.3% higher, and the R² was respectively 3.3%, 4.8%, 7.8%, and 10.4% lower.
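These percentages are relative changes with respect to the baseline model. As a worked check, the sketch below applies the calculation to the endpoints reported in the abstract (MAE from 5.203 to 6.361, R² from 0.809 to 0.725), assuming those values correspond to All and LC99; the function name `relative_change` is hypothetical.

```python
def relative_change(baseline, value):
    """Change of `value` relative to `baseline`, as a fraction."""
    return (value - baseline) / baseline


# Endpoints taken from the abstract; mapping them to All and LC99
# is an assumption of this sketch.
mae_change = relative_change(5.203, 6.361)  # positive: error went up
r2_change = relative_change(0.809, 0.725)   # negative: R^2 went down

mae_percent = round(mae_change * 100, 1)
r2_percent = round(r2_change * 100, 1)
```

Rounded to one decimal place, the results agree with the 22.3% higher MAE and 10.4% lower R² reported for LC99.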
In RQ3, the performance of the models consistently decreased as the complexity level of the test data increased. CL1 showed significantly higher performance than CL2-CL5 because its proportion of low-complexity data is similar to that of the training data used to construct the LC models. Relative to CL1, the MAE of CL2-CL5 was 57.5% to 89.6% higher, and the R² was 10.2% to 18.3% lower. The performance differences among CL2-CL5, whose proportions of low-complexity data differ from those of the LC models' training data, were smaller than their differences from CL1. Compared to CL2, the MAE of CL3-CL5 was respectively 7.7%, 17.7%, and 20.3% higher, and the R² was respectively 1.4%, 3.8%, and 9.1% lower.
In RQ4, the correlations of the difference in complexity between the training and test data with the MAE and R² were 0.898 and −0.848, respectively. The positive correlation with MAE and the negative correlation with R² both imply that the model's performance decreases as the complexity difference increases. In the chart, we used the absolute values of the correlation coefficients to focus on their magnitudes.
The experimental results have the following implications. The high performance of the HGBR model provides accurate predictions of testability, suggesting that it can help developers proactively identify software components that require significant testing effort, plan testing activities, and recognize the need for refactoring to reduce the test effort. The findings of this study highlight the importance of considering complexity diversity in testability prediction for C programs. The decrease in model performance as the complexity diversity of the training data decreases, the variation in model performance based on the complexity level of the test data, and the correlation between model performance and complexity difference demonstrate the impact of complexity diversity on testability prediction. These findings enhance the understanding of this impact, an aspect not thoroughly examined in previous studies. These insights can assist test engineers in improving the performance of testability prediction models.
Testability is a multifaceted issue that depends on source code, design patterns, software architecture, process complexity, domain characteristics, programming language, and even the programmer's experience. However, source code is the most concrete object for testability evaluation, and collecting and measuring source code-based metrics is relatively easy. Therefore, we focused our research on source code, but the study can be expanded by considering other factors that affect testability.
Firstly, the study can be expanded by using test patterns that consider domain characteristics. By utilizing test patterns that consider domain-specific behavior and types of defects, a testing process can be made more systematic and effective. Therefore, test pattern-based metrics such as the number of applied test patterns and the complexity of applied test patterns can be used as variables for evaluating testability in terms of testing effectiveness.
For example, Siddiqui and Khan [50] proposed test patterns for cloud applications. They proposed a structure for test patterns and methods for identifying applicable patterns through feature analysis of the test patterns. The test patterns include the test situation, test target, and sequence of actions needed to perform the test. Based on this study, the number of applicable test patterns, the number of applied test patterns, and the complexity of applied test patterns can be used in testability prediction research. In addition, Górski [51] proposed a test pattern for smart contracts that considers symmetric characteristics based on verification rules for smart contracts. Based on this study, the number of necessary test cases and the number of performed test cases can be used in testability prediction research.
Secondly, the study can be expanded to other languages. By considering the structural characteristics and types of defects according to the programming language, the effectiveness of testing can be enhanced. We discuss expanding the study to Rust and JavaScript.
Rust is a systems programming language focused on performance, type safety, and concurrency. Rust ensures safety through ownership and borrowing, memory management mechanisms that reject invalid references at compile time and prevent data races. Considering these language-specific characteristics, metrics such as the number of ownership transfers and the number of borrow checks can be used for testability prediction in Rust programs.
JavaScript is a scripting language for web development. JavaScript supports implicit globals, which makes global variables easy to use, and callbacks are widely used for programming in asynchronous web environments. Considering these language-specific characteristics, metrics such as the number of global variables used, the number of callbacks, and the nested callback depth can be used for testability prediction in JavaScript programs.

B. THREATS TO VALIDITY
Threats to internal validity include the selection of parameters used in both the generation of test data and the construction of the model, as well as the reduction of randomness in these processes. We used parameter values from previous studies [31], [32] to generate the test data. To obtain reasonable statistical power for the test data generation, we repeated the generation ten times [33]. For the model construction, we tuned the parameters using GridSearchCV over the hyperparameter ranges of existing studies [11], [12], [13]. The experiments were repeated 30 times to reduce the randomness caused by sampling.
Threats to construct validity include obtaining the appropriate tools for the experiment. We developed tools for test data generation and SUT generation to collect training data with high complexity-diversity. Both tools were developed in Java, and we used the JavaCC C parser available on Java.net for the C source code analysis. The test data generation tool is available from Zenodo.¹ We automatically generated SUTs with feasible combinations of metric ranges based on C/C++ industry standards. Because the generated SUTs may contain infeasible branches, we used Joggie [27], a tool for detecting infeasible code, to remove infeasible branches. Joggie uses the Princess solver [28] to assess feasibility. The Princess solver, a tool that checks whether conditions are satisfiable, is an essential component of infeasible code detection. The Princess solver won the TFA division (arithmetic problems) in the 2012 CADE ATP System Competition, a yearly competition for fully automated theorem provers, and the TFI category (integer problems), and was runner-up in the TFA division in 2013 and 2014. Furthermore, the Princess solver is used in JavaSMT [29], a unifying Java interface for SMT solvers, and Eldarica [30], a predicate-abstraction-based model checker.
Other threats include the methods of metric collection, model construction, and statistical analysis. For the metric collection, we used the commercial metric analysis tools Understand 6.2 [26] and PC-lint Plus 2.0 [36]. One of the metrics used, NOEO, was measured using a tool developed to count the numbers of '==' and '!=' operators within a function. We verified the correctness of the measured NOEO by inspecting a subset of the C source code. We used scikit-learn [44], a Python machine-learning library, for model construction. For statistical analyses, such as ANOVA and correlation analysis, we used IBM SPSS Statistics 27.
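To make the NOEO measurement concrete, the following sketch counts '==' and '!=' operators in the text of a C function body. It is an illustration only and not the tool used in the study; the name `count_noeo` is hypothetical, and skipping comments, string literals, and character literals is an assumption about how such a counter would reasonably behave.

```python
import re

# Matches C comments, string literals, and character literals so that
# operators appearing inside them are not counted (assumed behavior).
_COMMENT_OR_LITERAL = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)


def count_noeo(function_body):
    """Count '==' and '!=' operators in the source text of a C function."""
    code_only = _COMMENT_OR_LITERAL.sub(" ", function_body)
    return len(re.findall(r"==|!=", code_only))


body = """
    if (a == b && c != d) {   /* x == y is only a comment */
        s = "!=";             // so is this: e == f
    }
    if (n <= 3 || n >= 9) { return 0; }
"""
noeo = count_noeo(body)
```

Relational operators such as '<=' and '>=' are not counted, because the pattern requires either two equals signs or an exclamation mark followed by an equals sign.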
Threats to external validity include obtaining large-scale data with high complexity-diversity to generalize the results. Based on the metric upper limits of the MISRA [20], JPL [21], and JSF [22] standards, we generated ten SUTs for each of the 3,320 feasible metric combinations, totaling 33,200 SUTs.
We performed a statistical analysis of the experimental results to ensure the validity of our conclusions. We analyzed not only the mean and SD but also whether there was a statistically significant difference using the ANOVA test.

VI. CONCLUSION
In this study, we developed a testability prediction model for C programs and investigated the impact of the complexity diversity of training and test data on testability prediction performance. We built a model to predict branch coverage, which is a measure of testability effectiveness, using training data with high complexity-diversity and analyzed the prediction performance. To confirm the importance of complexity diversity in testability prediction models, we observed performance differences according to the complexity levels of the training and test data. For the experiment, we generated 33,200 SUTs with high complexity-diversity and measured their branch coverage using search-based test data generation. We collected nine metrics affecting branch coverage achievement. We built a testability prediction model through regression analysis using branch coverage as the dependent variable and these metrics as independent variables. The regression algorithm with the best performance, HGBR, showed a mean R² of 0.813. Through ANOVA, we observed a statistically significant decrease in the performance of the models across all datasets when the complexity diversity of the training data was low. Compared to All, the mean R² of LC90-LC99 was respectively 3.3%, 4.8%, 7.8%, and 10.4% lower. In addition, we confirmed that the performance of the model trained with low complexity-diversity data decreased as the complexity of the test data increased. Compared to CL2, the mean R² of CL3-CL5 was respectively 1.4%, 3.8%, and 9.1% lower. Finally, we observed a strong mean correlation, with an absolute value of 0.848 or higher, between the difference in the complexity levels of the training and test data and the performance of the prediction model. Our research assists developers in focusing their testing efforts efficiently and highlights the necessity for test engineers to use training data with complexity diversity. This approach can improve the training process of the testability prediction model for C programs.
In future work, we plan to expand our study by using methods that reflect domain characteristics, such as test patterns, and by using language-specific metrics tailored to other languages such as Rust and JavaScript. By using test patterns that consider domain-specific behavior and types of defects in testability predictions, the impact of domain characteristics can be reflected. In addition, by considering language-specific features such as memory management and asynchronous processing methods in testability predictions, the impact of language characteristics can be reflected.

FIGURE 3. RQ1. Mean MAE and R² of regression models in each dataset.

FIGURE 5. RQ2. Mean MAE and R² of models with low-complexity training data in each dataset.

FIGURE 7. RQ3. Mean MAE and R² of models with the different complexity levels of test data in GA dataset.

FIGURE 8. RQ3. Mean MAE and R² of models with the different complexity levels of test data in HC dataset.

FIGURE 9. RQ3. Mean MAE and R² of models with the different complexity levels of test data in RND dataset.

FIGURE 10. RQ4. Scatter plot of the difference in complexity levels between training and test data and R².

FIGURE 11. Summary of experimental results.

TABLE 2. Test data generation parameters.

TABLE 4. RQ1. Performance of regression models in each dataset.

TABLE 5. RQ2. Performance of models with low-complexity training data in each dataset.

TABLE 6. RQ3. Performance of models with the different complexity levels of test data in GA dataset.

TABLE 7. RQ3. Performance of models with the different complexity levels of test data in HC dataset.

TABLE 8. RQ3. Performance of models with the different complexity levels of test data in RND dataset.