A high-assurance system is largely dependent on the quality of its underlying software. Software quality models can provide timely estimations of software quality, allowing the detection and correction of faults prior to operations. A software metrics-based quality prediction model may exhibit overfitting, which occurs when a prediction model has good accuracy on the training data but relatively poor accuracy on the test data. We present an approach to address the overfitting problem in the context of software quality classification models based on genetic programming (GP). The problem has not been addressed in depth for GP-based models. The presence of overfitting in a software quality classification model limits its practical usefulness, because management is interested in how well the model performs when applied to unseen software modules, i.e., its generalization performance. In the process of building GP-based software quality classification models for a high-assurance telecommunications system, we observed that the GP models were prone to overfitting. We utilize a random sampling technique to reduce overfitting in our GP models. Many researchers have found random sampling to be an effective method for reducing the time of a GP run. In our study, however, we utilize random sampling to reduce overfitting, with the aim of improving the generalization capability of our GP models.
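As a rough illustration of the two ideas above, the following Python sketch measures overfitting as the gap between test and training error, and evaluates fitness on a random subset of the training modules. The abstract does not specify how the sampling is performed, so the details here are assumptions: a fresh subset is drawn without replacement on each call, fitness is the plain misclassification rate, and all names (`sample_subset`, `error_rate`, `overfitting_gap`, `evaluate_generation`) are hypothetical rather than taken from the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A module is (software metrics, label); label 1 = fault-prone, 0 = not fault-prone.
Module = Tuple[Sequence[float], int]
Classifier = Callable[[Sequence[float]], int]


def sample_subset(data: List[Module], fraction: float, rng: random.Random) -> List[Module]:
    """Draw a random subset of the training modules without replacement."""
    k = max(1, int(len(data) * fraction))
    return rng.sample(data, k)


def error_rate(classifier: Classifier, data: List[Module]) -> float:
    """Fraction of modules the classifier labels incorrectly."""
    wrong = sum(1 for metrics, label in data if classifier(metrics) != label)
    return wrong / len(data)


def overfitting_gap(classifier: Classifier, train: List[Module], test: List[Module]) -> float:
    """Test error minus training error; a large positive value suggests overfitting."""
    return error_rate(classifier, test) - error_rate(classifier, train)


def evaluate_generation(population: List[Classifier], train: List[Module],
                        fraction: float, rng: random.Random) -> List[float]:
    """Score every individual on one freshly drawn subset (assumed scheme).

    Because the subset changes from call to call, an individual cannot
    score well merely by memorizing one fixed set of training modules.
    """
    subset = sample_subset(train, fraction, rng)
    return [error_rate(individual, subset) for individual in population]
```

In a GP run, `evaluate_generation` would be called once per generation with the current population, and `overfitting_gap` would be computed on the final model using a held-out test set that is never used for fitness evaluation.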