Skip to Main Content
We view the problem of estimating the defect content of a document after an inspection as a machine learning problem: The goal is to learn from empirical data the relationship between certain observable features of an inspection (such as the total number of different defects detected) and the number of defects actually contained in the document. We show that some features can carry significant nonlinear information about the defect content. Therefore, we use a nonlinear regression technique, neural networks, to solve the learning problem. To select the best among all neural networks trained on a given data set, one usually reserves part of the data set for later cross-validation; in contrast, we use a technique which leaves the full data set for training. This is an advantage when the data set is small. We validate our approach on a known empirical inspection data set. For that benchmark, our novel approach clearly outperforms both linear regression and the current standard methods in software engineering for estimating the defect content, such as capture-recapture. The validation also shows that our machine learning approach can be successful even when the empirical inspection data set is small.