Abstract:
The fundamental issue in cross-project defect prediction is selecting the most appropriate training data for creating quality defect predictors. Another concern is whether, from a practical point of view, historical data from open-source projects can be used to build quality predictors for proprietary projects. Current studies have proposed statistical approaches to finding such training data; however, thus far no apparent effort has been made to study their success on proprietary data. These methods also apply brute-force techniques that are computationally expensive. In this work, we introduce a novel data selection procedure that takes into account the similarity between the distributions of the test data and the potential training data. Additionally, we use feature subset selection to increase the similarity between the test and training sets. Our procedure provides a comparable and scalable means of solving the cross-project defect prediction problem for creating quality defect predictors. To evaluate our procedure, we conducted empirical studies with comparisons to within-company defect prediction and a relevancy filtering method. We found that our proposed method performs relatively better than the filtering method in terms of both computation cost and prediction performance.
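To make the idea of distribution-based training data selection concrete, the following is a minimal sketch, not the paper's actual procedure: it assumes a per-feature Kolmogorov-Smirnov statistic as the distribution-similarity measure and a simple ranking-based feature subset selection, both of which are illustrative assumptions since the abstract does not specify the measures used.

```python
# Hypothetical sketch of distribution-similarity-based training data selection
# for cross-project defect prediction. The KS-based similarity measure and the
# ranking-based feature subset selection are assumptions for illustration only.
import numpy as np
from scipy.stats import ks_2samp


def distribution_distance(train_X, test_X):
    """Mean per-feature KS statistic between training and test feature distributions."""
    return np.mean([ks_2samp(train_X[:, j], test_X[:, j]).statistic
                    for j in range(test_X.shape[1])])


def select_training_project(candidate_projects, test_X):
    """Pick the candidate project whose feature distribution is closest to the test set."""
    return min(candidate_projects,
               key=lambda proj: distribution_distance(proj["X"], test_X))


def select_feature_subset(train_X, test_X, keep_ratio=0.5):
    """Keep the features on which the training and test distributions agree the most."""
    per_feature = np.array([ks_2samp(train_X[:, j], test_X[:, j]).statistic
                            for j in range(test_X.shape[1])])
    n_keep = max(1, int(keep_ratio * test_X.shape[1]))
    return np.argsort(per_feature)[:n_keep]  # indices of the most similar features
```

A selected project and feature subset would then feed an ordinary classifier trained on the chosen project's data; in contrast to brute-force relevancy filtering over all candidate instances, this kind of per-project similarity check scales with the number of projects and features rather than the number of instance pairs.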
Published in: 2013 ACM / IEEE International Symposium on Empirical Software Engineering and Measurement
Date of Conference: 10-11 October 2013
Date Added to IEEE Xplore: 12 December 2013
Print ISBN: 978-0-7695-5056-5