A number of V&V datasets are publicly available. These datasets contain software measurements and defectiveness information for software modules. To facilitate V&V, numerous defect prediction studies have used these datasets and have detected defective modules effectively. Software developers and managers can benefit from these studies and avoid analogous defects and mistakes if they can find similarities between their own software and the software represented by the public datasets. This paper identifies similar datasets by comparing the association patterns they contain. The proposed approach mines association rules from each dataset and, for each pair of datasets being compared, identifies the rules that overlap among the 100 strongest rules of each dataset. The average support and average confidence of the overlapping rules are then calculated to determine the strength of the similarity between the two datasets. This study compares eight public datasets, and the results show that KC2 and PC2 have the highest similarity (83%), with 97% support and 100% confidence. Datasets with similar attributes and almost the same number of attributes have shown higher similarity than the other datasets.
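The comparison step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation, and it rests on several assumptions that the abstract does not settle: rules are ranked by confidence and then support to pick the "strongest" ones, two rules overlap when their antecedent and consequent match exactly, similarity is the number of overlapping rules divided by `k`, and the averages are taken over the overlapping rules as they appear in both datasets. The `Rule` type, the function names, and the attribute values in the toy data are all hypothetical.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Tuple


class Rule(NamedTuple):
    """One mined association rule (hypothetical container for this sketch)."""
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    support: float
    confidence: float


RuleKey = Tuple[FrozenSet[str], FrozenSet[str]]


def strongest(rules: List[Rule], k: int = 100) -> Dict[RuleKey, Rule]:
    """Keep the k strongest rules, keyed by (antecedent, consequent).

    Assumption: 'strongest' means highest confidence, ties broken by support.
    """
    ranked = sorted(rules, key=lambda r: (r.confidence, r.support), reverse=True)
    return {(r.antecedent, r.consequent): r for r in ranked[:k]}


def compare(rules_a: List[Rule], rules_b: List[Rule], k: int = 100) -> Dict[str, float]:
    """Compare two datasets through the overlap of their k strongest rules."""
    top_a = strongest(rules_a, k)
    top_b = strongest(rules_b, k)
    shared = top_a.keys() & top_b.keys()  # rules present in both top-k sets
    if not shared:
        return {"similarity": 0.0, "avg_support": 0.0, "avg_confidence": 0.0}
    # Assumption: average over both datasets' copies of each shared rule.
    paired = [r for key in shared for r in (top_a[key], top_b[key])]
    return {
        "similarity": len(shared) / k,
        "avg_support": sum(r.support for r in paired) / len(paired),
        "avg_confidence": sum(r.confidence for r in paired) / len(paired),
    }


# Toy rule sets with made-up attribute values, only to exercise the functions.
rules_a = [
    Rule(frozenset({"loc=high"}), frozenset({"defective=yes"}), 0.5, 1.0),
    Rule(frozenset({"cc=low"}), frozenset({"defective=no"}), 0.75, 0.5),
    Rule(frozenset({"comments=low"}), frozenset({"defective=yes"}), 0.25, 0.25),
]
rules_b = [
    Rule(frozenset({"loc=high"}), frozenset({"defective=yes"}), 0.25, 1.0),
    Rule(frozenset({"branches=low"}), frozenset({"defective=no"}), 0.5, 0.75),
]
result = compare(rules_a, rules_b, k=2)
```

With `k=2`, only the `loc=high` rule appears in both top-k sets, so the sketch reports a similarity of 0.5. In a real comparison the rule lists would come from an association rule miner run on each dataset, and attributes would first need to be discretized and named consistently across datasets for rules to match at all.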