Nowadays, huge amounts of information from different industrial processes are stored into databases and companies can improve their production efficiency by mining some new knowledge from this information. However, when these databases becomes too large, it is not efficient to process all the available data with practical data mining applications. As a solution, different approaches for intelligent selection of training data for model fitting have to be developed. In this article, training instances are selected to fit predictive regression models developed for optimization of the steel manufacturing process settings beforehand, and the selection is approached from a clustering point of view. Because basic k-means clustering was found to consume too much time and memory for the purpose, a new algorithm was developed to divide the data coarsely, after which k-means clustering could be performed. The instances were selected using the cluster structure by weighting more the observations from scattered and separated clusters. The study shows that by using this kind of approach to data set selection, the prediction accuracy of the models will get even better. It was noticed that only a quarter of the data, selected with our approach, could be used to achieve results comparable with a reference case, while the procedure can be easily developed for an actual industrial environment.
Published in:
Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on
Date of Conference: 1-8 June 2008