Synthetic data sets are useful in a variety of situations, including repeatable regression testing and providing realistic (but not real) data to third parties for testing new software. Researchers, engineers, and software developers can test against a safe data set without affecting or even accessing the original data. This insulates them from privacy and security concerns and lets them generate larger data sets than would be available using only real data. Practitioners use data mining technology to discover patterns in real data sets that aren't apparent at the outset. This article explores how to combine information derived from data mining applications with the descriptive ability of synthetic data generation software. Our goal is to demonstrate that at least some data mining techniques (in particular, decision trees) can discover patterns that we can then inverse-map into synthetic data sets. These synthetic data sets can be of any size and will faithfully exhibit the same decision tree patterns. Our work builds on two technologies: the Synthetic Data Definition Language (SDDL) and the Predictive Model Markup Language (PMML).
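To make the inverse-mapping idea concrete, the following is a minimal Python sketch, not the article's SDDL or PMML tooling. It assumes a decision tree has already been mined and reduced to a table of leaves, each with the attribute ranges along its path, the share of training rows it covered, and its predicted class. The leaf table, the attribute names (`age`, `income`), the numbers, and the helpers `generate_rows` and `classify` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical leaves of a small mined decision tree (illustrative values only).
# Each leaf records the half-open attribute ranges [low, high) on its path,
# the fraction of training rows that reached it, and the class it predicts.
LEAVES = [
    {"ranges": {"age": (18, 30), "income": (10_000, 40_000)}, "weight": 0.5, "label": "low"},
    {"ranges": {"age": (30, 65), "income": (10_000, 40_000)}, "weight": 0.3, "label": "medium"},
    {"ranges": {"age": (18, 65), "income": (40_000, 120_000)}, "weight": 0.2, "label": "high"},
]


def generate_rows(leaves, n, seed=0):
    """Inverse-map the leaves into n synthetic rows.

    For each row: pick a leaf in proportion to its weight, draw every
    attribute uniformly from that leaf's range, and attach the leaf's label.
    """
    rng = random.Random(seed)  # seeded so a test run is repeatable
    weights = [leaf["weight"] for leaf in leaves]
    rows = []
    for _ in range(n):
        leaf = rng.choices(leaves, weights=weights, k=1)[0]
        row = {name: rng.uniform(low, high) for name, (low, high) in leaf["ranges"].items()}
        row["label"] = leaf["label"]
        rows.append(row)
    return rows


def classify(row):
    """The same hypothetical tree written as nested tests, used as a check."""
    if row["income"] >= 40_000:
        return "high"
    return "low" if row["age"] < 30 else "medium"


if __name__ == "__main__":
    synthetic = generate_rows(LEAVES, 10_000)
    agree = sum(classify(r) == r["label"] for r in synthetic)
    print(f"{agree} of {len(synthetic)} synthetic rows follow the tree's rules")
```

Because every row is drawn from inside a leaf's ranges, the generated set can be made as large as needed while still following the tree's splits. Uniform sampling within a leaf is a simplifying assumption of this sketch; it says nothing about how the values were distributed inside each leaf in the original data.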