Skip to Main Content
Heterogeneous (mixed-type) data present significant challenges in both supervised and unsupervised learning. The situation is even more complicated when nominal variables have several levels (values) that make using indicator variables (for every categorical level) infeasible. With unsupervised learning, several fairly involved, computationally intensive, nonlinear multivariate techniques iteratively alternate data transformations with optimal scoring. These seek to optimize an objective on the basis of a covariance matrix. Our goal is to find a computationally efficient and flexible method for mapping categorical variables to numeric scores in mixed-type data. We attempt to go beyond optimizing second-order statistics (such as covariance) and enable distance-based methods by exploring mutual relationships or bumps of dependencies between variables. This is a new objective for a scoring method that's based on patterns learned from all the available variables.