Skip to Main Content
Consider a labeled data set of 1 terabyte in size. A salient subset might depend upon the users interests. Clearly, browsing such a large data set to find interesting areas would be very time consuming. An intelligent agent which, for a given class of user, could provide hints on areas of the data that might interest the user would be very useful. Given large data sets having categories of salience for different user classes attached to the data in them, these labeled sets of data can be used to train a decision tree to label unseen data examples with a category of salience. The training set will be much larger than usual. This paper describes an approach to generating the rules for an agent from a large training set. A set of decision trees are built in parallel on tractable size training data sets which are a subset of the original data. Each learned decision tree will be reduced to a set of rules, conflicting rules resolved and the resultant rules merged into one set. Results from cross validation experiments on a data set suggest this approach may be effectively applied to large sets of data.