This project is about decision trees in the context of large data sets. To achieve more fundamental insights on how to select split criteria, we propose a formal methodology that permits to compare multiple split criteria. We introduce a new family of split functions, which have a more stable behavior than the classical split criteria when larger and larger data sets are used to construct decision trees. Our new technique of sampling applied to large data sets guarantees that the decision trees inferred from the whole database as well as from the sampled database are almost always identical.
The grand challenge of Knowledge Discovery in Databases (KDD) is to automatically process large quantities of raw data, identify the most significant and meaningful patterns, and present this knowledge in an appropriate form for achieving the user's goal. Knowledge discovery systems face challenging problems from the real databases that tend to be very large, redundant, noisy and dynamic.
The classification process (which finds a set of models or functions that describe and distinguish data classes or concepts) is done for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
A powerful tool, used in classification and prediction, are decision trees. A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. To select the test attribute for a current node different measures of the goodness of split (Gini Index, Information Gain) split criteria are used. Decision trees can be easily converted to classification rules.
Our research is concentrated on decision trees in the context of large data sets. In general, three main questions about the induction of decision trees are investigated:
- Will the Gini Index and Information Gain split criteria select the same attribute to split on in any situation? If not, what are the conditions satisfied by the database's parameters?
- What will be the behavior of the Gini Index and Information Gain split functions when the size of the database from which the decision trees are inferred is increased? Will exist other split functions with a more stable behavior in the case of increasing size of databases?
- Will the trees built from large sets of examples, using the Gini Index and information Gain criteria, be equivalently to those constructed from the sampled sets of examples? What is the suitable procedure of sampling to be applied in order to keep the same structure and accuracy of the decision trees inferred from the un-sampled data sets?
The given answers are the following:
- A formal methodology is introduced us to analytically compare multiple split criteria.
- The efficiency of existing decision tree algorithms has been well established for relatively small data sets. Efficiency and scalability become issues of concern when these algorithms are applied to the mining of very large real-world databases. A new family of split functions is introduced and we are able to show, that some split functions of our family behave in a very stable way and, furthermore, the created trees were superior to the trees inferred using the classical split criteria.
- The enormity of some data sets used in recent practical applications prompts an investigation of whether such training sets are necessary and how they might be reduced without sacrificing accuracy. Basing the construction of decision trees using our new technique of sampling we assure the creation of trees equivalent in structure and quality with those we would have obtained if the whole database would had been used. This technique of compaction of examples of the data set used before the application of decision tree algorithm is clearly a useful weapon to have when attacking huge data sets.
Persons and institutions
 K. Stoffel and L. Raileanu, "Selecting Optimal Split Functions for Large Data Sets", in Research and Development in Intelligent Systems XVII, Proceedings of SGES International Conference on Knowledge Based Systems and Applied Artificial Intelligence , Cambridge, GB, 2000.
 L. Raileanu and K. Stoffel, "Formal Specification of a Statistical Sampling Methode for Decision Tree Induction", Proceedings of ICSC Congress on Computational Intelligence: Methods & Applications, 2001.
 L. Raileanu and K. Stoffel, "Theoretical Comparison between Gini Index and Information Gain Criteria", Proceedings of Seventh International Symposium on Artificial Intelligence and Mathematics, Ft. Lauderdale, 2002.
 L. E. Raileanu and K. Stoffel, "Theoretical comparison between the Gini Index and Information Gain criteria", Annals of Mathematics and Artificial Intelligence no. 41, vol 1, May 2004, pp. 77-93.