Decision tree


Decision trees are used for classification and value prediction as an alternative to least-squares and logistic regression methods. Instead of generating a complex mathematical function to predict classifications and values, a tree is generated where a decision is made at each fork about to which path to take based on the values of one or more of the predictor variables.

There are a large number of different decision tree algorithms. The most important distinctions between the various algorithms are:

  • whether the dependent variable being predicted can take one of several discrete values (in which case each value is predicted by a journey through the decision tree ending up at one or more leaf nodes; this is the normal case) or be a scalar value like a length or temperature (in which case each leaf node has its own, normally fairly simple, regression function to obtain the dependent value; algorithms that allow or prescribe this include M5, CART and CUBIST);
  • whether the decisions made at each node within the tree are based on single variables (the standard case) or can themselves involve regression functions (CRUISE, GUIDE and QUEST);
  • how the algorithm selects the predictor variable to split on at each node as it builds up the tree. The general idea is that the variables that more effectively split the data into groups should appear further to the left / to the top within the tree, but different algorithms use different mathematical functions to discover this, including the Gini index (CART); information entropy theory (ID3, C4.5 and C5.0); chi-squared independence (Chi-squared automatic interaction detection / CHAID); and comparing the performance of a proposed rule when applied to the data with its performance when applied to a randomly permutated version of the same data where the rule should not be expected to work (conditional inference trees);
  • whether interactions between predictor variables (collinearity, see explanation here) are taken into account when determining the tree structure;
  • whether only two, or three or more, paths can emerge from a fork node;
  • whether the tree learned by an algorithm only stops growing once there is nothing left to learn from the data or whether there is some artificial bound placed on its size;
  • whether a generated tree is pruned in a second phase after the initial tree has been created. The aim of pruning is to find and eliminate fork nodes that emerge as being unimportant to the final results because their function is duplicated by other nodes further down the tree.

    One algorithm, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), divides its training data into a growing set and a pruning set. A tree is then generated that perfectly fits the growing set (i.e. deliberate overlearning), then all fork nodes that have a negative effect on the pruning set are then removed from the tree. This procedure is then repeated until the tree structure stops changing.

A good summary and comparison that is not too long or complicated and that deals with many of the variants listed above can be found here.

Decision trees tend to be more vulnerable to overfitting the test data than regression methods. The fact that there are often many trees that would work for a particular set of data also means that a small change to the test data can easily result in a completely different tree structure. However, the great advantage of decision trees is that the models they create are easily understandable and can also be altered easily by the data scientist where there is a clear theoretical basis for doing so.

CART ID3 C4.5 C5.0 M5 GUIDE CRUISE QUEST Chi-squared Automatic Interaction Detection (CHAID) Conditional inference tree CUBIST Repeated Incremental Pruning to Produce Error Reduction RIPPER Random forest
has functional building block
FBB_Classification FBB_Value prediction
has input data type
IDT_Vector of categorical variables IDT_Vector of quantitative variables
has internal model
has output data type
ODT_Classification ODT_Quantitative variable
has learning style
has parametricity
PRM_Nonparametric PRM_Nonparametric with hyperparameter(s)
has relevance
ALG_One Rule
sometimes supports
mathematically similar to
ALG_Multivariate adaptive regression splines