This document is an overview of tree learning & commonly associated ensemble methods.
If building this document from source, install the following R packages:
install.packages(c('MASS', 'xgboost', 'rpart', 'randomForest', 'formula.tools'))
The `iris` dataset will be used throughout this document. It was chosen due to its small size & strong correlations. Prediction tasks will focus on predicting `Sepal.Length` from the other predictors; this task does not necessarily highlight where trees will be most successful, but it does make demoing easy, which is the goal.
pairs(iris[1:4], pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
The following sections cover the structure of tree models & how predictions are drawn from them following parameterization. Fitting and inferential techniques are covered later.
Trees are a data structure consisting of nodes & directed edges. A tree begins with a “root” node, which has no edges leading to it. Nodes without directed edges leading away are generally referred to as “leaf nodes”. Nodes with directed edges leading away contain a rule which describes how the node should be traversed (which directed edge to follow). Leaf nodes contain a value, which represents the end result of the tree traversal.
Predictions are drawn from trees by dropping an observation into the root node. The tree is traversed by following the rules at nodes until a leaf node is reached. The prediction is the value at the leaf node.
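To make this concrete, here's a minimal sketch in base R of a small regression tree and the traversal that produces a prediction. The field names, split points, and leaf values are my own illustrative choices, not from any package:

# A minimal, hypothetical tree: internal nodes hold a rule (feature, split point),
# leaf nodes hold a value. All numbers here are made up for illustration.
tree <- list(
  feature = 'Petal.Length', split = 4.2,
  left  = list(value = 5.2),                       # leaf: Petal.Length <= 4.2
  right = list(feature = 'Petal.Width', split = 1.8,
               left  = list(value = 6.1),          # leaf
               right = list(value = 6.8))          # leaf
)

predict_tree <- function(node, observation) {
  if (!is.null(node$value)) return(node$value)     # reached a leaf: return its value
  if (observation[[node$feature]] <= node$split) { # otherwise follow the node's rule
    predict_tree(node$left, observation)
  } else {
    predict_tree(node$right, observation)
  }
}

predict_tree(tree, iris[1, ])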
During prediction, missing values can be handled in several ways:
- with surrogate splits: back-up splitting rules, built on other features, that mimic the primary split;
- by randomly sending the observation down a branch, in proportion to the training observations per branch;
- by “fractionally” sending the observation down both branches and weighting the resulting predictions.
During training, if surrogate splits are being used, missing observations are ignored when computing the primary splits and are pushed down the tree using the surrogates. If the other methods are being used, the observations may be randomly sent down the tree (in proportion to observations per branch) or “fractionally” sent down both branches (again, in proportion to observations per branch).
Notice that none of these approaches rely on imputation, a feature somewhat unique to trees.
A tree ensemble is a collection of trees. To predict from an ensemble, the trees are each evaluated as above, potentially with some weighting, and aggregated (often with a sum or mean).
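Continuing the sketch above, ensemble prediction might look like the following; the equal weights are an arbitrary choice (the boosting methods later use the learning rate as the weight), and the names are again illustrative:

# Hypothetical ensemble: a list of trees as above, plus a weight per tree.
ensemble <- list(trees = list(tree, tree, tree), weights = c(1, 1, 1) / 3)

predict_ensemble <- function(ensemble, observation) {
  tree_predictions <- sapply(ensemble$trees, predict_tree, observation = observation)
  sum(ensemble$weights * tree_predictions)         # weighted aggregation
}

predict_ensemble(ensemble, iris[1, ])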
The following are some of the generic algorithms used to fit trees. All instances include a base R implementation for clarity (i.e. educational, not intended for production use).
The most common algorithm for fitting a single tree is known as CART (Classification And Regression Trees). The general approach, from any inner node in the tree:
1. For each candidate feature and each candidate split point, partition the node’s observations into two children.
2. Score each candidate split with a loss (for regression with squared loss, the summed squared error around each child’s mean).
3. Keep the split that most improves the loss, then recurse on each child until a stopping rule (e.g. depth, minimum node size, minimum improvement) is met.
To fit a whole tree, this algorithm can be performed in O(p n log(n)) time, meaning it scales nearly linearly with the number of predictors and observations.
Here’s an implementation of the first step for a regression tree with squared loss:
splitSearch <- function(data, response) {
  min_split_feature <- NULL
  min_split_point <- NULL
  min_split_error_sum <- Inf
  # exhaustively search every feature and every observed split point
  for (feature in colnames(data)) {
    feature_values <- data[[feature]]
    for (split_point in feature_values) {
      left_ix <- which(feature_values <= split_point)
      right_ix <- which(feature_values > split_point)
      if (length(left_ix) == 0 || length(right_ix) == 0) next
      left_response <- response[left_ix]
      right_response <- response[right_ix]
      # squared loss of the split: summed squared error around each child's mean
      split_error_sum <- sum((left_response - mean(left_response))^2) +
        sum((right_response - mean(right_response))^2)
      if (split_error_sum <= min_split_error_sum) {
        min_split_feature <- feature
        min_split_point <- split_point
        min_split_error_sum <- split_error_sum
      }
    }
  }
  # improvement relative to not splitting at all (a single mean for the node)
  error_improvement <- sum((response - mean(response))^2) - min_split_error_sum
  return(list(min_split_feature, min_split_point, error_improvement))
}
(split <- splitSearch(iris[, c('Sepal.Width', 'Petal.Length', 'Petal.Width')], iris$Sepal.Length))
## [[1]]
## [1] "Petal.Length"
##
## [[2]]
## [1] 4.2
##
## [[3]]
## [1] 62.67643
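As an aside, the double loop above costs O(n²) per feature; the O(p n log(n)) bound quoted earlier relies on sorting each feature once and then sweeping the candidate split points with running sums, rather than recomputing means from scratch. Here's a rough sketch of that idea for a single numeric feature (the function name and return shape are my own, not from any package):

# Sort-and-scan split search for one numeric feature with squared loss.
# Sorting is O(n log n); the scan over split points is O(n) using prefix sums.
splitScan <- function(feature_values, response) {
  ord <- order(feature_values)
  x <- feature_values[ord]
  y <- response[ord]
  n <- length(y)
  cum_sum <- cumsum(y)
  cum_sq  <- cumsum(y^2)
  k <- 1:(n - 1)                                   # candidate left-child sizes
  sse_left  <- cum_sq[k] - cum_sum[k]^2 / k
  sse_right <- (cum_sq[n] - cum_sq[k]) - (cum_sum[n] - cum_sum[k])^2 / (n - k)
  sse <- sse_left + sse_right
  valid <- x[k] < x[k + 1]                         # can't split between tied values
  best <- which(valid)[which.min(sse[valid])]
  list(split_point = x[best], error_sum = sse[best])
}
splitScan(iris$Petal.Length, iris$Sepal.Length)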
From here on out, the `rpart` package is used to fit individual trees for convenience; its fitting routine is similar to the above approach under the hood.
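For example, a single regression tree on the same task could be fit and inspected as follows (the formula and control values are arbitrary choices to keep the tree small):

library(rpart)

# a shallow regression tree on the same task; maxdepth is an arbitrary choice
m_tree <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                data = iris, control = rpart.control(maxdepth = 2))
m_tree                        # prints the splits and leaf means
predict(m_tree, head(iris))   # predictions are drawn by traversing the splits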
Bagging is a simple ensemble method: bootstrap the data (sample it with replacement) and fit a model to each replicate. Here’s an implementation:
tree_bagging <- function(formula, data, bootstraps,
                         control = rpart.control()) {
  trees <- lapply(1:bootstraps, function(bootstrap_ix) {
    # bootstrap replicate: sample rows with replacement, then fit a tree to it
    resample <- data[sample(1:nrow(data), nrow(data), replace = TRUE), ]
    rpart::rpart(formula, resample, control = control)
  })
  structure(list(trees = trees),
            class = 'tree_bag')
}

predict.tree_bag <- function(object, newdata, ...) {
  predictions <- sapply(object$trees, function(tree) {
    predict(tree, newdata = newdata)
  })
  # aggregate by averaging the per-tree predictions
  rowMeans(predictions)
}
prediction_form <- Sepal.Length ~ Sepal.Width + Species + Petal.Length
# example usage
m_bag <- tree_bagging(prediction_form, iris, 10,
control = rpart.control(maxdepth = 6, minsplit = 2, cp = .001))
bag_rmse <- rmse(predict(m_bag, iris), iris$Sepal.Length)
Note that bagging is model agnostic, but frequently applied in the context of decision trees.
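To illustrate that point, here's the same wrapper idea with lm as the base learner (the function and object names are illustrative only):

# Bagging with lm instead of rpart: only the base learner changes.
lm_bagging <- function(formula, data, bootstraps) {
  lapply(1:bootstraps, function(bootstrap_ix) {
    resample <- data[sample(1:nrow(data), nrow(data), replace = TRUE), ]
    lm(formula, resample)
  })
}

m_lm_bag <- lm_bagging(prediction_form, iris, 10)
# aggregate exactly as before: average the per-model predictions
lm_bag_predictions <- rowMeans(sapply(m_lm_bag, predict, newdata = iris))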
Random Forest is nearly the same as tree bagging, with one modification: at each split point, only a random subsample of the features is made available to the split search.
Here’s an iteration on the CART `splitSearch` function that could be used to fit an individual tree in the RF. Note the only change is the addition of `split_column_sample`.
splitSearchRF <- function(data, response, column_sample_proportion = .5) {
  min_split_feature <- NULL
  min_split_point <- NULL
  min_split_error_sum <- Inf
  # the only change from splitSearch: restrict the search to a random subset of columns
  split_column_sample <- sample(ncol(data), ceiling(ncol(data) * column_sample_proportion))
  for (feature in colnames(data)[split_column_sample]) {
    feature_values <- data[[feature]]
    for (split_point in feature_values) {
      left_ix <- which(feature_values <= split_point)
      right_ix <- which(feature_values > split_point)
      if (length(left_ix) == 0 || length(right_ix) == 0) next
      left_response <- response[left_ix]
      right_response <- response[right_ix]
      split_error_sum <- sum((left_response - mean(left_response))^2) +
        sum((right_response - mean(right_response))^2)
      if (split_error_sum <= min_split_error_sum) {
        min_split_feature <- feature
        min_split_point <- split_point
        min_split_error_sum <- split_error_sum
      }
    }
  }
  error_improvement <- sum((response - mean(response))^2) - min_split_error_sum
  return(list(min_split_feature, min_split_point, error_improvement))
}
set.seed(1)
(split <- splitSearchRF(iris[, c('Sepal.Width', 'Petal.Length', 'Petal.Width')], iris$Sepal.Length))
## [[1]]
## [1] "Petal.Width"
##
## [[2]]
## [1] 1.1
##
## [[3]]
## [1] 59.29
Notice that the selected split point has changed (and the error improvement is smaller) relative to the CART implementation, since a different feature set was available to the split search.
The general approach is to:
1. Initialize the ensemble’s predictions (here, at zero).
2. Fit a tree to the residuals of the current predictions.
3. Add the tree’s predictions, scaled by a learning rate, to the running predictions and update the residuals.
4. Repeat for a fixed number of trees.
In R (note a constant learning rate is used, while many applications prefer an adaptive/decaying weighting):
tree_boosting <- function(formula, data, n_trees, learning_rate,
                          control = rpart.control()) {
  predictions <- rep(0, nrow(data))
  # formula.tools::lhs.vars extracts the response column name from the formula
  response_col <- lhs.vars(formula)
  data$residuals <- data[[response_col]]
  residual_formula <- update(formula, residuals ~ .)
  trees <- vector('list', n_trees)
  for (tree_ix in 1:n_trees) {
    # each boost is a tree fit to the residuals of the ensemble so far
    boost <- rpart(residual_formula, data = data, method = 'anova', control = control)
    trees[[tree_ix]] <- boost
    boost_predictions <- learning_rate * predict(boost, data, type = 'vector')
    predictions <- predictions + boost_predictions
    data$residuals <- data$residuals - boost_predictions
  }
  structure(list(trees = trees,
                 learning_rate = rep(learning_rate, n_trees)),
            class = 'tree_boost')
}

predict.tree_boost <- function(object, newdata, ...) {
  predictions <- sapply(object$trees, function(tree) {
    predict(tree, newdata, type = 'vector')
  })
  # weight each tree's predictions by its learning rate and sum
  predictions %*% object$learning_rate
}
# example usage
m_boost <- tree_boosting(prediction_form, iris, 100, .1,
control = rpart.control(maxdepth = 3, minsplit = 2))
boost_rmse <- rmse(predict(m_boost, iris), iris$Sepal.Length)
AdaBoost is not implemented due to overlap with the previous and next boosting methods, but the main idea is that AdaBoost reweights observations according to how poorly they were predicted up to the prior round of boosting.
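For reference, the weight update in the classification-oriented AdaBoost.M1 algorithm is the following, where observations misclassified by the m-th weak learner $G_m$ (with weighted error $\mathrm{err}_m$) have their weights inflated before the next round:

$$w_i \leftarrow w_i \cdot \exp\!\left(\alpha_m \, \mathbb{1}\{y_i \neq G_m(x_i)\}\right), \qquad \alpha_m = \log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$$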
Gradient boosting is a generalization of AdaBoost with a very similar structure to Forward Stagewise Boosting. The general procedure:
1. Compute the pseudo-residuals: the negative gradient of the loss function evaluated at the current predictions.
2. Fit a tree to the pseudo-residuals.
3. Choose a step size (via line search) that minimizes the loss when the tree’s scaled predictions are added to the current predictions.
4. Update the predictions and repeat for a fixed number of trees.
Note that the gradient of the loss function is analytically specified and supplied to the algorithm.
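Concretely, the pseudo-residual for observation i at boosting round m is the negative gradient of the loss evaluated at the current model:

$$r_{im} = -\left[\frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f = f_{m-1}}$$

For squared loss this recovers the ordinary residuals used in the stagewise implementation above; for the L1 (LAD) loss used below it is simply $\mathrm{sign}\big(y_i - f_{m-1}(x_i)\big)$.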
Here’s an implementation of gradient boosting for LAD regression:
lad_gradient_boosting <- function(formula, data, n_trees,
                                  control = rpart.control()) {
  response_col <- lhs.vars(formula)
  residual_formula <- update(formula, residuals ~ .)
  # initializing this to the median would converge faster, at the cost of more complex code
  predictions <- rep(0, nrow(data))
  trees <- vector('list', n_trees)
  learning_rate <- rep(0, n_trees)
  for (tree_ix in 1:n_trees) {
    # derivative of L1 loss is {-1, 1}, with the split occurring at 0
    pseudo_residuals <- sign(data[[response_col]] - predictions)
    data$residuals <- pseudo_residuals
    boost <- rpart::rpart(residual_formula, data = data, method = 'anova',
                          control = control)
    boost_predictions <- predict(boost, newdata = data, type = 'vector')
    # line search for the step size; the objective may not be smooth,
    # so a derivative-free method (simulated annealing) is used
    step_size <- optim(fn = function(candidate_step) {
      sum(abs(data[[response_col]] - (predictions + candidate_step * boost_predictions)))
    }, par = 1, method = 'SANN')$par
    predictions <- predictions + step_size * boost_predictions
    trees[[tree_ix]] <- boost
    learning_rate[[tree_ix]] <- step_size
  }
  structure(list(trees = trees,
                 learning_rate = learning_rate),
            class = 'tree_boost')
}
# example usage
m_grad_boost <- lad_gradient_boosting(prediction_form, iris, 15,
control = rpart.control(maxdepth = 3, minsplit = 2))
grad_boost_l1 <- mean(abs(predict(m_grad_boost, iris) - iris$Sepal.Length))
In model selection & parameterization, the bias-variance tradeoff provides intuition as to how a model may perform & generalize. Familiarize yourself with this paradigm before proceeding.
A single decision tree is a very flexible method - it is a universal function approximator. This is easy to imagine if we carry a tree to the extreme case of one leaf per unique training example. In less extreme cases (i.e. where a practitioner desires a generalizable model), a single tree can still be effective in instances where the data generating mechanism operates on non-continuous & non-overlapping subspaces. Regularization techniques such as pruning and split limiting rules make single trees even more feasible.
The deeper a single tree gets:
- the more flexible it becomes, so its bias shrinks;
- the more it adapts to noise in the training sample, so its variance grows;
- the closer it gets to the one-leaf-per-observation extreme described above.
With bagging, the bootstrap simulates the data generating mechanism, and each model fit to a bootstrap replicate approximates a model fit to a fresh sample from that mechanism. By averaging predictions, it may be possible to reduce prediction variance at (nearly) fixed bias. The former claim is a consequence of the Law of Large Numbers, and the latter follows from taking the expectation of a sample average of random variables with equal mean.
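The variance claim can be made precise with the standard identity for the average of B identically distributed predictions, each with variance $\sigma^2$ and pairwise correlation $\rho$:

$$\mathrm{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

As B grows the second term vanishes, but the first does not; that remaining term is what Random Forest (below) targets by decorrelating the trees.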
A final note - observing the bias/variance curves above - bagging can only be effective if the individual models in the bag have low bias; the opposite case (low variance, high bias) implies that there isn’t much error to improve by reducing variance.
See the experiments below for a simplified example of how bagging may be effective.
The motivation for Random Forest is similar to Bagging, with one tweak. Often in practice, the bagged models (fit to bootstrap replicates) are highly correlated. Consider the case of training data with a clear hierarchy of “importance” of predictors; models across bootstrap replicates may contain very similar split points if the signal of individual predictors is stronger than the variability introduced by the bootstrap. This is a problem because the models are not attacking the error from different directions, so aggregating them does little to shrink the variance. This is simple to intuit in the case of perfectly correlated trees (all splits are equal): bagging the models gives the same result as an individual fit.
Random Forest aims to decorrelate the trees, without materially sacrificing predictive accuracy (any individual tree should have only marginally increased bias), so that aggregating them results in a larger variance reduction. Since statistics are hard, consider the MC experiment below which demonstrates this effect (namely, variance of a mean shrinks as covariance of the random sample shrinks):
## generates aggregated "predictions", which are arbitrarily chosen to be means of a Gaussian random sample
generate_predictions <- function(n_models = 50,
                                 covariance = .9,
                                 n_observations = 100) {
  # all models predict with variance 1 & have equal covariance to all other models
  covariances <- diag(1, n_models, n_models)
  for (i in 1:n_models)
    for (j in 1:n_models)
      if (i != j)
        covariances[i, j] <- covariance
  predictions <- MASS::mvrnorm(n_observations, mu = rep(0, n_models), Sigma = covariances)
  rowMeans(predictions)
}
n_trials <- 25
set.seed(55414)
## this is a demonstration of the gain of Bagging
## simulate drawing predictions from bags of varying size (including no bagging)
no_bagging_predictions <- sapply(1:n_trials, function(ix){ generate_predictions(n_models = 1) })
little_bagging_predictions <- sapply(1:n_trials, function(ix){ generate_predictions(n_models = 10) })
many_bagging_predictions <- sapply(1:n_trials, function(ix){ generate_predictions(n_models = 50) })
## compute the average (across our observations) variance of aggregated predictions across all the MC trials
plot(log(c(1, 10, 50)), c(
mean(apply(no_bagging_predictions, 1, var)),
mean(apply(little_bagging_predictions, 1, var)),
mean(apply(many_bagging_predictions, 1, var))
), xlab = 'Log number of bagged models', ylab = 'Variance in predictions')
## this is a demonstration of the gain of Random Forest
## simulate drawing predictions from bags with different degrees of correlation
high_correlation_predictions <- sapply(1:n_trials, function(ix){ generate_predictions(covariance = .9) })
middle_correlation_predictions <- sapply(1:n_trials, function(ix){ generate_predictions(covariance = .5) })
low_correlation_predictions <- sapply(1:n_trials, function(ix){ generate_predictions(covariance = .1) })
plot(c(.9, .5, .1), c(
mean(apply(high_correlation_predictions, 1, var)),
mean(apply(middle_correlation_predictions, 1, var)),
mean(apply(low_correlation_predictions, 1, var))
), xlab = 'Model prediction covariance', ylab = 'Variance in predictions')
By limiting the feature set available to a split, we ensure that the structure of trees varies to a larger extent, decorrelating the trees & their predictions.
For an excellent exposition of this concept, see chapter 8 of ISLR or chapter 9 of ESLR.
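In packaged implementations this per-split feature subsampling is exposed as a tuning parameter; in the randomForest package (used below) it is the mtry argument, e.g.:

# mtry controls how many predictors are sampled at each split
m_rf_small_mtry <- randomForest::randomForest(prediction_form, data = iris,
                                              ntree = 100, mtry = 1)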
Boosted trees are often effective because:
- each new tree is fit to the error left over by the trees before it, directly attacking the remaining bias;
- the individual trees can be kept shallow (low variance), with the learning rate further damping overfitting;
- arbitrary differentiable loss functions can be optimized, as in the LAD example above.
In general, tree ensembles suffer from poor interpretability. Model agnostic methods offer general approaches for inference & interpretation of arbitrary models, but there are several tree-specific techniques that can be used:
Feature importance for a fitted ensemble (or single tree, for that matter) can be computed for a feature x using the following approach: evaluate the model on a dataset and record its performance; permute the values of x in that dataset, breaking any relationship between x and the response; re-evaluate the model, and take the degradation in performance as the importance of x.
Note that this method depends on the data used to evaluate the model. Furthermore, the more parsimonious a model is, the less variable these importances will be (suggesting that a model validation should be run before feature importance analysis).
Examples of this technique will follow below.
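In the meantime, here's a minimal base R sketch of the permutation approach described above, applied to the bagged model fit earlier (the helper name and the choice of RMSE as the metric are mine):

# Permutation importance: how much does RMSE degrade when a feature is shuffled?
permutation_importance <- function(model, data, response, features) {
  baseline <- sqrt(mean((predict(model, data) - response)^2))
  sapply(features, function(feature) {
    shuffled <- data
    shuffled[[feature]] <- sample(shuffled[[feature]])   # break the feature/response link
    sqrt(mean((predict(model, shuffled) - response)^2)) - baseline
  })
}
permutation_importance(m_bag, iris, iris$Sepal.Length,
                       c('Sepal.Width', 'Species', 'Petal.Length'))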
For small trees and ensembles, it may be reasonable to inspect all trees. However, this is not possible in many cases.
One way to interpret an ensemble is to smoosh all the trees together, relying on them having similar structure, and view the most common features at each split point. The value shown at each feature is the summed gain, or the summed prediction for leaf nodes.
Examples of this technique will follow below.
One model-agnostic approach is to bag the model and build empirical intervals for an observation from the individual models’ predictions. This is computationally expensive but simple.
A tangential (and unsupported by literature) approach involves the RuleFit method: extract the decision rules (root-to-leaf paths) from the fitted ensemble, encode each rule as a binary feature, and fit a sparse linear model to the response using those rule features.
Parametric CIs can now be drawn from the fitted linear model; the CI may not be consistent with estimates drawn from the fitted ensemble, so be careful with this approach.
Here are several R libraries that can be used for fitting tree learners. Certainly there are many more.
XGBoost is a gradient boosted tree library implemented in C++, with APIs in several languages, including R.
design_matrix <- model.matrix(prediction_form, iris)
m_xgb <- xgboost(design_matrix, label = iris$Sepal.Length,
objective = "reg:linear", nrounds = 20, verbose = 0,
params = list(max_depth = 2))
A few notes:

- `nrounds` determines the number of trees in the ensemble
- a range of configurations can be passed in `params`; these can be used for tuning the fit of individual trees (e.g. column sampling as in RF, minimum loss reduction for splitting, tree depth, etc.)
- `objective` can be a character string naming a predefined loss function or a user-supplied loss function
- an `xgb.DMatrix` object can be used instead of an R matrix to avoid repeated copies

hist(predict(m_xgb, design_matrix), main = 'Train set Sepal.Length predictions')
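On the last note above, here's a minimal sketch of the same fit through an xgb.DMatrix via xgb.train (parameter values mirror the earlier call):

# build the DMatrix once; xgboost can then reuse it without copying the R matrix
dtrain <- xgb.DMatrix(data = design_matrix, label = iris$Sepal.Length)
m_xgb_dmat <- xgb.train(params = list(max_depth = 2, objective = "reg:linear"),
                        data = dtrain, nrounds = 20, verbose = 0)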
To plot the ensemble (using the `trees` argument to keep the output manageable):
xgb.plot.tree(model = m_xgb, trees = 1:3)
To interpret:

- `Cover` is the number of train set observations handled at the split
- `Gain` is the improvement in the loss function made by the split
- `Value` at leaf nodes is the boost prediction

Here’s a “multi-plot” for visualizing structure:
xgb.plot.multi.trees(m_xgb)
To plot feature importances:
imp <- xgb.importance(model = m_xgb)
xgb.plot.importance(imp)
The randomForest package provides a standard R implementation of the Random Forest algorithm:
m_rf <- randomForest(prediction_form, iris, ntree = 100, importance = TRUE)
hist(predict(m_rf, iris), main = 'Train set Random Forest predictions')
importance(m_rf)
##                %IncMSE IncNodePurity
## Sepal.Width   8.383693      9.667313
## Species       9.816303     25.262498
## Petal.Length 15.732319     52.338743