Gradient Boosting Decision Tree

Haozhe Xie

October 9, 2016

In the previous article, we talked about AdaBoost, which combines the outputs of weak learners into a weighted sum that represents the final output of the boosted classifier. If you know little about AdaBoost or additive models, we highly recommend reading that article first.

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

Boosting Tree

A boosting tree is based on the additive model, which can be represented as follows:

$$\begin{align}f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)\end{align}$$

where $T(x; \theta_m)$ stands for a decision tree, $M$ is the number of decision trees, and $\theta_m$ denotes the parameters of the $m$-th decision tree.

Assume that the initial boosting tree is $f_0(x) = 0$; the $m$-th step of this model is

$$\begin{align}f_m(x) = f_{m-1}(x) + T(x; \theta_m)\end{align}$$

where $f_{m-1}(x)$ is the current model. The parameters of the next decision tree $\theta_m$ can be determined by minimizing the following cost function:

$$ \theta_m^* = \text{argmin}_{\theta_m} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)) $$

For a binary classification problem with exponential loss, the boosting tree reduces to the AdaBoost classifier. In this article, we mainly focus on boosting trees for regression.

Given a training set $\mathbf{T}=\left\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)\right\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, assume that the input space has been divided into $J$ disjoint regions $R_1, R_2, \dots, R_J$. A regression tree can then be formulated as:

$$\begin{align}T(x; \theta) = \sum_{j=1}^J c_j I(x \in R_j)\end{align}$$

where $c_j$ is the constant output on region $R_j$, $\theta = \{(R_1, c_1), (R_2, c_2), \dots, (R_J, c_J)\}$ collects the partition of the input space together with these constants, and $J$ is the number of leaf nodes of the regression tree.

As mentioned above, boosting trees for regression are fit by forward stagewise modeling. Here we use squared error as the loss function:

$$\begin{align}L(y, f(x)) = (y - f(x))^2\end{align}$$

The cost can be calculated as:

$$ \begin{align} L(y, f_{m-1}(x) + T(x; \theta_m)) &= (y - f_{m-1}(x) - T(x; \theta_m))^2 \\ &= (r - T(x; \theta_m))^2 \end{align} $$

where $r = y - f_{m-1}(x)$ is the residual of the current model. In other words, with squared error the new tree only has to fit the residuals.
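To spell out what "fitting the residuals" means for a single leaf, the following short derivation (a sketch; $N_j$, the number of training points in $R_j$, and $r_i$, the residual of point $i$, are notation introduced here) shows that the optimal constant on a region is the mean residual:

$$\begin{align}\frac{\partial}{\partial c_j} \sum_{x_i \in R_j} (r_i - c_j)^2 = -2 \sum_{x_i \in R_j} (r_i - c_j) = 0 \quad \Longrightarrow \quad c_j = \frac{1}{N_j} \sum_{x_i \in R_j} r_i\end{align}$$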

The procedure of boosting trees is listed below:

Input: training set $\mathbf{T}=\left\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)\right\}$, where $\mathbf{x}_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Output: boosting tree $f_M(x)$

Initialize $f_0(x) = 0$
For $m = 1, 2, \dots, M$
- Calculate residual according to
  $r_{mi} = y_i - f_{m-1}(x_i), i = 1, 2, \dots, N$
- Obtain $T(x; \theta_m)$ by fitting a regression tree to the residuals $r_{mi}$
- Update $f_m(x) = f_{m-1}(x) + T(x; \theta_m)$
Obtain boosting tree for regression problem
$f_M(x) = \sum_{m=1}^M T(x; \theta_m)$
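The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: inputs are one-dimensional, every tree is a two-leaf stump, and the helper names (`fit_stump`, `boost`, `predict`) and the toy data are made up for this example.

```python
# Minimal sketch of the boosting-tree procedure with squared error.
# Assumptions (not from the article): 1-D inputs, depth-1 trees (stumps),
# illustrative helper names and toy data.

def fit_stump(xs, rs):
    """Fit a two-leaf regression tree to targets rs by exhaustive split search."""
    pts = sorted(set(xs))
    best = None
    for lo, hi in zip(pts, pts[1:]):
        s = (lo + hi) / 2                      # candidate split between neighbours
        left = [r for x, r in zip(xs, rs) if x <= s]
        right = [r for x, r in zip(xs, rs) if x > s]
        c1 = sum(left) / len(left)             # optimal constant = mean of targets
        c2 = sum(right) / len(right)
        cost = sum((r - c1) ** 2 for r in left) + sum((r - c2) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, s, c1, c2)
    _, s, c1, c2 = best
    return lambda x: c1 if x <= s else c2


def boost(xs, ys, n_trees):
    """Return the fitted trees T_1..T_M, starting from f_0(x) = 0."""
    trees = []
    preds = [0.0] * len(xs)                                 # f_0(x_i) = 0
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]      # r_mi
        tree = fit_stump(xs, residuals)                     # T(x; theta_m)
        trees.append(tree)
        preds = [p + tree(x) for p, x in zip(preds, xs)]    # f_m = f_{m-1} + T
    return trees


def predict(trees, x):
    """f_M(x) as the sum of all trees."""
    return sum(tree(x) for tree in trees)


# Hypothetical toy data: a noisy step function.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
trees = boost(xs, ys, n_trees=3)
sse = sum((y - predict(trees, x)) ** 2 for x, y in zip(xs, ys))
print("training squared error after 3 trees:", sse)
```

Because each stump uses the mean residual in each leaf, the training squared error can never increase from one round to the next.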

An Example of Boosting Tree

The training examples are given below:

Table 1. Training examples


$x_i$	1	2	3	4	5	6	7	8	9	10
$y_i$	5.56	5.70	5.91	6.40	6.80	7.05	8.90	8.70	9.00	9.05

To fit the first regression tree, consider the following optimization problem over the split point $s$:

$$\begin{align} \min_s\left[\min_{c_1} \sum_{x_i \in R_1} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2} (y_i - c_2)^2\right]\end{align}$$

where $R_1 = \{x | x \le s\}, R_2 = \{x | x > s\}$, $c_1 = \frac{1}{N_1} \sum_{x_i \in R_1} y_i,c_2 = \frac{1}{N_2} \sum_{x_i \in R_2} y_i$, and $N_1, N_2$ are the number of points in $R_1, R_2$ respectively.

Consider the following candidate values for $s$: $1.5, 2.5, 3.5, \dots, 9.5$, and define the cost of each candidate split as

$$\begin{align}m(s) = \min_{c_1} \sum_{x_i \in R_1} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2} (y_i - c_2)^2\end{align}$$

Take $s = 1.5$ as an example: $R_1 = \{1\}, R_2 = \{2, 3, \dots, 10\}$, and $c_1 = 5.56, c_2 = 7.50$. Therefore, $m(s) = m(1.5) = 0 + 15.72 = 15.72.$

The results of $s$ and $m(s)$ are listed below:

Table 2. m(s) values of different values of s


$s$	1.5	2.5	3.5	4.5	5.5	6.5	7.5	8.5	9.5
$m(s)$	15.72	12.07	8.36	5.78	3.91	1.93	8.01	11.73	15.74
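The split search is easy to reproduce in code. The sketch below recomputes $m(s)$ for every candidate from the data in Table 1; the function names are illustrative. Since it uses unrounded means, individual entries may differ from Table 2 in the last digit.

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]


def sse(values):
    """Squared error of values around their mean (the optimal constant)."""
    c = sum(values) / len(values)
    return sum((v - c) ** 2 for v in values)


def split_cost(s):
    """m(s): total squared error when splitting into x <= s and x > s."""
    left = [y for x, y in zip(xs, ys) if x <= s]
    right = [y for x, y in zip(xs, ys) if x > s]
    return sse(left) + sse(right)


candidates = [x + 0.5 for x in xs[:-1]]        # 1.5, 2.5, ..., 9.5
costs = {s: split_cost(s) for s in candidates}
best_s = min(costs, key=costs.get)

for s in candidates:
    print(f"s = {s}: m(s) = {costs[s]:.2f}")
print("best split:", best_s)
```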

The minimum is attained at $s = \text{argmin}_s m(s) = 6.5$, with corresponding regions $R_1 = \{1, 2, \dots, 6\}$ and $R_2 = \{7, 8, 9, 10\}$ and constants $c_1 = 6.24$ and $c_2 = 8.91$. So the regression tree $T_1(x)$ is

$$T_1 (x) = \begin{cases}6.24, & x < 6.5 \\ 8.91, & x \ge 6.5\end{cases}$$

$$f_1(x) = T_1(x)$$

The residuals of $f_1(x)$ are listed below, where $r_{2i} = y_i - f_1(x_i), i = 1, 2, \dots, 10$.

Table 3. Residual values


$x_i$	1	2	3	4	5	6	7	8	9	10
$r_{2i}$	-0.68	-0.54	-0.33	0.16	0.56	0.81	-0.01	-0.21	0.09	0.14

The cost of $f_1(x)$ is $L(y, f_1(x)) = \sum_{i=1}^{10} (y_i - f_1(x_i))^2 = 1.93$.

Following the same steps, $T_2(x)$ can be obtained by fitting the residuals in Table 3:

$$T_2 (x) = \begin{cases}-0.52, & x < 3.5 \\ 0.22, & x \ge 3.5\end{cases}$$

Then $f_2(x) = f_1(x) + T_2(x) = \begin{cases}5.72, & x < 3.5 \\ 6.46, & 3.5 \le x < 6.5 \\ 9.13, & x \ge 6.5\end{cases}$

The cost of $f_2(x)$ is $L(y, f_2(x)) = \sum_{i=1}^{10} (y_i - f_2(x_i))^2 = 0.79$.

Proceeding in the same way, we obtain:

$$T_3 (x) = \begin{cases}0.15, & x < 6.5 \\ -0.22, & x \ge 6.5\end{cases}, \quad L(y, f_3(x)) = 0.47$$

$$T_4 (x) = \begin{cases}-0.16, & x < 4.5 \\ 0.11, & x \ge 4.5\end{cases}, \quad L(y, f_4(x)) = 0.30$$

$$T_5 (x) = \begin{cases}0.07, & x < 6.5 \\ -0.11, & x \ge 6.5\end{cases}, \quad L(y, f_5(x)) = 0.23$$

$$T_6 (x) = \begin{cases}-0.15, & x < 2.5 \\ 0.04, & x \ge 2.5\end{cases}, \quad L(y, f_6(x)) = 0.17$$

Suppose the cost $0.17$ is below our error threshold, so we stop here. The final boosting tree is the sum of the six regression trees:

$$f(x) = f_6(x) = f_5(x) + T_6(x) = \begin{cases} 5.63, & x < 2.5 \\ 5.82, & 2.5 \le x < 3.5 \\ 6.56, & 3.5 \le x < 4.5 \\ 6.83, & 4.5 \le x < 6.5 \\ 8.95, & x \ge 6.5 \end{cases}$$
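As a quick arithmetic check, the six stumps reported above can be summed in code. The sketch below copies the (rounded) split points and leaf values of $T_1, \dots, T_6$ from the text and evaluates the partial sums $f_m(x)$ and their costs on the data of Table 1; the tuple layout and function names are illustrative choices.

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]

# (split point, value for x < split, value for x >= split) for T_1..T_6,
# copied from the rounded values in the text.
stumps = [
    (6.5, 6.24, 8.91),
    (3.5, -0.52, 0.22),
    (6.5, 0.15, -0.22),
    (4.5, -0.16, 0.11),
    (6.5, 0.07, -0.11),
    (2.5, -0.15, 0.04),
]


def f(x, n_trees=len(stumps)):
    """f_m(x): sum of the first n_trees stumps evaluated at x."""
    return sum(lo if x < s else hi for s, lo, hi in stumps[:n_trees])


def cost(n_trees):
    """Squared-error cost of f_m on the training set."""
    return sum((y - f(x, n_trees)) ** 2 for x, y in zip(xs, ys))


for m in range(1, len(stumps) + 1):
    print(f"m = {m}: cost = {cost(m):.2f}")
print("f_6 on each region:", [round(f(x), 2) for x in (1, 3, 4, 5, 8)])
```

Evaluating `f` at one point per region gives the five values of the piecewise function $f_6(x)$ above.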

Gradient Boosting

Look back at the boosting tree procedure: at every step we fit the new tree to the residual $r = y - f_{m-1}(x)$. This shortcut works only because we chose squared error as the loss. Differentiating it with respect to the current prediction gives

$$\begin{align}\frac{\partial L(y, f(x))}{\partial f(x)} = \frac{\partial (y - f(x))^2}{\partial f(x)} = -2\big(y - f(x)\big)\end{align}$$

so the residual is, up to a constant factor, the negative gradient of the loss. Fitting residuals is therefore really a steepest-descent step toward minimizing the cost.
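The relationship between residual and negative gradient is easy to confirm numerically. The sketch below compares a central finite-difference estimate of $\partial L / \partial f(x)$ with the residual for one illustrative pair of values; the function names and the step size `eps` are arbitrary choices for this example.

```python
def squared_loss(y, f):
    return (y - f) ** 2


def numeric_gradient(loss, y, f, eps=1e-6):
    """Central finite-difference estimate of dL/df at the prediction f."""
    return (loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


y, f = 6.40, 6.24                 # illustrative target and current prediction
residual = y - f
neg_grad = -numeric_gradient(squared_loss, y, f)

print("residual:         ", residual)
print("negative gradient:", neg_grad)   # approximately 2 * residual
```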

For a more complex loss function the residual no longer has such a tidy form, and minimizing the cost directly becomes hard. Friedman’s insight was that the negative gradient is always available, so we can fit each new tree to it instead:

$$\begin{align}-\left[\frac{\partial L(y, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}\end{align}$$

which gives the best steepest-descent step direction in the data space at $f_{m-1}(x)$.
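For concreteness, here are the negative gradients for three common losses. The first line restates the squared-error result; the other two are standard results added as illustration, with binary labels $y \in \{0, 1\}$ and the probability $p$ being notation assumed here rather than taken from the derivation above:

$$\begin{align} L = (y - f(x))^2 &\;\Longrightarrow\; -\frac{\partial L}{\partial f(x)} = 2\big(y - f(x)\big) \\ L = |y - f(x)| &\;\Longrightarrow\; -\frac{\partial L}{\partial f(x)} = \text{sign}\big(y - f(x)\big) \quad (y \ne f(x)) \\ L = -y \log p - (1 - y) \log (1 - p), \; p = \frac{1}{1 + e^{-f(x)}} &\;\Longrightarrow\; -\frac{\partial L}{\partial f(x)} = y - p \end{align}$$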

The procedure of gradient boosting decision trees is listed below:

Input: training set $\mathbf{T}=\left\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)\right\}$, where $\mathbf{x}_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Output: gradient boosting tree $f_M(x)$

Initialize $f_0(x) = \text{argmin}_{c} \sum_{i=1}^N L(y_i, c)$, where $c$ is a constant.
For $m = 1, 2, \dots, M$
- For $i = 1, 2, \dots, N$ compute
  $r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$
- Fit a regression tree to the targets $r_{mi}$ giving terminal regions $R_{mj}, j = 1, 2, \dots, J$.
- For $j = 1, 2, \dots, J$ compute
  $c_{mj} = \text{argmin}_c \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$
- Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^J c_{mj} I(x \in R_{mj})$
Output the gradient boosting tree (note that the initial constant $f_0(x)$ is part of the model)
$f_M(x) = f_0(x) + \sum_{m=1}^M \sum_{j=1}^J c_{mj} I(x \in R_{mj})$
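To illustrate that the procedure does not depend on squared error, here is a minimal sketch that instantiates it with absolute error $L(y, f) = |y - f|$. It rests on several assumptions: inputs are one-dimensional, every tree is a two-leaf stump ($J = 2$), and all function names are illustrative. For this loss the negative gradient is $\text{sign}(y - f)$, and both the initial constant and the leaf values $c_{mj}$ are medians.

```python
import statistics


def sign(v):
    return (v > 0) - (v < 0)


def best_split(xs, targets):
    """Split point of the least-squares stump fitted to the targets."""
    pts = sorted(set(xs))
    best_s, best_cost = None, float("inf")
    for lo, hi in zip(pts, pts[1:]):
        s = (lo + hi) / 2
        cost = 0.0
        for side in ([t for x, t in zip(xs, targets) if x <= s],
                     [t for x, t in zip(xs, targets) if x > s]):
            mean = sum(side) / len(side)
            cost += sum((t - mean) ** 2 for t in side)
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s


def gradient_boost_l1(xs, ys, n_trees):
    f0 = statistics.median(ys)                   # argmin_c sum |y_i - c|
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # r_mi: negative gradient of |y - f| with respect to f
        r = [sign(y - p) for y, p in zip(ys, preds)]
        s = best_split(xs, r)                    # regions R_m1, R_m2
        # c_mj: argmin_c sum |y_i - f_{m-1}(x_i) - c| is the median residual
        left = [y - p for x, y, p in zip(xs, ys, preds) if x <= s]
        right = [y - p for x, y, p in zip(xs, ys, preds) if x > s]
        c_left, c_right = statistics.median(left), statistics.median(right)
        stumps.append((s, c_left, c_right))
        preds = [p + (c_left if x <= s else c_right) for x, p in zip(xs, preds)]
    return f0, stumps


def predict(model, x):
    f0, stumps = model
    return f0 + sum(cl if x <= s else cr for s, cl, cr in stumps)


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]
model = gradient_boost_l1(xs, ys, n_trees=3)
mae_before = sum(abs(y - model[0]) for y in ys)
mae_after = sum(abs(y - predict(model, x)) for x, y in zip(xs, ys))
print("absolute error before and after boosting:", mae_before, mae_after)
```

The tree is fitted to the gradient signs by least squares, while the leaf values are re-optimized against the actual loss; this mirrors the two separate steps in the procedure.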

When the loss is squared error, $r_{mi} = -\left[\partial L / \partial f(x_i)\right] = 2\big(y_i - f_{m-1}(x_i)\big)$ is just twice the residual. The constant factor does not change which splits are chosen, and the leaf values $c_{mj}$ are re-optimized anyway, so the worked example above is already gradient boosting (apart from its initialization $f_0(x) = 0$), the special case we started from. The real payoff of the gradient view is that swapping in another differentiable loss yields a whole family of algorithms without changing the procedure: absolute error for robustness to outliers, or log-loss for classification. This is the same foundation that modern libraries such as XGBoost and LightGBM build on.

Those libraries, and Friedman’s original algorithm, add one more practical ingredient: a learning rate (also called shrinkage) $\nu \in (0, 1]$ that scales every tree before it is added to the model,

$$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^J c_{mj} I(x \in R_{mj})$$

A smaller $\nu$ takes more cautious steps (each tree corrects only a fraction of the residual), so more trees are needed to reach the same training loss. Empirically, though, small learning rates tend to give better test error (Friedman, 2001), so the learning rate $\nu$ and the number of trees $M$ trade off against each other and are tuned together. Shrinkage thus acts as a regularizer and is one of the main tools for controlling overfitting in gradient boosting.
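The effect of $\nu$ on the training loss can be seen on the data of Table 1. The sketch below assumes squared error, two-leaf stumps, and $f_0$ set to the mean of the targets; the helper names are illustrative. It only tracks training loss, so it illustrates the "more trees are needed" part of the trade-off and says nothing about test error.

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]


def fit_stump(xs, rs):
    """Least-squares stump: returns (split, left mean, right mean)."""
    pts = sorted(set(xs))
    best = None
    for lo, hi in zip(pts, pts[1:]):
        s = (lo + hi) / 2
        left = [r for x, r in zip(xs, rs) if x <= s]
        right = [r for x, r in zip(xs, rs) if x > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        cost = sum((r - c1) ** 2 for r in left) + sum((r - c2) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, s, c1, c2)
    return best[1:]


def training_losses(nu, n_trees):
    """Squared-error training loss after each boosting round with shrinkage nu."""
    f0 = sum(ys) / len(ys)                     # argmin_c sum (y_i - c)^2
    preds = [f0] * len(xs)
    losses = []
    for _ in range(n_trees):
        rs = [y - p for y, p in zip(ys, preds)]
        s, c1, c2 = fit_stump(xs, rs)
        # scale the new tree by nu before adding it to the model
        preds = [p + nu * (c1 if x <= s else c2) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)))
    return losses


full = training_losses(nu=1.0, n_trees=6)
slow = training_losses(nu=0.1, n_trees=6)
slow_long = training_losses(nu=0.1, n_trees=60)
print("nu = 1.0,  6 trees:", round(full[-1], 3))
print("nu = 0.1,  6 trees:", round(slow[-1], 3))
print("nu = 0.1, 60 trees:", round(slow_long[-1], 3))
```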

References

Li, Hang. Statistical Learning Methods. Tsinghua University Press (2012).
Friedman, Jerome H. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001): 1189-1232.
Wikipedia. Gradient boosting. https://en.wikipedia.org/wiki/Gradient_boosting
