XGBoost
Extreme Gradient Boosting
Literature: Greedy Function Approximation: A Gradient Boosting Machine, by Friedman
CART
CART = Classification and regression tree.
Decision tree with scores in each leaf value.
Regression tree ensembles: Prediction is sum of all scores predcited by each trees.
Properties:
- Invariant to scaling of inputs
- Learn higher order interaction between features
- Scalable
$$ \hat{y}_i = \sum^K_{k=1} f_k(x_i), f_k \in F $$
Gradient boosting
- $F$ is space of functions containing all regression trees
- $K$ is number of trees
- $f_k(x_i)$ is regression tree that maps a attribute to a score
Learn functions (trees) instead of weights in $R^d$.
Define objective to optimize.
Decision tree can be considered as piecewise step function:
- Need to learn splitting positions
- Need to learn height
Learning objective:
- Training loss: Fit of the functions to the points
- Regularization: Complexity of function; Number of splitting points, l2 norm of height in each segment
Objective:
$$ O = \sum^n_{i=1} l(y_i,\hat{y}_i) + \sum^K_{k=1} \Omega(f_k) $$
- $l$ is training loss
- $\Omega$ is complexity (#nodes in tree, depth, l2 norm of leaf weights,…)
⇒ Objective is predictive and simple functions
From heuristic to objective:
- Info Gain: Training loss
- Pruning: Regularization defined by #nodes
- Max depth: Constraint of the function space
- Smoothing leaf values: L2 regularization on leaf weights
Loss functions:
- Square loss $l(y_i,\hat{y}_i)=(y_i-\hat{y}_i)^2$ (common gradient boosting)
- Logistic loss $l(y_i,\hat{y}_i)=y_i \ln(1+e^{-\hat{y}_i})+(1-y_i)\ln(1+e^{\hat{y}_i})$ (LogitBoost)
Stochastic Gradient descent can not be applied, since trees are used.
Solution is additive training: Start with constant prediction, add a new function each time.
$$ \begin{align} \hat{y}_i^{(0)}&=0\\ \hat{y}_i^{(1)}&=f_1(x_i)=\hat{y}_i^{(0)} + f_1(x_i)\\ \dots&\\ \hat{y}_i^{(t)}&=\sum^t_{k=1}f_k(x_i)=\hat{y}_i^{(t-1)} + f_t(x_i) \end{align} $$ Add this function which optimizes the objective. Find $f_t$ to minimize $$ \sum^n_{i=1} l(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)) + \Omega(f_t) + const\\ =\sum^n_{i=1} (y_i -(\hat{y}_i^{(t-1)}+f_t(x_i)))^2 + \Omega(f_t) + const\\ =\sum^n_{i=1} [2(\hat{y}_i^{(t-1)} -y_i)f_t(x_i) + f_t(x_i)^2 + const] + \Omega(f_t) + const $$
Taylor expansion
Use taylor expansion to approximate a function through a power series (polynom). $$Tf(x;a) = \sum_{n=0}^\infty\frac{f^{(n)}(a)}{n!}(x-a)^n$$
is taylor series of x of dev point a; $f^{(n)}$ is n-th derivative
(Same derivative value for dev point a)
$$f(x+\Delta x) = f(x) + f'(x)\Delta x + \frac{1}{2} f''(x)\Delta x^2$$ $$\sum^n_{i=1} [l(y_i,\hat{y}_i^{(t-1)}) + g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)]$$ with $g_i=\delta_{\hat{y}^{(t-1)}} l(y_i,\hat{y}^{(t-1)})$ and $h_i=\delta^2_{\hat{y}^{(t-1)}} l(y_i,\hat{y}^{(t-1)})$
With removed constants (and square loss) $$\sum^n_{i=1} [g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \Omega(f_t)$$ So that learning function only influences $g_i$ and $h_i$ while rest stays the same.
Complexity $\Omega(f_t)$
- $\Omega(f_t) = \gamma t + \frac{1}{2} \lambda \sum^T_{j=1} w^2_j$
- Number of leaves
- L2 norm of leaf scores