====== XGBoost ======

XGBoost is short for //Extreme Gradient Boosting//.

Literature: Greedy Function Approximation: A Gradient Boosting Machine, by Friedman

===== CART =====

$$
\hat{y}_i = \sum^K_{k=1} f_k(x_i), \quad f_k \in F
$$
  
===== Gradient boosting =====

  * $F$ is the space of functions containing all regression trees
  * $K$ is the number of trees
  * $f_k(x_i)$ is a regression tree that maps an attribute vector to a score
  
Learn functions (trees) instead of weights in $R^d$.
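As a minimal illustration of the additive tree model (a sketch with hard-coded stumps and made-up values, not XGBoost's learned trees):

<code python>
# Additive tree model: y_hat_i = sum_k f_k(x_i)
import numpy as np

def stump(feature_idx, threshold, left_score, right_score):
    """A one-split regression tree f_k: attribute vector -> score."""
    def f(x):
        return left_score if x[feature_idx] < threshold else right_score
    return f

# K = 2 trees from the function space F (here restricted to depth-1 trees)
trees = [
    stump(feature_idx=0, threshold=0.5, left_score=-1.0, right_score=1.0),
    stump(feature_idx=1, threshold=2.0, left_score=0.3, right_score=-0.2),
]

def predict(x, trees):
    # The prediction is the sum of the individual tree scores.
    return sum(f_k(x) for f_k in trees)

X = np.array([[0.2, 3.0], [0.9, 1.0]])
print([predict(x, trees) for x in X])  # approximately [-1.2, 1.3]
</code>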
  
Learning objective:
  * **Training loss**: fit of the functions to the data points
  * **Regularization**: complexity of the functions, e.g. number of splitting points, L2 norm of the height in each segment
  
Objective:
  
$$
O = \sum^n_{i=1} l(y_i,\hat{y}_i) + \sum^K_{k=1} \Omega(f_k)
$$
  
  * $l$ is the training loss
  * $\Omega$ is the complexity (number of nodes in the tree, depth, L2 norm of the leaf weights, ...)
  
=> The objective favors functions that are both predictive and simple.
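A small sketch of how this objective could be evaluated (hypothetical code; square loss and the leaf-based complexity term defined further down the page are used as concrete choices):

<code python>
# O = sum_i l(y_i, y_hat_i) + sum_k Omega(f_k)
import numpy as np

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def omega(leaf_weights, gamma=1.0, lam=1.0):
    # Complexity of one tree: gamma * (#leaves) + 0.5 * lambda * sum_j w_j^2
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * len(w) + 0.5 * lam * np.sum(w ** 2)

def objective(y, y_hat, trees_leaf_weights, gamma=1.0, lam=1.0):
    training_loss = np.sum(square_loss(np.asarray(y), np.asarray(y_hat)))
    complexity = sum(omega(w, gamma, lam) for w in trees_leaf_weights)
    return training_loss + complexity

y      = [1.0, 0.0, 1.5]
y_hat  = [0.8, 0.2, 1.0]              # current ensemble predictions
leaves = [[-1.0, 1.0], [0.3, -0.2]]   # leaf weights of two trees
print(objective(y, y_hat, leaves))
</code>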
  * Logistic loss $l(y_i,\hat{y}_i)=y_i \ln(1+e^{-\hat{y}_i})+(1-y_i)\ln(1+e^{\hat{y}_i})$ (LogitBoost)
  
Stochastic gradient descent cannot be applied, since trees are used instead of numeric weight vectors.
  
The solution is **additive training**: start with a constant prediction and add a new function in each round.
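A rough sketch of the additive scheme (for simplicity each added function is a single-leaf tree fitted to the residuals under square loss; real XGBoost fits full regression trees using the gradient statistics derived below):

<code python>
# Additive training: y_hat^(0) = const, y_hat^(t) = y_hat^(t-1) + f_t(x)
import numpy as np

def fit_constant_tree(residuals):
    # Hypothetical stand-in for tree learning: one leaf whose weight
    # minimizes the square loss on the current residuals.
    leaf_weight = residuals.mean()
    return lambda X: np.full(len(X), leaf_weight)

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 + 0.5 * X[:, 0]

y_hat = np.zeros(len(X))      # constant initial prediction
trees = []
for t in range(5):            # add one new function per round
    residuals = y - y_hat     # proportional to the negative gradient of square loss
    f_t = fit_constant_tree(residuals)
    trees.append(f_t)
    y_hat = y_hat + f_t(X)    # y_hat^(t) = y_hat^(t-1) + f_t(x)

print(np.round(y_hat, 3))
</code>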
$$
\sum^n_{i=1} l(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)) + \Omega(f_t) + const\\
=\sum^n_{i=1} (y_i -(\hat{y}_i^{(t-1)}+f_t(x_i)))^2 + \Omega(f_t) + const\\
=\sum^n_{i=1} [2(\hat{y}_i^{(t-1)} -y_i)f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + const
$$

==== Taylor expansion ====

Use a Taylor expansion to approximate a function by a power series (a polynomial):

$$Tf(x;a) = \sum_{n=0}^\infty\frac{f^{(n)}(a)}{n!}(x-a)^n$$

is the Taylor series of $f$ around the expansion point $a$; $f^{(n)}$ is the $n$-th derivative.

(At the expansion point $a$ the series has the same derivative values as $f$.)

The second-order expansion is

$$f(x+\Delta x) \approx f(x) + f'(x)\Delta x + \frac{1}{2} f''(x)\Delta x^2$$

Applied to the objective, with $x = \hat{y}_i^{(t-1)}$ and $\Delta x = f_t(x_i)$:

$$\sum^n_{i=1} [l(y_i,\hat{y}_i^{(t-1)}) + g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \Omega(f_t)$$

with $g_i=\partial_{\hat{y}_i^{(t-1)}} l(y_i,\hat{y}_i^{(t-1)})$ and $h_i=\partial^2_{\hat{y}_i^{(t-1)}} l(y_i,\hat{y}_i^{(t-1)})$.

With the constant terms removed (for square loss, $g_i = 2(\hat{y}_i^{(t-1)} - y_i)$ and $h_i = 2$):

$$\sum^n_{i=1} [g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \Omega(f_t)$$

So the loss function influences the learning of $f_t$ only through $g_i$ and $h_i$; the rest stays the same.
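As an illustrative NumPy sketch (not XGBoost's internals), the gradient statistics $g_i$ and $h_i$ for the square and logistic losses:

<code python>
# g_i and h_i: first and second derivative of the loss w.r.t. y_hat^(t-1).
# Only these two vectors are needed to learn the next tree f_t.
import numpy as np

def square_loss_grad_hess(y, y_hat):
    # l = (y - y_hat)^2  =>  g = 2*(y_hat - y),  h = 2
    return 2.0 * (y_hat - y), np.full_like(y_hat, 2.0)

def logistic_loss_grad_hess(y, y_hat):
    # l = y*ln(1+e^(-y_hat)) + (1-y)*ln(1+e^(y_hat))
    # =>  g = sigmoid(y_hat) - y,  h = sigmoid(y_hat) * (1 - sigmoid(y_hat))
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.3, -0.1, 2.0])   # predictions from round t-1
for name, grad_hess in [("square", square_loss_grad_hess),
                        ("logistic", logistic_loss_grad_hess)]:
    g, h = grad_hess(y, y_hat)
    print(name, g, h)
</code>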

Complexity $\Omega(f_t)$:
  * $\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum^T_{j=1} w^2_j$
  * $T$ is the number of leaves
  * $\sum^T_{j=1} w^2_j$ is the (squared) L2 norm of the leaf scores $w_j$
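For example, with illustrative numbers: a tree with $T=3$ leaves, leaf scores $w=(2, 0.1, -1)$, $\gamma=1$ and $\lambda=1$ gives

$$\Omega(f_t) = 1 \cdot 3 + \frac{1}{2} \cdot 1 \cdot (2^2 + 0.1^2 + (-1)^2) = 3 + 2.505 = 5.505$$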