====== XGBoost ======
//Extreme Gradient Boosting//

Literature: Greedy Function Approximation: A Gradient Boosting Machine, by Friedman
  
$$
\hat{y}_i = \sum^K_{k=1} f_k(x_i), \quad f_k \in F
$$

===== Gradient boosting =====

$F$ is the space of functions containing all regression trees
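A minimal sketch of this additive form in plain NumPy (the three piecewise-constant functions below are made-up stand-ins for regression trees $f_k$; data is illustrative):

<code python>
import numpy as np

# Hand-written "regression trees" f_k: piecewise-constant functions of x
def f1(x): return np.where(x[:, 0] < 0.0, -1.0, 1.0)
def f2(x): return np.where(x[:, 1] < 0.5, 0.2, -0.3)
def f3(x): return np.where(x[:, 0] < 1.0, 0.0, 0.7)

trees = [f1, f2, f3]                 # the ensemble {f_1, ..., f_K}, each f_k in F

X = np.random.default_rng(0).normal(size=(5, 2))
y_hat = sum(f(X) for f in trees)     # y_hat_i = sum_k f_k(x_i)
print(y_hat)
</code>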
  * Logistic loss $l(y_i,\hat{y}_i)=y_i \ln(1+e^{-\hat{y}_i})+(1-y_i)\ln(1+e^{\hat{y}_i})$ (LogitBoost)
  
Stochastic gradient descent cannot be applied, since the model is built from trees rather than numeric parameters.
  
Solution is **additive training**: start with a constant prediction and add a new function each time.
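A minimal sketch of additive training for square loss (assuming scikit-learn's DecisionTreeRegressor as the new function $f_t$ in each round; data and hyperparameters are illustrative):

<code python>
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

y_hat = np.full_like(y, y.mean())          # y_hat^(0): constant prediction
trees = []

for t in range(1, 11):                     # each round adds one new function f_t
    residual = y - y_hat                   # negative gradient of the square loss
    f_t = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(f_t)
    y_hat = y_hat + f_t.predict(X)         # y_hat^(t) = y_hat^(t-1) + f_t(x_i)
</code>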
  
  
==== Taylor expansion ====
  
Use a Taylor expansion to approximate a function through a power series (a polynomial).
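Spelled out, the second-order expansion $f(x+\Delta x) \approx f(x) + f'(x)\Delta x + \frac{1}{2}f''(x)\Delta x^2$ is applied to the loss around the current prediction, with $x = \hat{y}_i^{(t-1)}$ and $\Delta x = f_t(x_i)$:

$$l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) \approx l(y_i,\hat{y}_i^{(t-1)}) + \partial_{\hat{y}_i^{(t-1)}} l(y_i,\hat{y}_i^{(t-1)}) \, f_t(x_i) + \frac{1}{2}\, \partial^2_{\hat{y}_i^{(t-1)}} l(y_i,\hat{y}_i^{(t-1)}) \, f_t^2(x_i)$$

Summing over all $i$ gives the approximated objective below.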
$$\sum^n_{i=1} [l(y_i,\hat{y}_i^{(t-1)}) + g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)]$$ with $g_i=\partial_{\hat{y}_i^{(t-1)}} l(y_i,\hat{y}_i^{(t-1)})$ and $h_i=\partial^2_{\hat{y}_i^{(t-1)}} l(y_i,\hat{y}_i^{(t-1)})$
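For the logistic loss above, these derivatives have the well-known closed form $g_i = \sigma(\hat{y}_i^{(t-1)}) - y_i$ and $h_i = \sigma(\hat{y}_i^{(t-1)})(1-\sigma(\hat{y}_i^{(t-1)}))$, e.g. in NumPy (values are illustrative):

<code python>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_hat_prev = np.array([0.5, -1.2, 2.0, 0.0])   # current predictions y_hat^(t-1)
y = np.array([1.0, 0.0, 1.0, 0.0])             # binary labels

p = sigmoid(y_hat_prev)
g = p - y          # g_i: first derivative of the logistic loss
h = p * (1.0 - p)  # h_i: second derivative, always positive
</code>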
  
With the constants removed (and square loss):
$$\sum^n_{i=1} [g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \Omega(f_t)$$
So the loss function enters the learning only through $g_i$ and $h_i$, while the rest stays the same.
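This is also why a custom objective can be plugged into XGBoost by supplying only $g_i$ and $h_i$. A minimal sketch, assuming the obj callback of the xgboost Python package's xgb.train (it receives the raw predictions and the training DMatrix; exact interface details can differ between versions):

<code python>
import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    """Return per-example g_i and h_i of the logistic loss."""
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # preds are raw scores, so apply the sigmoid
    grad = p - labels                  # g_i
    hess = p * (1.0 - p)               # h_i
    return grad, hess

# Illustrative data; the booster only ever sees g_i and h_i from the objective
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(float)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 2}, dtrain, num_boost_round=10, obj=logistic_obj)
</code>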