====== XGBoost ======
XGBoost is short for //Extreme Gradient Boosting//.

Literature: Greedy Function Approximation: A Gradient Boosting Machine, by Friedman
  
===== CART =====
$$
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F
$$
  
===== Gradient boosting =====

  * $F$ is the space of functions containing all regression trees
  * $K$ is the number of trees
  * $f_k(x_i)$ is a regression tree that maps the attributes to a score
  
Learn functions (trees) instead of weights in $R^d$.
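A minimal sketch of this idea (the stump functions and values below are illustrative assumptions): each $f_k$ maps a feature vector to a leaf score, and the ensemble prediction is the sum of the $K$ tree outputs.

<code python>
# Toy ensemble of K = 2 regression "trees" (here simple stumps, made up for illustration).
def tree_1(x):
    return 2.0 if x[0] < 3.0 else -1.0   # split on feature 0

def tree_2(x):
    return 0.5 if x[1] < 1.0 else 0.9    # split on feature 1

trees = [tree_1, tree_2]                  # the learned functions f_k

def predict(x):
    # y_hat = sum over k of f_k(x)
    return sum(f(x) for f in trees)

print(predict([2.5, 0.3]))                # 2.0 + 0.5 = 2.5
</code>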
  * Logistic loss $l(y_i,\hat{y}_i)=y_i \ln(1+e^{-\hat{y}_i})+(1-y_i)\ln(1+e^{\hat{y}_i})$ (LogitBoost); see the numeric check below
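As a sanity check (illustrative values, added here): with $p_i = 1/(1+e^{-\hat{y}_i})$ this loss equals the usual cross-entropy $-[y_i \ln p_i + (1-y_i)\ln(1-p_i)]$.

<code python>
import math

def logistic_loss(y, y_hat):
    # l(y, y_hat) = y*ln(1 + e^(-y_hat)) + (1-y)*ln(1 + e^(y_hat))
    return y * math.log(1 + math.exp(-y_hat)) + (1 - y) * math.log(1 + math.exp(y_hat))

def cross_entropy(y, y_hat):
    # Same value, written via the sigmoid probability p = 1/(1 + e^(-y_hat)).
    p = 1 / (1 + math.exp(-y_hat))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(logistic_loss(1, 0.7), cross_entropy(1, 0.7))  # both approx. 0.4032
</code>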
  
Stochastic gradient descent cannot be applied, since trees are learned rather than numeric weights.
  
Solution is **additive training**: Start with a constant prediction and add a new function each round.
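Written out (standard boosting notation; the constant start $\hat{y}_i^{(0)} = 0$ is an assumed convention):

$$\hat{y}_i^{(0)} = 0$$
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) = \sum_{k=1}^{t} f_k(x_i)$$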
  
  
==== Taylor expansion ====
  
Use a Taylor expansion to approximate a function by a power series (polynomial):
$$Tf(x;a) = \sum_{n=0}^\infty\frac{f^{(n)}(a)}{n!}(x-a)^n$$
  
is the Taylor series of $f$ around the development point $a$; $f^{(n)}$ is the $n$-th derivative.

(The series has the same derivative values as $f$ at the development point $a$.)

  
$$f(x+\Delta x) \approx f(x) + f'(x)\Delta x + \frac{1}{2} f''(x)\Delta x^2$$
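A quick numeric check of this second-order approximation ($f(x)=x^3$ and the values are made-up examples):

<code python>
# Check f(x + dx) ~ f(x) + f'(x)*dx + 0.5*f''(x)*dx^2 for f(x) = x^3.
f   = lambda x: x ** 3
df  = lambda x: 3 * x ** 2   # first derivative
d2f = lambda x: 6 * x        # second derivative

x, dx = 2.0, 0.1
exact  = f(x + dx)                                    # 9.261
approx = f(x) + df(x) * dx + 0.5 * d2f(x) * dx ** 2   # 8 + 1.2 + 0.06 = 9.26
print(exact, approx)
</code>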
Applying this expansion to the loss around the previous prediction $\hat{y}^{(t-1)}$ (with $\Delta x = f_t(x_i)$) gives
$$\sum^n_{i=1} [l(y_i,\hat{y}_i^{(t-1)}) + g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)]$$ with $g_i=\partial_{\hat{y}^{(t-1)}} l(y_i,\hat{y}^{(t-1)})$ and $h_i=\partial^2_{\hat{y}^{(t-1)}} l(y_i,\hat{y}^{(t-1)})$
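For example, with the square loss $l(y_i,\hat{y}) = (y_i - \hat{y})^2$ (a worked case for this common choice):
$$g_i = \partial_{\hat{y}^{(t-1)}} (y_i - \hat{y}^{(t-1)})^2 = 2(\hat{y}^{(t-1)} - y_i), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}} (y_i - \hat{y}^{(t-1)})^2 = 2$$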

With constants removed (here for the square loss):
$$\sum^n_{i=1} [g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \Omega(f_t)$$
So the learning of $f_t$ depends on the loss function only through $g_i$ and $h_i$; everything else stays the same.

Complexity $\Omega(f_t)$ (a small numeric sketch follows the list):
  * $\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum^T_{j=1} w^2_j$
  * $T$: number of leaves
  * $w_j$: leaf scores, penalized via their squared L2 norm
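A tiny numeric sketch of this penalty ($\gamma$, $\lambda$ and the leaf scores are made-up example values):

<code python>
# Omega(f_t) = gamma * T + 0.5 * lambda * sum_j w_j^2
gamma, lam = 1.0, 0.5
w = [2.0, -1.0, 0.5]   # leaf scores of a toy tree
T = len(w)             # number of leaves

omega = gamma * T + 0.5 * lam * sum(wj ** 2 for wj in w)
print(omega)           # 1.0*3 + 0.25*(4 + 1 + 0.25) = 4.3125
</code>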