data_mining:logistic_regression — revisions 2014/07/20 16:49 and 2018/05/10 17:46, phreazer
E.g., turn a 3-class problem into 3 binary problems: $h_\theta^{(i)}(x) = P(y=i|x;\theta)$.

Then choose the class $i$ with $\max_i h_\theta^{(i)}(x)$.
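The one-vs-all prediction rule above can be sketched as follows — a minimal numpy sketch; the function name `one_vs_all_predict` and the layout of `Theta` (one row of parameters per binary classifier) are assumptions for illustration, not from the original notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all_predict(Theta, X):
    """Predict classes via one-vs-all: one binary classifier per class.

    Theta: (K, n) matrix; row i holds the parameters of the i-th binary
    hypothesis h_theta^{(i)}. X: (m, n) feature matrix.
    Returns, per example, the class i maximizing h_theta^{(i)}(x).
    """
    H = sigmoid(X @ Theta.T)       # (m, K): one probability per class
    return np.argmax(H, axis=1)    # pick the class with max h
```

Each classifier only decides "class i vs. the rest"; the argmax then resolves the K binary scores into one class label.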
===== Addressing Overfitting =====
$\min \dots + \lambda \sum_{j=1}^n \theta_j^2$
=== L2 Regularization ===

For large $\lambda$, $W^{[l]} \to 0$.

$J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$

This results in a **simpler** network / each hidden unit has a **smaller effect**.

Another effect: when $W$ is small, $z$ has a smaller range, so the resulting activation (e.g. for tanh) stays in its roughly linear regime.
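The regularized cost above can be sketched in numpy — a minimal sketch; the function name `l2_cost` and its argument names are assumptions, and `cross_entropy` stands for the already-computed unregularized average loss:

```python
import numpy as np

def l2_cost(weights, cross_entropy, lam, m):
    """Total cost with an L2 (Frobenius-norm) penalty.

    weights: list of weight matrices W[l], one per layer;
    cross_entropy: unregularized average loss (1/m) * sum L(y_hat, y);
    lam: regularization strength lambda; m: number of training examples.
    """
    # Penalty term: (lambda / (2m)) * sum_l ||W[l]||_F^2
    penalty = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return cross_entropy + penalty
```

The penalty grows with the squared entries of every weight matrix, which is what pushes the $W^{[l]}$ toward zero for large $\lambda$.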
=== Dropout ===
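This section is empty in the notes; as a hedged illustration, inverted dropout is a common way to implement the idea — units are randomly dropped during training, and the survivors are rescaled so no correction is needed at test time. The function name `dropout_forward` is an assumption:

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Inverted dropout on an activation matrix a.

    Each unit is kept with probability keep_prob; kept activations are
    scaled by 1/keep_prob so the expected value of a is unchanged and
    the network can be used as-is at test time (no dropout then).
    """
    mask = rng.random(a.shape) < keep_prob   # which units survive
    return (a * mask) / keep_prob
```

With `keep_prob=1.0` the layer is the identity; smaller values drop more units and force the network not to rely on any single hidden unit.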
=== Gradient descent (Linear Regression) ===

Additional gradient term: $\dots + \frac{\lambda}{m} \theta_j$

$\theta_j := \theta_j \left(1-\alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum^m_{i=1} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
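The update rule above can be written as one vectorized step — a minimal numpy sketch; the function name `gd_step` is an assumption, and `X` is assumed to carry a leading bias column of ones:

```python
import numpy as np

def gd_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent step for linear regression.

    theta_j is first shrunk by the factor (1 - alpha*lambda/m)
    ("weight decay"), then moved along the usual gradient;
    theta_0 (the bias) is not regularized.
    """
    m = len(y)
    grad = (X.T @ (X @ theta - y)) / m       # (1/m) sum (h(x)-y) x_j
    decay = np.full_like(theta, 1.0 - alpha * lam / m)
    decay[0] = 1.0                            # do not shrink theta_0
    return theta * decay - alpha * grad
```

Writing the update this way makes the weight-decay interpretation explicit: each step multiplies $\theta_j$ by a factor slightly below 1 before applying the ordinary gradient step.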
=== Normal equation (Linear Regression) ===
$$(X^T X + \lambda
\begin{bmatrix}
0 & 0 & \dots & 0 \\
0 & 1 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \dots & 0 & 1
\end{bmatrix})^{-1} X^T y$$
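The closed-form solution above translates directly to numpy — a minimal sketch; the function name `normal_equation` is an assumption, and `X` is again assumed to have a bias column of ones first:

```python
import numpy as np

def normal_equation(X, y, lam):
    """Regularized closed-form solution for linear regression.

    L = diag(0, 1, ..., 1), so the intercept theta_0 is not penalized.
    For lam > 0 the matrix X^T X + lam*L is invertible even when
    X^T X alone is singular (e.g. with redundant features).
    """
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                    # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

Using `np.linalg.solve` instead of explicitly inverting the matrix is the standard, numerically safer choice.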
+ | |||
=== Gradient descent (Logistic Regression) ===

Distinguish between $\theta_0$ and $\theta_j$ ($j \geq 1$): $\theta_0$ is not regularized.

For $\theta_j$: $\dots + \frac{\lambda}{m} \theta_j$
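The regularized gradient takes the same shape as in the linear case, except that $h_\theta(x)$ is the sigmoid — a minimal numpy sketch; the function name `logistic_gradient` is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient(theta, X, y, lam):
    """Regularized gradient for logistic regression.

    h_theta(x) = sigmoid(theta^T x); the extra term (lambda/m)*theta_j
    is added only for j >= 1 — theta_0 stays unregularized.
    """
    m = len(y)
    grad = (X.T @ (sigmoid(X @ theta) - y)) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                  # distinguish theta_0 from theta_j
    return grad + reg
```

Zeroing the first component of the regularization term is exactly the $\theta_0$ vs. $\theta_j$ distinction the notes emphasize.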