data_mining:logistic_regression: revisions 2014/07/20 17:13 and 2018/05/10 17:46 – [Regularization] phreazer
E.g., turn a 3-class problem into 3 binary problems: $h_\theta^{(i)}(x) = P(y=i \mid x; \theta)$.

Then choose the class $i$ that maximizes $h_\theta^{(i)}(x)$: $\max_i h_\theta^{(i)}(x)$.
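The one-vs-all decision rule above can be sketched in NumPy; the parameter vectors in `thetas` are made-up toy values for illustration, not trained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all_predict(thetas, X):
    """thetas: one parameter vector per binary classifier.
    Returns, for each row of X, the class i with the largest h_theta^(i)(x)."""
    # probs[i, :] = sigmoid(X @ theta_i), the estimated P(y = i | x; theta)
    probs = np.stack([sigmoid(X @ th) for th in thetas])
    return np.argmax(probs, axis=0)

# toy data: 3 examples, intercept column plus 2 features (hypothetical values)
X = np.array([[1.0,  2.0,  0.1],
              [1.0,  0.1,  2.0],
              [1.0, -2.0, -2.0]])
thetas = [np.array([0.0,  1.5, -0.5]),   # binary classifier for class 0
          np.array([0.0, -0.5,  1.5]),   # binary classifier for class 1
          np.array([0.0, -1.0, -1.0])]   # binary classifier for class 2
print(one_vs_all_predict(thetas, X))     # → [0 1 2]
```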
===== Addressing Overfitting =====
$\min \dots + \lambda \sum_{j=1}^n \theta_j^2$

=== L2 Regularization ===

For large $\lambda$, $W^{[l]} \rightarrow 0$.

$J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L \|W^{[l]}\|_F^2$

This results in a **simpler** network: each hidden unit has a **smaller effect**.

Another effect: when $W$ is small, $z$ has a smaller range, so the resulting activation (e.g. for tanh) is more linear.
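The shrinkage effect of the $\frac{\lambda}{2m}\|W\|_F^2$ term can be made concrete: its gradient contribution is $\frac{\lambda}{m}W$, so each step multiplies $W$ by $(1 - \alpha\frac{\lambda}{m})$ ("weight decay"). A minimal sketch, with the data gradient set to zero to isolate the decay (all numbers are toy values):

```python
import numpy as np

def l2_decay_step(W, grad, alpha, lam, m):
    """One gradient step with L2 regularization: the lambda/m * W term
    shrinks W by the factor (1 - alpha*lam/m) before the usual update."""
    return W * (1 - alpha * lam / m) - alpha * grad

W = np.ones((2, 2))
grad = np.zeros((2, 2))      # zero data gradient: only the decay acts
W_small = l2_decay_step(W, grad, alpha=0.1, lam=1.0, m=10)   # factor 0.99
W_large = l2_decay_step(W, grad, alpha=0.1, lam=50.0, m=10)  # factor 0.5
# larger lambda pushes W harder toward 0
print(W_small[0, 0], W_large[0, 0])   # → 0.99 0.5
```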

=== Dropout ===

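The notes leave this section empty; as a hedged sketch, the commonly used "inverted dropout" variant can be written as follows (the `keep_prob` value is an arbitrary example):

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob,
    then divide by keep_prob so the expected activation is unchanged."""
    mask = rng.random(a.shape) < keep_prob   # True with probability keep_prob
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((4, 5))                          # toy activations
a_drop = inverted_dropout(a, keep_prob=0.8, rng=rng)
```

At test time, dropout is disabled and no scaling is needed, because the training-time division by `keep_prob` already keeps the activations on the same scale.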
=== Gradient descent (Linear Regression) ===
$\dots + \frac{\lambda}{m} \theta_j$

$\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

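This update rule can be sketched directly in NumPy; the toy design matrix below is chosen so that $y = X\theta$ exactly, which zeroes the data gradient and makes the shrinkage on $\theta_1$ visible in isolation:

```python
import numpy as np

def gd_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent step for linear regression;
    theta[0] (the intercept) is left unregularized."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m   # data gradient: 1/m * sum (h - y) x
    reg = (lam / m) * theta            # lambda/m * theta_j term
    reg[0] = 0.0                       # theta_0 is not regularized
    return theta - alpha * (grad + reg)

X = np.array([[1.0, 0.0], [1.0, 1.0]])
theta = np.array([1.0, 1.0])
y = X @ theta                          # exact fit: data gradient is zero
theta_new = gd_step(theta, X, y, alpha=0.1, lam=10.0)
print(theta_new)                       # → [1.  0.5]: only theta_1 shrinks
```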
=== Normal equation (Linear Regression) ===
$$\theta = \left(X^T X + \lambda
\begin{bmatrix}
0 & 0 & \dots & 0 \\
0 & 1 & \dots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \dots & 1
\end{bmatrix}\right)^{-1} X^T y$$
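The regularized normal equation translates directly to NumPy (using `np.linalg.solve` rather than an explicit inverse); the data below are toy values:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * D)^(-1) X^T y, where D is the identity
    with the (0,0) entry zeroed so the intercept is not regularized."""
    n = X.shape[1]
    D = np.eye(n)
    D[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 3.0])
theta_plain = regularized_normal_equation(X, y, lam=0.0)  # exact fit [1, 2]
theta_reg = regularized_normal_equation(X, y, lam=1.0)    # theta_1 shrinks
```

A side benefit noted in the original course context: for $\lambda > 0$ the matrix $X^T X + \lambda D$ is invertible even when $X^T X$ alone is singular.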

=== Gradient descent (Logistic Regression) ===

Distinguish between $\theta_0$ and $\theta_j$ ($\theta_0$ is not regularized)!

For $\theta_j$: $\dots + \frac{\lambda}{m} \theta_j$
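As a sketch, the regularized gradient has the same form as in the linear case, except that $h_\theta$ is the sigmoid of $X\theta$; the toy inputs below just illustrate the shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(theta, X, y, lam):
    """Regularized gradient for logistic regression: same form as for
    linear regression, but with h_theta = sigmoid(X @ theta).
    theta_0 gets no lambda/m * theta term."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                       # theta_0 is not regularized
    return grad + reg

X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0])
g = logistic_grad(np.zeros(2), X, y, lam=1.0)
print(g)    # at theta = 0, sigmoid gives 0.5 everywhere → [0. -0.25]
```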