$\min \dots + \lambda \sum_{j=1}^n \theta_j^2$
=== L2 Regularization ===

For large $\lambda$, $W^{[l]} \to 0$.

$J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$

This results in a **simpler** network: each hidden unit has a **smaller effect**.

Another effect: when $W$ is small, $z$ has a smaller range, so an activation such as tanh operates in its more linear region.
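
A minimal NumPy sketch of this regularized cost (the cross-entropy loss and all argument names are assumptions for illustration; the point is the $\frac{\lambda}{2m}$ penalty term):

<code python>
import numpy as np

def l2_regularized_cost(Y_hat, Y, weights, lam):
    """Cross-entropy cost plus the L2 (Frobenius) penalty from the formula above.

    Y_hat, Y : arrays of shape (1, m) -- predictions and labels
    weights  : list of weight matrices W[l], l = 1..L
    lam      : regularization strength lambda
    """
    m = Y.shape[1]
    # Unregularized part: average per-example cross-entropy loss
    cross_entropy = -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m
    # L2 penalty: (lambda / 2m) * sum of squared Frobenius norms of all W[l]
    l2_penalty = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return cross_entropy + l2_penalty
</code>

During backpropagation this penalty adds $\frac{\lambda}{m} W^{[l]}$ to each $dW^{[l]}$, which is why L2 regularization is also called weight decay.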
=== Dropout ===
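
This section has no body in this revision; as a hedged sketch of the standard inverted-dropout technique (the function name and keep_prob value are illustrative assumptions):

<code python>
import numpy as np

def inverted_dropout(A, keep_prob=0.8, rng=None):
    """Inverted dropout on a layer's activations A (training time only).

    Each unit is kept with probability keep_prob; kept activations are
    scaled by 1/keep_prob so the expected activation is unchanged and
    no rescaling is needed at test time.
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(A.shape) < keep_prob  # boolean keep-mask per unit
    return (A * mask) / keep_prob
</code>

At test time the full network is used without any mask.
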
=== Gradient descent (Linear Regression) ===