$w$ will be sparse.
Use a **hold-out** set to set the hyperparameter $\lambda$.
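A minimal sketch of picking $\lambda$ on a hold-out set; scikit-learn's L1-regularized ''Lasso'' and the synthetic data are assumptions, used only for illustration:

<code python>
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data, just for the sketch.
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.randn(200)

# Hold out part of the data for choosing lambda (called alpha in scikit-learn).
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.25, random_state=0)

best_lam, best_score = None, -np.inf
for lam in [0.001, 0.01, 0.1, 1.0]:
    score = Lasso(alpha=lam).fit(X_train, y_train).score(X_hold, y_hold)
    if score > best_score:
        best_lam, best_score = lam, score
print("chosen lambda:", best_lam)
</code>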
== Neural Network ==
Cost function:

$J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||_F^2$

Frobenius norm: $||W^{[l]}||_F^2 = \sum_i \sum_j (w_{ij}^{[l]})^2$
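A minimal numpy sketch of adding the Frobenius-norm penalty to the cost (the function name and the unregularized cross-entropy cost are assumptions):

<code python>
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lam, m):
    """Add the penalty (lam / (2m)) * sum_l ||W^[l]||_F^2 to the data cost."""
    frob = sum(np.sum(np.square(W)) for W in weights)  # sum of squared entries per layer
    return cross_entropy_cost + (lam / (2 * m)) * frob
</code>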
This L2 penalty is called **weight decay**, because the gradient-descent update multiplies the weights by an additional factor slightly below 1.

For large $\lambda$, $W^{[l]} \to 0$.
This results in a **simpler** network; each hidden unit has a **smaller effect**.

When $W$ is small, $z$ stays in a smaller range, so the resulting activation (e.g. for tanh) is more linear.
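A minimal numpy sketch of one gradient-descent step with the penalty, showing the decay factor $(1 - \alpha \frac{\lambda}{m})$; the learning rate, $\lambda$, shapes, and the placeholder gradient are assumptions:

<code python>
import numpy as np

m, alpha, lam = 64, 0.1, 0.7                # batch size, learning rate, lambda (assumed)
W = np.random.randn(20, 10)                 # weights of one layer
dW_from_backprop = np.random.randn(20, 10)  # placeholder gradient of the data loss

# The L2 term adds (lam / m) * W to the gradient ...
dW = dW_from_backprop + (lam / m) * W
W = W - alpha * dW
# ... which is the same as first shrinking W by the decay factor:
# W = (1 - alpha * lam / m) * W - alpha * dW_from_backprop
</code>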
==== Weight constraints ====
Layer $l=3$.

$keep.prob = 0.8$ // probability that a unit will be kept
$d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep.prob$ // random boolean mask
$a3 = np.multiply(a3, d3)$ // shut off dropped units (e.g. 50 units => ~10 units shut off)
$a3 /= keep.prob$ // inverted dropout: scale up so the expected value of $a3$ stays the same
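A minimal runnable numpy sketch of the same inverted-dropout step (the activation matrix ''a3'' is a placeholder assumption):

<code python>
import numpy as np

keep_prob = 0.8                      # probability that a unit is kept
a3 = np.random.randn(50, 32)         # placeholder activations: 50 units, 32 examples

d3 = np.random.rand(*a3.shape) < keep_prob  # boolean dropout mask
a3 = a3 * d3                         # shut off ~20% of the units
a3 /= keep_prob                      # scale up so E[a3] is unchanged

# At test time no mask and no scaling are applied: the division above
# already keeps the expected activations the same as without dropout.
</code>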