  * Dropout (randomly omit hidden units)
  * Generative pre-training

Solves variance problems (see [[data_mining:...]]).
====== Capacity control ======
===== Early stopping =====
Init with small weights. Watch performance on the validation set. If performance gets worse, stop training and go back to the point before it got worse.
With small weights, logistic units stay near zero and behave like linear units, so the network is similar to a linear network. It has no more capacity than a linear net in which the inputs are directly connected to the outputs.
Downside of early stopping: orthogonalization is no longer possible (a single mechanism both optimizes J and tries not to overfit).
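A minimal sketch of this procedure; train_one_epoch and evaluate are hypothetical helpers standing in for one training pass and the validation-set error:

<code python>
# Early-stopping sketch: keep the weights from the best validation epoch.
# train_one_epoch() and evaluate() are hypothetical helpers, not defined on this page.
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5):
    best_error = float("inf")
    best_model = None
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                    # one pass over the training data
        val_error = evaluate(model)               # watch performance on the validation set
        if val_error < best_error:
            best_error = val_error
            best_model = copy.deepcopy(model)     # remember the point before things get worse
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                             # performance keeps getting worse: stop
    return best_model                             # get back to the best point
</code>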
===== Limiting size of weights =====
$w$ will be sparse.
Use a **hold-out** test set to set the hyperparameter.
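The sparsity comes from an L1 penalty; a sketch of the presumed cost (assumed form, not stated in this section):

$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_j |w_j|$

The L1 term drives many components of $w$ to exactly zero; $\lambda$ is the hyperparameter tuned on the hold-out set.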
== Neural Network ==
Cost function:
$J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||_F^2$
Frobenius norm: $||W^{[l]}||_F^2$
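The squared Frobenius norm is simply the sum of the squared entries of $W^{[l]}$ (which has shape $(n^{[l]}, n^{[l-1]})$, with $n^{[l]}$ units in layer $l$):

$||W^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \left(w^{[l]}_{ij}\right)^2$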
Called **weight decay** (the update includes an additional multiplicative factor on the weights).
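The name comes from the gradient descent update: the penalty adds $\frac{\lambda}{m} W^{[l]}$ to the gradient, so each step shrinks the weights by a constant factor slightly below 1:

$W^{[l]} := W^{[l]} - \alpha \left( dW^{[l]}_{\text{backprop}} + \frac{\lambda}{m} W^{[l]} \right) = \left(1 - \frac{\alpha \lambda}{m}\right) W^{[l]} - \alpha\, dW^{[l]}_{\text{backprop}}$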
For large $\lambda$, $W^{[l]} \rightarrow 0$. This results in a **simpler** network: each hidden unit has a **smaller effect**.

When $W$ is small, $z$ takes on a smaller range of values, so the resulting activation (e.g. for tanh) is more nearly linear.
==== Weight constraints ====
Dropout prevents overfitting.
For each iteration: for each node, toss a coin (e.g. with probability 0.5) and eliminate the node if it loses the toss.
==== Inverted dropout ====
<code python>
import numpy as np

# Inverted dropout for layer l = 3
# (a3 = activations of layer 3 from forward prop, W4/b4 = parameters of the next layer)
keep_prob = 0.8
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean dropout mask
a3 = np.multiply(a3, d3)   # shut off units, e.g. 50 units => about 10 units shut off
a3 /= keep_prob            # a3 was reduced by ~20% => divide by 0.8 so its expected value stays the same
z4 = np.dot(W4, a3) + b4   # Z = Wa + b: expected value of z4 is unchanged
</code>
Making predictions at test time: no dropout (and, because of the inverted scaling during training, no extra scaling is needed at test time).