====== Overfitting & Parameter tuning ======

Problem: Sampling error (selection of training data).

  * Target values unreliable?
  * Sampling errors (accidental regularities of particular training cases)
  * Approach 1: Get more data
    * More data: flipping images, transforming or distorting images.
  * Approach 2: Use a model with an appropriate capacity (see capacity control below).
  * Approach 3: Average many different models (different forms), or train the model on different subsets of the training data (bagging).
  * Approach 4: Bayesian: use a single NN architecture, but average the predictions of many different weight vectors.

Regularization methods:
  * Weight decay (small weights, simpler model)
  * Weight sharing (same weights)
  * Early stopping (hold out a "fake" test set; when performance on it gets worse, stop training)
  * Model averaging
  * Bayes fitting (like model averaging)
  * Dropout (randomly omit hidden units)
  * Generative pre-training

Solves variance problems (see [[data_mining:...]]).
====== Capacity control ======
===== Early stopping =====

Init with small weights. Watch performance on a validation set; if performance gets worse, stop training and go back to the point before it got worse.

Small weights: logistic units near zero behave like linear units, so the network is similar to a linear network and has no more capacity than a linear net in which the inputs are directly connected to the outputs.

Downside of early stopping: orthogonalization is no longer possible (you simultaneously optimize J and try not to overfit).
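A minimal sketch of this early-stopping loop. The ''model'' object with ''get_weights()''/''set_weights()'' and the ''train_step''/''validation_error'' callbacks are assumptions for illustration, not part of these notes:

<code python>
import copy

def train_with_early_stopping(model, train_step, validation_error,
                              max_epochs=100, patience=10):
    # Hypothetical helpers: train_step(model) does one pass over the training data,
    # validation_error(model) returns the error on the validation set.
    best_error = float("inf")
    best_weights = copy.deepcopy(model.get_weights())
    bad_epochs = 0
    for _ in range(max_epochs):
        train_step(model)                    # one pass over the training data
        error = validation_error(model)      # watch performance on the validation set
        if error < best_error:               # still improving: remember these weights
            best_error = error
            best_weights = copy.deepcopy(model.get_weights())
            bad_epochs = 0
        else:                                # got worse
            bad_epochs += 1
            if bad_epochs >= patience:       # stop ...
                break
    model.set_weights(best_weights)          # ... and go back to the best point
    return model
</code>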
===== Limiting size of weights =====
==== Weight penalties ====

Also known as $L_1$ and $L_2$ regularization.
$L_2$ weight penalty (weight decay).
  * Many weights will be exactly zero
  * Sometimes it is better to use a weight penalty that has a negligible effect on large weights.

== Example for logistic regression: ==

$L_2$ regularization:

E.g. for logistic regression, add to the cost function $J$: $\dots + \frac{\lambda}{2m} ||w||^2_2$, where $||w||^2_2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$

$L_1$ regularization:

$\frac{\lambda}{2m} ||w||_1$

$w$ will be sparse.

Use a **hold-out** set to tune the hyperparameter $\lambda$.
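A small numpy sketch of these two penalties for logistic regression. The function name and the data layout ($X$ of shape $(n_x, m)$, $y$ of shape $(1, m)$) are assumptions for illustration:

<code python>
import numpy as np

def regularized_cost(w, b, X, y, lam, penalty="l2"):
    """Cross-entropy cost plus an L1 or L2 penalty on w (the bias b is not penalized)."""
    m = X.shape[1]
    a = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))      # sigmoid activations, shape (1, m)
    cross_entropy = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / m
    if penalty == "l2":
        reg = (lam / (2 * m)) * np.sum(np.square(w))     # (lambda/2m) * ||w||_2^2
    else:
        reg = (lam / (2 * m)) * np.sum(np.abs(w))        # (lambda/2m) * ||w||_1
    return cross_entropy + reg
</code>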

== Neural Network ==

Cost function:

$J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||_F^2$

Frobenius norm: $||W^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2$

For gradient descent:

$dW^{[l]} = \dots + \frac{\lambda}{m} W^{[l]}$

Called **weight decay**: the update $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$ multiplies $W^{[l]}$ by $(1 - \frac{\alpha \lambda}{m})$ in addition to the usual gradient step.

For large $\lambda$, $W^{[l]} \rightarrow 0$.

This results in a **simpler** network / each hidden unit has a **smaller effect**.

When $W$ is small, $z = Wa + b$ has a smaller range, so the resulting activation (e.g. for tanh) is more linear.
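A short sketch of the Frobenius-norm penalty and the weight-decay gradient above. Representing the network's weights as a plain list of matrices is an assumption for illustration:

<code python>
import numpy as np

def l2_cost_term(weights, lam, m):
    """(lambda / 2m) * sum over layers of ||W^[l]||_F^2, for a list of weight matrices."""
    return (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def grad_with_weight_decay(dW, W, lam, m):
    """Add the regularization term (lambda / m) * W^[l] to the backprop gradient dW^[l]."""
    return dW + (lam / m) * W

# Gradient descent step with weight decay:
#   W = W - alpha * grad_with_weight_decay(dW, W, lam, m)
#     = (1 - alpha * lam / m) * W - alpha * dW      # the weights are "decayed" each step
</code>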
==== Weight constraints ====
When a unit hits the limit, the effective weight penalty on all of its weights is determined by the big gradients: much more effective (Lagrange multipliers).
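A sketch of such a constraint on the length of each unit's incoming weight vector. The row-per-unit layout of ''W'' and the limit of 3.0 are assumptions for illustration:

<code python>
import numpy as np

def constrain_incoming_weights(W, max_norm=3.0):
    """Rescale any unit whose incoming weight vector exceeds max_norm in L2 length.

    W is assumed to hold one row of incoming weights per unit of the layer.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)     # length of each unit's weight vector
    scale = np.minimum(1.0, max_norm / (norms + 1e-12))  # only shrink, never grow
    return W * scale

# Typically applied after every gradient update, e.g.:
#   W1 = constrain_incoming_weights(W1 - alpha * dW1)
</code>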

===== Dropout =====
Ways to combine the output of multiple models (see the sketch below):
  * MIXTURE: Combine models by averaging their output probabilities.
  * PRODUCT: Combine by the geometric mean, e.g. $\sqrt{x \cdot y}$ for two models; the results typically sum to less than one, so renormalize.
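A tiny numpy illustration of the two combination rules (the probabilities are made-up numbers):

<code python>
import numpy as np

# Output distributions of two models over three classes (made-up numbers).
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.4, 0.4, 0.2])

# MIXTURE: arithmetic mean of the output probabilities.
mixture = (p1 + p2) / 2

# PRODUCT: geometric mean; it sums to less than one, so renormalize.
product = np.sqrt(p1 * p2)
product /= product.sum()

print(mixture)   # [0.55 0.3  0.15]
print(product)   # renormalized geometric mean
</code>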

NN with one hidden layer:
Randomly omit each hidden unit with probability 0.5, for each training sample.
This randomly samples from $2^H$ architectures.

Sampling from $2^H$ models, and each model only gets one training example (extreme bagging).
Sharing of the weights means that every model is very strongly regularized.

What to do at test time?

Use all hidden units, but halve their outgoing weights. This exactly computes the geometric mean of the predictions of all $2^H$ models.

What if we have more hidden layers?

  * Use dropout of 0.5 in every layer.
  * At test time, use the "mean net" that has all outgoing weights halved (see the sketch below). This is not the same as averaging all the separately dropped-out models, but it is a good approximation.
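A minimal sketch of this test-time "mean net", assuming standard (non-inverted) dropout with the same keep probability in every layer; representing the weights as a list of matrices is also an assumption:

<code python>
def mean_net_weights(weights, keep_prob=0.5):
    """Scale each layer's outgoing weights by the keep probability used in training.

    Approximates averaging the predictions of all 2^H dropped-out models with a
    single net. Not needed with inverted dropout, which rescales during training.
    """
    return [keep_prob * W for W in weights]
</code>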

Dropout prevents overfitting.

For each iteration: for each node, toss a coin (e.g. with probability 0.5) and eliminate the node.

==== Inverted dropout ====

<code python>
import numpy as np

# Inverted dropout for layer l = 3 (a3, W4, b4 come from forward propagation)
keep_prob = 0.8  # probability that a unit will be kept

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # boolean mask, 1 with prob. keep_prob
a3 = np.multiply(a3, d3)        # shut off ~20% of the units (e.g. 50 units => ~10 units off)
a3 /= keep_prob                 # invert: scale up by 1/0.8

# Without the division, z = W4 a3 + b4 would be reduced by ~20% in expectation;
# the scaling keeps the expected value of z the same.
z4 = np.dot(W4, a3) + b4
</code>
Making predictions at test time: no dropout (and no extra scaling, since inverted dropout already rescales during training).

**Why does it work?**

Intuition: a unit can't rely on any one feature, so it spreads out its weights, which shrinks the weights.

Different keep_prob values can be set for different layers (e.g. a lower keep_prob for layers with a lot of parameters).

Often used in computer vision.

Downside: J is no longer well-defined, which makes performance checks problematic (workaround: e.g. first set keep_prob to 1 and verify that J decreases).

===== Noise =====