====== Regularization ======

Solves variance problems (see [[data_mining:...]]).

==== Overfitting (How well does the network generalize?) ====

Causes:
  * Target values may be unreliable
  * Sampling errors (accidental regularities of particular training cases)
Regularization methods:
  * Weight decay (small weights, simpler model)
  * Weight sharing (same weights)
  * Early stopping (hold out a fake test set; stop training when performance on it gets worse)
  * Model averaging
  * Bayes fitting (like model averaging)
  * Dropout (randomly omit hidden units)
  * Generative pre-training
=== $L_1$ and $L_2$ regularization ===

== Example for logistic regression ==

$L_2$ regularization: add to the cost function $J$ the term

$\frac{\lambda}{2m} ||w||_2^2$, where $||w||_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$

$L_1$ regularization: add the term

$\frac{\lambda}{2m} ||w||_1$

With $L_1$ regularization, $w$ will be sparse.

Use a hold-out (dev) set to tune the regularization hyperparameter $\lambda$.
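A minimal numpy sketch of the $L_2$-regularized cost and gradient for logistic regression (the function names and the column-wise data layout, $X$ of shape $(n_x, m)$, are illustrative assumptions, not from these notes):

<code python>
import numpy as np

def l2_cost(w, b, X, y, lambd):
    """Cross-entropy cost plus (lambda / 2m) * ||w||_2^2."""
    m = X.shape[1]
    a = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))        # predictions, shape (1, m)
    cross_entropy = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    return cross_entropy + (lambd / (2 * m)) * np.sum(w ** 2)

def l2_gradients(w, b, X, y, lambd):
    """Gradients of the regularized cost; only dw gets the extra term."""
    m = X.shape[1]
    a = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    dw = (X @ (a - y).T) / m + (lambd / m) * w      # extra (lambda/m) * w
    db = np.mean(a - y)
    return dw, db
</code>

Note that $b$ is usually left unregularized; only the weight vector $w$ enters the penalty.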
== Neural Network ==

Cost function:

$J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$
Frobenius norm: $||W^{[l]}||_F^2 = \sum_i \sum_j (W_{ij}^{[l]})^2$
For gradient descent, the regularization adds a term to the gradient:

$dW^{[l]} = (\text{backprop term}) + \frac{\lambda}{m} W^{[l]}$

This is called **weight decay**: in the update step, the weights get an additional multiplication with a factor slightly smaller than 1.
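Plugging this gradient into the usual update with learning rate $\alpha$ makes the decay factor explicit:

$W^{[l]} := W^{[l]} - \alpha \, dW^{[l]} = \left(1 - \frac{\alpha \lambda}{m}\right) W^{[l]} - \alpha \, (\text{backprop term})$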
Large $\lambda$: weights are pushed towards zero, so $z$ takes only a small range of values where the activation is roughly linear (in case of a tanh activation function); every layer then behaves approximately linearly, i.e. a simpler model.
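A short numpy sketch of the same idea for a multi-layer net: sum the Frobenius norms of all weight matrices for the cost, and add $\frac{\lambda}{m} W^{[l]}$ to each gradient (the dictionary layout with keys "W1", "dW1", ... is an assumption for illustration):

<code python>
import numpy as np

def frobenius_penalty(parameters, lambd, m):
    """(lambda / 2m) * sum over layers of ||W^[l]||_F^2."""
    L = len([k for k in parameters if k.startswith("W")])
    return (lambd / (2 * m)) * sum(np.sum(parameters["W" + str(l)] ** 2)
                                   for l in range(1, L + 1))

def add_weight_decay_terms(grads, parameters, lambd, m):
    """Add (lambda/m) * W^[l] to each dW^[l] coming from backprop."""
    L = len([k for k in parameters if k.startswith("W")])
    for l in range(1, L + 1):
        grads["dW" + str(l)] += (lambd / m) * parameters["W" + str(l)]
    return grads
</code>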
===== Dropout =====

Ways to combine the outputs of multiple models:
  * MIXTURE: combine models by taking the arithmetic mean of their output probabilities.
  * PRODUCT: combine by the geometric mean (e.g. $\sqrt{x \cdot y}$ for two models); the resulting values typically sum to less than one, so they have to be renormalized.
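A tiny numeric illustration of the two combination rules for two models' output distributions over three classes (the numbers are made up):

<code python>
import numpy as np

p1 = np.array([0.7, 0.2, 0.1])          # model 1 output probabilities
p2 = np.array([0.4, 0.5, 0.1])          # model 2 output probabilities

mixture = (p1 + p2) / 2                  # arithmetic mean, already sums to 1

product = np.sqrt(p1 * p2)               # geometric mean, sums to < 1 here
product = product / product.sum()        # renormalize into a distribution

print(mixture, product)
</code>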
Consider a NN with one hidden layer of $H$ units. For each training sample, randomly omit each hidden unit with probability 0.5. This corresponds to randomly sampling from $2^H$ architectures.

We are sampling from $2^H$ models, and each model only gets one training example (extreme bagging). Sharing the weights means that every model is very strongly regularized.
What to do at test time?

Use all hidden units, but halve their outgoing weights. For a single hidden layer with a softmax output, this exactly computes the geometric mean of the predictions of all $2^H$ models.
What if we have more hidden layers?

  * Use dropout with probability 0.5 in every layer.
  * At test time, use the "mean net" that has all outgoing weights halved. This is not the same as averaging all the separate dropped-out models, but it is a good approximation (see the numeric check below).
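A minimal numeric check of the weight-halving rule (all names and numbers are illustrative): with dropout probability 0.5, keeping all hidden units and halving the outgoing weights gives the same expected pre-activation at the next layer as averaging over many random dropout masks.

<code python>
import numpy as np

rng = np.random.RandomState(0)
h = rng.rand(10)                  # hidden activations for one case
w = rng.randn(10)                 # outgoing weights to one output unit

# Average pre-activation over many random 0.5-dropout masks
masks = (rng.rand(100000, 10) < 0.5).astype(float)
expected = np.mean(masks @ (h * w))

# "Mean net": keep all hidden units, halve the outgoing weights
mean_net = h @ (0.5 * w)

print(expected, mean_net)         # approximately equal
</code>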
Dropout prevents overfitting.

For each training example and for each node, toss a coin (e.g. with probability 0.5) and eliminate the node accordingly.
==== Inverted dropout ====

Illustration for layer $l=3$ with keep probability 0.8:

<code python>
import numpy as np
# a3: activations of layer 3 from forward propagation, shape (units, examples)
keep_prob = 0.8
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # mask: keep with prob 0.8
a3 = np.multiply(a3, d3)   # shut off dropped units (e.g. 50 units => ~10 shut off)
a3 /= keep_prob            # a3 was reduced by ~20%; dividing by 0.8 keeps E[a3] the same
z4 = np.dot(W4, a3) + b4   # Z = Wa + b, expected value stays the same
</code>
Making predictions at test time: no dropout is applied, and no extra scaling is needed, because the activations were already divided by keep_prob during training.
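A self-contained sketch of one dense layer with inverted dropout, so that the same function works at training and test time (the helper name, the ReLU activation, and the shapes are assumptions for illustration):

<code python>
import numpy as np

def dense_dropout_forward(a_prev, W, b, keep_prob, train=True):
    """Dense layer followed by inverted dropout.

    During training, surviving units are divided by keep_prob,
    so no rescaling is needed at test time.
    """
    z = np.dot(W, a_prev) + b
    a = np.maximum(0, z)                            # ReLU (assumption)
    if train:
        mask = np.random.rand(*a.shape) < keep_prob
        a = a * mask / keep_prob                    # inverted dropout
    return a

# Usage: identical call at train and test time, only the flag changes.
rng = np.random.RandomState(1)
a_prev = rng.rand(5, 3)                             # 5 units, 3 examples
W, b = rng.randn(4, 5), np.zeros((4, 1))
a_train = dense_dropout_forward(a_prev, W, b, keep_prob=0.8, train=True)
a_test = dense_dropout_forward(a_prev, W, b, keep_prob=0.8, train=False)
</code>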
**Why does it work?**

Intuition: a unit can't rely on any one input feature (it might be dropped), so it has to spread out its weights over many inputs; this shrinks the weights, similar to $L_2$ regularization.

Different keep_prob values can be set for different layers (e.g. a lower keep_prob for layers with a lot of parameters, which are more prone to overfitting).

Used a lot in computer vision, where inputs are high-dimensional and there is rarely enough data.

Downside: the cost function $J$ is no longer well-defined, so checking training progress by monitoring $J$ becomes problematic (workaround: set keep_prob to 1 first, verify that $J$ decreases, then turn dropout back on).