====== Regularization ======

Addresses variance problems (see [[data_mining:error_analysis|Error Analysis]]).

===== Overfitting (How well does the network generalize?) =====

  * Target values unreliable?
  * Sampling errors (accidental regularities of particular training cases)

Regularization methods:
  * Weight decay (small weights, simpler model)
  * Weight sharing (same weights)
  * Early stopping (hold out a validation set; stop training when its performance gets worse)
  * Model averaging
  * Bayesian fitting (similar to model averaging)
  * Dropout (randomly omit hidden units)
  * Generative pre-training

===== $L_1$ and $L_2$ regularization =====

==== Example for logistic regression ====

$L_2$ regularization:

For logistic regression, add to the cost function $J$: $\dots + \frac{\lambda}{2m} ||w||_2^2$, where $||w||_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$.

$L_1$ regularization:

$\frac{\lambda}{2m} ||w||_1$

$w$ will be sparse.

Use a hold-out set to tune the hyperparameter $\lambda$.

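A minimal numpy sketch of the $L_2$-regularized logistic regression cost (the $L_1$ variant is noted in a comment). The names w, b, X, Y and the helper logistic_cost_l2 are illustrative assumptions, not from these notes:

<code python>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_l2(w, b, X, Y, lambd):
    """Cross-entropy cost plus the (lambda / 2m) * ||w||_2^2 penalty.

    w: (n_x, 1) weights, b: scalar bias, X: (n_x, m) inputs, Y: (1, m) labels.
    """
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                        # predictions, shape (1, m)
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))  # = (lambda / 2m) * w^T w
    # L1 variant: (lambd / (2 * m)) * np.sum(np.abs(w)) -- tends to make w sparse
    return cross_entropy + l2_penalty
</code>
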
==== Neural network ====

Cost function:

$J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$

Frobenius norm: $||W^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (W_{ij}^{[l]})^2$

For gradient descent:

$dW^{[l]} = \dots + \frac{\lambda}{m} W^{[l]}$

Called **weight decay**: the update $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$ first multiplies $W^{[l]}$ by $(1 - \frac{\alpha \lambda}{m})$ before the usual gradient step.

Large $\lambda$: weights are pushed toward zero, so $z$ takes a small range of values; with a tanh activation this is the roughly linear regime, so every layer behaves approximately linearly.

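A sketch of the Frobenius-norm penalty and the resulting weight-decay update for one layer. The dict of per-layer weights, the backprop gradient dW_backprop, and the function names are illustrative assumptions:

<code python>
import numpy as np

def frobenius_penalty(weights, lambd, m):
    """(lambda / 2m) * sum_l ||W^[l]||_F^2, added to the unregularized cost.

    weights: dict mapping layer index l to the matrix W^[l].
    """
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights.values())

def weight_decay_step(W, dW_backprop, lambd, m, alpha):
    """One gradient step with the L2 term: dW = dW_backprop + (lambda / m) * W.

    Equivalent to shrinking W by (1 - alpha * lambda / m) before the usual step.
    """
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW
</code>
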
===== Dropout =====

Ways to combine the output of multiple models (see the sketch below):
  * MIXTURE: Combine models by averaging their output probabilities (arithmetic mean).
  * PRODUCT: Combine models by the geometric mean of their output probabilities, e.g. $\sqrt{p_1 \cdot p_2}$ for two models; this typically sums to less than one, so renormalize by dividing by the sum.

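A small sketch (illustrative values, not from the notes) contrasting the two rules for two models' class-probability vectors p1 and p2:

<code python>
import numpy as np

p1 = np.array([0.7, 0.2, 0.1])   # class probabilities from model 1
p2 = np.array([0.5, 0.4, 0.1])   # class probabilities from model 2

mixture = (p1 + p2) / 2          # arithmetic mean, still sums to 1

geo = np.sqrt(p1 * p2)           # geometric mean, sums to less than 1 here
product = geo / np.sum(geo)      # renormalize to obtain a distribution
</code>
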
Consider a NN with one hidden layer of $H$ units. For each training example, randomly omit each hidden unit with probability 0.5. This amounts to randomly sampling from $2^H$ architectures.

We sample from $2^H$ models, and each model only gets one training example (an extreme form of bagging). The sharing of the weights means that every model is very strongly regularized.

What to do at test time?

Use all hidden units, but halve their outgoing weights. This exactly computes the geometric mean of the predictions of all $2^H$ models.

What if we have more hidden layers?

  * Use dropout of 0.5 in every layer.
  * At test time, use the "mean net" that has all outgoing weights halved. This is not the same as averaging all the separate dropped-out models, but it is a good approximation.

Dropout prevents overfitting.

For each training example: for each node, toss a coin (e.g. with probability 0.5) and eliminate nodes accordingly.

==== Inverted dropout ====

Example for layer $l=3$ with keep probability 0.8 (numpy; a3 are the layer-3 activations, W4 and b4 the following layer's parameters):

<code python>
import numpy as np

keep_prob = 0.8
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean dropout mask
a3 = np.multiply(a3, d3)  # shut off ~20% of the units (e.g. with 50 units, ~10 are shut off)
a3 /= keep_prob           # a3 was reduced by ~20%; dividing by 0.8 keeps the expected value of z the same
z4 = np.dot(W4, a3) + b4  # Z = Wa + b, computed with the scaled activations
</code>

Making predictions at test time: no dropout (and no extra scaling, since inverted dropout already compensated during training).

**Why does it work?**

Intuition: a unit can't rely on any single feature, so it has to spread out its weights, which has the effect of shrinking them.

Different keep_prob values can be set for different layers (e.g. a lower keep_prob for layers with a lot of parameters); see the sketch below.

Dropout is widely used in computer vision.

Downside: $J$ is no longer well-defined, so checking that training makes progress is problematic (workaround: set keep_prob to 1, verify that $J$ decreases, then turn dropout back on).

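A sketch of a forward pass with per-layer keep probabilities. The keep_probs values, the params layout, and the function name are illustrative assumptions, not from these notes:

<code python>
import numpy as np

# Lower keep_prob for layers with many parameters (illustrative values).
keep_probs = {1: 0.8, 2: 0.5, 3: 1.0}

def forward_with_dropout(X, params, keep_probs, train=True):
    """Forward pass through len(keep_probs) layers with inverted dropout.

    params holds W1, b1, W2, b2, ...; keep_probs[l] is the keep probability
    applied to layer l's activations (1.0 = no dropout, e.g. for the output layer).
    """
    a = X
    for l in sorted(keep_probs):
        z = np.dot(params["W%d" % l], a) + params["b%d" % l]
        a = np.tanh(z)
        if train and keep_probs[l] < 1.0:
            mask = np.random.rand(*a.shape) < keep_probs[l]
            a = a * mask / keep_probs[l]   # inverted dropout: rescale so E[a] is unchanged
    return a

# Sanity check on J: call with train=False (equivalently, all keep_probs = 1.0),
# so dropout is off and the cost J is well-defined.
</code>
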
===== More regularization techniques =====

Data augmentation (more data): flipping, transforming, or distorting images, as in the sketch below.

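A minimal numpy sketch of such augmentation, assuming an image array of shape (height, width, channels); the function name and crop fraction are illustrative:

<code python>
import numpy as np

def augment(image, crop_fraction=0.9):
    """Return two simple variants of an image: a horizontal flip and a random crop."""
    flipped = image[:, ::-1, :]                     # mirror left-right
    h, w, _ = image.shape
    ch, cw = int(crop_fraction * h), int(crop_fraction * w)
    top = np.random.randint(0, h - ch + 1)
    left = np.random.randint(0, w - cw + 1)
    cropped = image[top:top + ch, left:left + cw, :]
    return flipped, cropped
</code>
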
Early stopping: plot the training error (or $J$) and the dev set error; choose the iteration at which the dev set error is lowest (a sketch follows below).

Downside of early stopping: orthogonalization is no longer possible, because a single knob now couples optimizing $J$ with trying not to overfit.
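
A sketch of the early-stopping loop; train_one_epoch and dev_error are hypothetical callables supplied by the caller, not a real API:

<code python>
def train_with_early_stopping(train_one_epoch, dev_error, max_epochs=100):
    """Keep the parameters from the epoch with the lowest dev set error.

    train_one_epoch() runs one epoch of training and returns the current parameters;
    dev_error(params) returns the dev set error for those parameters.
    """
    best_error = float("inf")
    best_params = None
    for epoch in range(max_epochs):
        params = train_one_epoch()
        err = dev_error(params)
        if err < best_error:                 # dev error improved: remember these parameters
            best_error, best_params = err, params
    return best_params, best_error
</code>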
  