  
====== Overfitting & Parameter tuning ======
  
Problem: Sampling error (selection of training data).

  * Target values unreliable?
  * Sampling errors (accidental regularities of particular training cases)
  
  * Approach 1: Get more data
      * More data can also be created by flipping, transforming or distorting existing images (see the sketch after this list).
  * Approach 2: Right capacity: enough to fit true regularities, not enough to fit spurious regularities
  * Approach 3: Average many different models, or train models on different training data (bagging)
  * Approach 4: Bayesian: single NN, but average predictions made by many different weight vectors.

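As a concrete version of "get more data" by transforming images, here is a minimal numpy sketch; the augment helper and the particular transformations are illustrative, not part of the original notes.

<code>
import numpy as np

def augment(image):
    """Create extra training samples from one image (hypothetical helper)."""
    samples = [image]
    samples.append(np.fliplr(image))                       # horizontal flip
    samples.append(np.rot90(image))                        # simple geometric transformation
    noisy = image + np.random.normal(0.0, 0.01, image.shape)
    samples.append(np.clip(noisy, 0.0, 1.0))               # slight distortion via noise
    return samples

# Usage: expand a batch of images before training
# augmented = [s for img in batch for s in augment(img)]
</code>
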
Regularization methods:
  * Weight decay (small weights, simpler model)
  * Weight sharing (same weights)
  * Early stopping ("fake" test set; when performance on it gets worse, stop training)
  * Model averaging
  * Bayesian fitting (like model averaging)
  * Dropout (randomly omit hidden units)
  * Generative pre-training

Solves variance problems (see [[data_mining:error_analysis|Error Analysis]])
  
====== Capacity control ======
  
===== Early stopping =====
Init with small weights. Plot the training error (or $J$) and the dev-set error. Watch performance on the validation set; if performance gets worse, stop training and go back to the point before things got worse.
  
Small weights: logistic units near zero behave like linear units, so the network is similar to a linear network. It has no more capacity than a linear net in which the inputs are directly connected to the output.
  
Downside of early stopping: orthogonalization is no longer possible (optimizing $J$ and trying not to overfit become coupled).
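
A minimal early-stopping loop as a sketch; train_epoch, dev_error and the weight accessors are hypothetical placeholders, not an existing API.

<code>
best_error, best_weights, patience = float("inf"), None, 0

for epoch in range(max_epochs):
    train_epoch(model)                       # one pass over the training data (placeholder)
    error = dev_error(model)                 # error on the held-out dev set (placeholder)
    if error < best_error:
        best_error, best_weights = error, model.get_weights()
        patience = 0
    else:
        patience += 1
        if patience >= 5:                    # performance kept getting worse: stop
            break

model.set_weights(best_weights)              # go back to the point before things got worse
</code>
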
===== Limiting size of weights =====
  
  
==== Weight penalties ====

Also known as $L_1$ and $L_2$ regularization.
  
$L_2$ weight penalty (weight decay).
  * Many weights will be exactly zero
  * Sometimes better to use a weight penalty that has negligible effect on large weights.

== Example for logistic regression: ==

$L_2$ regularization:

E.g. for logistic regression, add to the cost function $J$: $\dots + \frac{\lambda}{2m} ||w||_2^2$, where $||w||_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$

$L_1$ regularization:

$\frac{\lambda}{2m} ||w||_1$

$w$ will be sparse.

Use a **hold-out** set to tune the hyperparameter $\lambda$.

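A small numpy sketch of the $L_2$-regularized logistic-regression cost and gradients described above (function and variable names are illustrative):

<code>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, y, lam):
    """X: (n_x, m), y: (1, m), w: (n_x, 1). Returns L2-regularized cost and gradients."""
    m = X.shape[1]
    a = sigmoid(np.dot(w.T, X) + b)                        # predictions, shape (1, m)
    cross_entropy = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    J = cross_entropy + (lam / (2 * m)) * np.sum(w ** 2)   # + lambda/(2m) * ||w||_2^2
    dw = np.dot(X, (a - y).T) / m + (lam / m) * w          # regularization adds (lambda/m) * w
    db = np.mean(a - y)
    return J, dw, db
</code>
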
== Neural Network ==

Cost function

$J(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]}) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$

Frobenius norm: $||W^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (W_{ij}^{[l]})^2$

For gradient descent:

$dW^{[l]} = \dots + \frac{\lambda}{m} W^{[l]}$

Called **weight decay**: the update multiplies the weights by an additional factor $(1 - \frac{\alpha \lambda}{m})$.

For large $\lambda$, $W^{[l]} \to 0$.

This results in a **simpler** network / each hidden unit has a **smaller effect**.

When $W$ is small, $z$ stays in a small range, so the resulting activation (e.g. tanh) is nearly linear.
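
A small sketch of the Frobenius penalty and the weight-decay gradient term, assuming the weights are stored in a dict keyed by layer (an illustrative convention, not from the notes):

<code>
import numpy as np

def l2_penalty(W, lam, m):
    """lambda/(2m) * sum over layers of ||W[l]||_F^2"""
    return (lam / (2.0 * m)) * sum(np.sum(np.square(Wl)) for Wl in W.values())

def add_weight_decay(dW, W, lam, m):
    """Add the regularization term (lambda/m) * W[l] to every backprop gradient."""
    return {l: dW[l] + (lam / m) * W[l] for l in W}

# Gradient-descent step: the penalty shrinks each W[l] by a factor (1 - alpha*lam/m)
# W[l] = W[l] - alpha * (dW[l] + (lam / m) * W[l])
</code>
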
  
==== Weight constraints ====
  
When a unit hits the limit, the effective weight penalty on all of its weights is determined by the big gradients: much more effective (Lagrange multipliers).
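
A minimal sketch of such a constraint as a max-norm rescaling of each unit's incoming weight vector (the limit value is illustrative):

<code>
import numpy as np

def apply_max_norm(W, limit=3.0):
    """Rescale each row (one unit's incoming weight vector) whose L2 norm exceeds the limit."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)      # one norm per unit
    scale = np.minimum(1.0, limit / (norms + 1e-12))      # only shrink, never grow
    return W * scale

# Usage: apply after every gradient-descent update
# W = apply_max_norm(W - alpha * dW)
</code>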

===== Dropout =====
Ways to combine the outputs of multiple models:
  * MIXTURE: combine models by taking the arithmetic mean of their output probabilities.
  * PRODUCT: combine models by taking the geometric mean of their output probabilities; the result typically sums to less than one, so renormalize (see the sketch below): $\sqrt{p_1 p_2} \,/ \sum \sqrt{p_1 p_2}$
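
A minimal sketch of both combination rules for two predicted class distributions (the numbers are made up for illustration):

<code>
import numpy as np

p1 = np.array([0.7, 0.2, 0.1])       # output probabilities of model 1
p2 = np.array([0.3, 0.5, 0.2])       # output probabilities of model 2

mixture = (p1 + p2) / 2.0            # MIXTURE: arithmetic mean, already sums to 1

product = np.sqrt(p1 * p2)           # PRODUCT: geometric mean, sums to less than 1 ...
product /= product.sum()             # ... so renormalize
</code>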

NN with one hidden layer of $H$ units:
Randomly omit each hidden unit with probability 0.5, for each training sample.
This randomly samples from $2^H$ architectures.

Sampling from $2^H$ models, where each model only gets one training example (extreme bagging).
Sharing of the weights means that every model is very strongly regularized.

What to do at test time?

Use all hidden units, but halve their outgoing weights. This exactly computes the geometric mean of the predictions of all $2^H$ models.

What if we have more hidden layers?

  * Use dropout of 0.5 in every layer.
  * At test time, use the mean net that has all outgoing weights halved (see the sketch below). This is not the same as averaging all separately dropped-out models, but it is an approximation.
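
A minimal sketch of the test-time "mean net" for classic (non-inverted) dropout, assuming tanh hidden layers and a keep probability of 0.5 (all names are illustrative):

<code>
import numpy as np

def mean_net_forward(x, weights, biases, p_keep=0.5):
    """Forward pass with no dropout; weights are scaled by p_keep to approximate the model average."""
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(np.dot(p_keep * W, a) + b)   # halved weights replace sampling many dropped-out nets
    return a
</code>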

Dropout prevents overfitting.

For each iteration: for each node, toss a coin (e.g. with probability 0.5) and eliminate the dropped nodes.

==== Inverted dropout ====

<code>
# Inverted dropout for layer l = 3 (numpy)
keep_prob = 0.8                                            # probability that a unit will be kept

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean dropout mask
a3 = np.multiply(a3, d3)                                   # activations in layer 3: shut off dropped units
a3 /= keep_prob                                            # e.g. 50 units => ~10 units shut off; scale the rest up

z4 = np.dot(W4, a3) + b4                                   # a3 was reduced by ~20% => dividing by 0.8 keeps its expected value the same
</code>
Making predictions at test time: no dropout (and, thanks to the scaling during training, no extra rescaling is needed).

**Why does it work?**

Intuition: a unit can't rely on any one feature, so it has to spread out its weights => weights shrink.

Different keep_prob values can be set for different layers (e.g. smaller for layers with a lot of parameters).

Used in computer vision.

Downside: $J$ is no longer well-defined, so checking training progress is problematic (workaround: set keep_prob to one and verify that $J$ decreases).

  
===== Noise =====