data_mining:neural_network:overfitting

  
  * Approach 1: Get more data
    * More data: flipping images, transforming or distorting images (see the sketch below).
  * Approach 2: Right capacity: enough to fit true regularities, not enough to fit spurious regularities
  * Approach 3: Average many different models, or train models on different training data (bagging)
  * Dropout (randomly omit hidden units)
  * Generative pre-training

Solves variance problems (see [[data_mining:error_analysis|Error Analysis]])
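
A minimal sketch of approach 1 (getting more data through simple augmentation) with NumPy; the array images (shape (n, height, width, channels), pixel values in [0, 1]) and the particular transformations are illustrative assumptions, not part of the original notes:

<code python>
import numpy as np

def augment(images):
    """Create extra training examples from existing images (hypothetical helper)."""
    flipped = images[:, :, ::-1, :]                    # horizontal flip
    noise = np.random.normal(0.0, 0.01, images.shape)  # small random distortion
    distorted = np.clip(images + noise, 0.0, 1.0)      # keep pixel values in [0, 1]
    return np.concatenate([images, flipped, distorted], axis=0)  # 3x as much data
</code>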
  
====== Capacity control ======
  
===== Early stopping =====
Init with small weights. Plot the training error (or cost $J$) and the dev set error. Watch performance on the validation set; if it gets worse, stop training and go back to the point before it got worse.
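
A minimal sketch of this procedure; train_one_epoch and dev_error are hypothetical placeholders for the actual training step and dev-set evaluation:

<code python>
import copy

def early_stopping(model, patience=5, max_epochs=200):
    """Stop when the dev error has not improved for `patience` epochs,
    then return the weights from the best point seen so far."""
    best_error, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)       # hypothetical: one pass over the training data
        error = dev_error(model)     # hypothetical: error on the dev/validation set
        if error < best_error:
            best_error = error
            best_weights = copy.deepcopy(model.weights)  # remember the best point
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # performance kept getting worse: stop
                break
    model.weights = best_weights         # go back to the point before things got worse
    return model
</code>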
  
Small weights: logistic units near zero behave like linear units, so the network is similar to a linear network. It has no more capacity than a linear net in which the inputs are directly connected to the output.
  
Downside of early stopping: orthogonalization is no longer possible (you optimize $J$ and try not to overfit at the same time).
===== Limiting size of weights =====
  
$w$ will be sparse.
  
Use a **hold-out** set to set the hyperparameter $\lambda$.
  
== Neural Network ==
Cost function with regularization term:
  
$J(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$
  
Frobenius norm: $||W^{[l]}||_F^2 = \sum_i \sum_j (w_{ij}^{[l]})^2$
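
A sketch of how this cost could be computed with NumPy; it assumes losses holds the per-example losses $L(\hat{y}^{(i)}, y^{(i)})$ and weights is a list of the matrices $W^{[1]}, \dots, W^{[L]}$ (both names are illustrative):

<code python>
import numpy as np

def regularized_cost(losses, weights, lam):
    """Average per-example loss plus the penalty lambda/(2m) * sum_l ||W[l]||_F^2."""
    m = len(losses)
    data_term = np.sum(losses) / m
    penalty = sum(np.sum(np.square(W)) for W in weights)  # ||W[l]||_F^2 summed over layers
    return data_term + lam / (2 * m) * penalty
</code>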
Called **weight decay** (each update additionally multiplies the weights by a factor slightly below 1)
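
To see where the name comes from, take the gradient of the regularized cost above and plug it into the gradient descent update:

$dW^{[l]} = (\text{backprop term}) + \frac{\lambda}{m} W^{[l]}$

$W^{[l]} := W^{[l]} - \alpha \, dW^{[l]} = \left(1 - \frac{\alpha \lambda}{m}\right) W^{[l]} - \alpha \, (\text{backprop term})$

Each step multiplies $W^{[l]}$ by $1 - \frac{\alpha \lambda}{m} < 1$, i.e. the weights decay.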
  
For large $\lambda$: $W^{[l]} \rightarrow 0$
  
This results in a **simpler** network: each hidden unit has a **smaller effect**.

When $W$ is small, $z$ has a smaller range, so the resulting activation (e.g. for tanh) is more linear.
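
A quick numerical check of the near-linearity claim ($\tanh(z) \approx z$ for small $z$):

<code python>
import numpy as np

z_small = np.array([-0.1, 0.0, 0.1])
z_large = np.array([-3.0, 0.0, 3.0])
print(np.tanh(z_small))  # ≈ [-0.0997, 0, 0.0997]: tanh(z) ≈ z, nearly linear
print(np.tanh(z_large))  # ≈ [-0.995, 0, 0.995]: saturated, clearly non-linear
</code>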
  
==== Weight constraints ====
Dropout prevents overfitting.
  
For each iteration: for each node, toss a coin (e.g. with probability 0.5) and eliminate the dropped nodes.
  
==== Inverted dropout ====
  
<code python>
import numpy as np

# Inverted dropout, illustrated for layer l = 3
# a3: activations of layer 3 from forward prop; W4, b4: parameters of layer 4
keep_prob = 0.8  # probability that a unit will be kept

# dropout mask: entries are True (kept) with probability keep_prob
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

a3 = np.multiply(a3, d3)  # zero out dropped activations (a3 *= d3)
a3 /= keep_prob           # e.g. 50 units => ~10 units shut off; scale the rest up

# a3 was reduced by ~20%; dividing by 0.8 keeps its expected value the same for the next layer
z4 = np.dot(W4, a3) + b4
</code>
Making predictions at test time: no dropout is applied (and no extra scaling is needed, because the activations were already divided by keep_prob during training, so their expected values match).
  