data_mining:neural_network:overfitting

  * Dropout (randomly omit hidden units)
  * Generative pre-training

Solves variance problems (see [[data_mining:error_analysis|Error Analysis]]).
  
====== Capacity control ======
$w$ will be sparse.
  
Use a **hold-out** test set to set the hyperparameter.
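
A minimal sketch of that selection loop, assuming hypothetical helpers train_model and evaluate that stand in for whatever training and metric code is actually used:

<code>
# hypothetical helpers (not from this page): train_model fits a model for a
# given regularization strength, evaluate returns its error on the hold-out set
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]

best_lam, best_err = None, float("inf")
for lam in lambdas:
    model = train_model(X_train, y_train, lam)
    err = evaluate(model, X_holdout, y_holdout)
    if err < best_err:
        best_lam, best_err = lam, err
</code>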
  
== Neural Network ==

Cost function
  
$J(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$
  
Frobenius norm: $||W^{[l]}||_F^2 = \sum_i \sum_j (W_{ij}^{[l]})^2$ (sum of all squared entries)
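
A minimal numpy sketch of the penalty term, assuming W is a Python list of the layer weight matrices $W^{[l]}$, lam is $\lambda$ and m is the number of examples (names are illustrative, not from this page; the loss itself is omitted):

<code>
import numpy as np

def frobenius_penalty(W, lam, m):
    # lambda/(2m) times the sum over layers of the squared Frobenius norm ||W[l]||_F^2
    return (lam / (2 * m)) * sum(np.sum(np.square(Wl)) for Wl in W)

# regularized cost J = average loss + frobenius_penalty(W, lam, m)
</code>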
Called **weight decay** (the update multiplies the weights by an additional factor slightly smaller than 1).
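
With gradient descent (learning rate $\alpha$), the L2 term adds $\frac{\lambda}{m} W^{[l]}$ to the gradient, so the update becomes

$W^{[l]} := W^{[l]} - \alpha (dW^{[l]} + \frac{\lambda}{m} W^{[l]}) = (1 - \frac{\alpha \lambda}{m}) W^{[l]} - \alpha \, dW^{[l]}$

where $dW^{[l]}$ is the backprop gradient of the unregularized loss.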
  
For large $\lambda$, $W^{[l]} \rightarrow 0$.
  
This results in a **simpler** network: each hidden unit has a **smaller** effect.

When $W$ is small, $z$ takes values in a smaller range, so the activation (e.g. tanh) stays in its roughly linear regime.
  
==== Weight constraints ====
Dropout prevents overfitting.
  
For each training iteration: for each node, toss a coin (e.g. with probability 0.5) and eliminate the nodes that lose the toss.
  
==== Inverted dropout ====
  
Layer $l=3$:

<code>
import numpy as np

# inverted dropout for layer l = 3; assumes a3 (layer-3 activations) and the
# next layer's parameters W4, b4 already exist
keep_prob = 0.8                                            # probability that a unit is kept

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean dropout mask
a3 = np.multiply(a3, d3)                                   # zero out dropped units (a3 *= d3)
a3 /= keep_prob                                            # e.g. roughly 10 of 50 units shut off;
                                                           # dividing by 0.8 keeps E[a3] the same

z4 = np.dot(W4, a3) + b4   # without the scaling, z4 would be reduced by ~20% in expectation
</code>
Making predictions at test time: no dropout is applied.
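
A minimal sketch of the corresponding test-time forward pass, assuming the same variable names as above plus W3, b3, a2 for layer 3's parameters and input (tanh is only an illustrative choice of activation); because inverted dropout already rescaled the activations during training, no mask and no division by keep_prob are needed here:

<code>
# test time: use all units; no dropout mask, no division by keep_prob
a3 = np.tanh(np.dot(W3, a2) + b3)   # layer-3 activations (tanh assumed for illustration)
z4 = np.dot(W4, a3) + b4            # expected values match training thanks to the earlier scaling
</code>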
  