data_mining:neural_network:overfitting

  * Dropout (randomly omit hidden units)
  * Generative pre-training

Solves variance problems (see [[data_mining:error_analysis|Error Analysis]]).
  
====== Capacity control ======
$w$ will be sparse.
  
Use a **hold-out** test set to set the hyperparameter.
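A minimal sketch of such a hold-out search, assuming an L1-penalized linear model (scikit-learn's Lasso here) on synthetic data; the data, model choice and the candidate $\lambda$ values are illustrative, not part of the original notes:

<code python>
# Illustrative sketch: choose the regularization strength lambda on a hold-out set.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                      # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=200)

X_train, y_train = X[:150], y[:150]                # training split
X_hold,  y_hold  = X[150:], y[150:]                # hold-out split

best_lam, best_err = None, np.inf
for lam in [0.001, 0.01, 0.1, 1.0]:                # candidate hyperparameters
    model = Lasso(alpha=lam).fit(X_train, y_train)
    err = np.mean((model.predict(X_hold) - y_hold) ** 2)   # hold-out error
    if err < best_err:
        best_lam, best_err = lam, err

print("chosen lambda:", best_lam)                  # L1 penalty -> many weights exactly 0
</code>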
  
== Neural Network ==

Cost function:
  
$J(W^{[l]}, b^{[l]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$
  
Frobenius norm: $||W^{[l]}||_F^2$
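A small numpy sketch of this cost; the per-example losses and the weight matrices are passed in as plain arrays, and all names are illustrative:

<code python>
# Illustrative sketch: L2-regularized cost with the Frobenius-norm penalty.
import numpy as np

def regularized_cost(losses, weights, lam):
    """losses: per-example losses L(y_hat, y); weights: list of matrices W^[l]."""
    m = len(losses)
    data_term = np.mean(losses)                          # (1/m) * sum_i L(y_hat_i, y_i)
    frob_term = sum(np.sum(W ** 2) for W in weights)     # sum_l ||W^[l]||_F^2
    return data_term + lam / (2 * m) * frob_term
</code>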
Called **weight decay** (the update multiplies the weights by an additional shrinking factor).
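The name follows from the gradient descent update for the cost above (learning rate $\alpha$): the regularization term adds $\frac{\lambda}{m} W^{[l]}$ to the gradient, so

$W^{[l]} := W^{[l]} - \alpha \left( (\text{backprop term}) + \frac{\lambda}{m} W^{[l]} \right) = \left(1 - \frac{\alpha \lambda}{m}\right) W^{[l]} - \alpha (\text{backprop term})$

i.e. every update shrinks (decays) the weights by the factor $(1 - \frac{\alpha \lambda}{m})$.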
  
For large $\lambda$, $W^{[l]} \rightarrow 0$.

This results in a **simpler** network: each hidden unit has a **smaller** effect.

When $W$ is small, $z$ takes on a smaller range of values, so the resulting activation (e.g. for tanh, which is roughly linear near 0) is more linear.
  
==== Weight constraints ====
Dropout prevents overfitting.
  
For each iteration: for each node, toss a coin (e.g. keep with probability 0.5) and temporarily eliminate the nodes that are not kept.
  
==== Inverted dropout ====

Layer $l=3$.
  
$keep.prob = 0.8$ // probability that a unit will be kept
  
$d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep.prob$
  
$a3 = np.multiply(a3, d3)$ // zero out the dropped activations in layer 3 (equivalent to $a3 *= d3$)
  
$a3 /= keep.prob$ // e.g. with 50 units, ~10 are shut off; dividing by keep.prob keeps the expected value of $a3$ unchanged
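Putting these steps together, a runnable numpy sketch of the inverted dropout forward step for layer 3; the activation shape and keep probability are illustrative assumptions:

<code python>
# Illustrative sketch: inverted dropout applied to the activations of layer 3.
import numpy as np

keep_prob = 0.8                                    # probability that a unit is kept
a3 = np.random.rand(50, 32)                        # activations: 50 units x 32 examples

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # boolean dropout mask
a3 = np.multiply(a3, d3)                           # shut off ~20% of the units
a3 /= keep_prob                                    # scale up so E[a3] stays unchanged

# At test time no units are dropped and no scaling is applied,
# because the expected activations were already preserved during training.
</code>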