Differences

This shows you the differences between two versions of the page.

--- data_mining:neural_network:overfitting [2017/08/19 22:21] – [Weight penalites] phreazer
+++ data_mining:neural_network:overfitting [2018/05/10 18:03] (current) – [Inverted dropout] phreazer
@@ Line 8: / Line 8: @@
   * Approach 1: Get more data
+      * More data: flipping, transforming, or distorting images.
   * Approach 2: Right capacity: Enough to fit true regularities, not enough to fit spurious regularities
   * Approach 3: Average many different forms. Or train model on different training data (bagging)
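The flipping idea under Approach 1 could be sketched as follows. This is a minimal illustration, assuming images are stored as a NumPy array of shape (n, height, width); the function name `augment_with_flips` is hypothetical and not part of the original notes.

```python
import numpy as np

def augment_with_flips(images):
    """Return the original images plus horizontally flipped copies.

    images: array of shape (n, height, width) (assumed layout).
    """
    flipped = images[:, :, ::-1]  # reverse the width axis of every image
    return np.concatenate([images, flipped], axis=0)

# Two tiny 2x3 "images" for illustration
images = np.arange(12).reshape(2, 2, 3)
augmented = augment_with_flips(images)
```

The labels would need to be duplicated in the same order, which is only valid when flipping does not change the class.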
@@ Line 20: / Line 21: @@
   * Dropout (randomly omit hidden units)
   * Generative pre-training
+These approaches address variance problems (see [[data_mining:error_analysis|Error Analysis]]).
 ====== Capacity control ======
@@ Line 40: / Line 43: @@
 ===== Early stopping =====
-Init with small weights. Watch performance on validation set. If performance gets worse, stop it. Get back to the point before things got worse.
+Initialize with small weights. Plot the training error (or cost $J$) and the dev set error. Watch performance on the validation set; if it gets worse, stop training and go back to the point before things got worse.
 Small weights: logistic units near zero behave like linear units, so the network is similar to a linear network. It has no more capacity than a linear net in which inputs are directly connected to the output.
+Downside of early stopping: orthogonalization is no longer possible (optimizing $J$ and trying not to overfit can no longer be handled as separate tasks).
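The stopping rule above can be sketched as a small helper. This is only an illustration: the `patience` parameter and the function name are assumptions (the notes simply say to stop once dev performance gets worse), and `dev_errors` stands in for errors recorded after each epoch.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return the index of the epoch whose weights should be restored.

    dev_errors: dev-set error after each epoch (hypothetical input).
    patience: number of non-improving epochs tolerated before stopping
              (an assumption, not part of the original notes).
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # dev error has not improved for `patience` epochs
    return best_epoch

dev_errors = [0.50, 0.40, 0.35, 0.36, 0.38, 0.41, 0.30]
best = early_stopping_epoch(dev_errors, patience=3)
```

With a small `patience`, training stops before the late improvement at the last epoch is seen, which illustrates how early stopping couples optimization and regularization.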
 ===== Limiting size of weights =====
@@ Line 76: / Line 80: @@
 $w$ will be sparse.
-Use hold-out test set to set hyperparameter.
+Use a **hold-out** validation set to set the hyperparameter.
 == Neural Network ==
@@ Line 82: / Line 86: @@
 Cost function
-$J(\dots) = 1/m \sum_{i=1}^n L(\hat{y}^{i}, y^{i}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$
+$J(W^{[l]},b^{[l]}) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L || W^{[l]} ||_F^2$
 Frobenius Norm: $||W^{[l]}||_F^2$
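As a rough numeric sketch of the regularized cost, assuming the per-example losses have already been computed: the names `losses`, `weights`, and `lam` are illustrative, and the squared Frobenius norm is taken as the sum of squared matrix entries.

```python
import numpy as np

def l2_regularized_cost(losses, weights, lam):
    """Mean loss plus (lam / 2m) times the sum of squared Frobenius norms.

    losses: per-example loss values, shape (m,)
    weights: list of weight matrices, one per layer
    lam: regularization strength (lambda)
    """
    m = len(losses)
    data_term = np.sum(losses) / m
    # Squared Frobenius norm of a matrix = sum of its squared entries
    frobenius_sq = sum(np.sum(np.square(W)) for W in weights)
    return data_term + (lam / (2 * m)) * frobenius_sq

losses = np.array([0.2, 0.4])
weights = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([[1.0, 1.0]])]
cost = l2_regularized_cost(losses, weights, lam=0.1)
```

Here the data term is 0.3 and the penalty is 0.1 / 4 * 32 = 0.8.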
@@ Line 92: / Line 96: @@
 Called **weight decay** (additional multiplication with weights)
-Large $\lambda$: Every layer ~ linear; z small range of values (in case of tanh activation fct)
+For large $\lambda$, $W^{[l]} \to 0$.
+This results in a **simpler** network: each hidden unit has a **smaller effect**.
+When $W$ is small, $z$ takes a smaller range of values, so the activation (e.g. tanh) stays in its nearly linear region.
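To illustrate why the penalty is called weight decay, here is a minimal sketch of one gradient step, assuming plain gradient descent with learning rate `alpha` and a backprop gradient `dW` of the unregularized loss (both names are assumptions). The gradient of the penalty term adds $\frac{\lambda}{m} W$, which amounts to multiplying the weights by a factor slightly below one.

```python
import numpy as np

def weight_decay_step(W, dW, lam, m, alpha):
    """One gradient descent step on the L2-regularized cost.

    Equivalent to W - alpha * (dW + (lam / m) * W), rewritten so the
    multiplicative shrinkage of W is explicit.
    """
    return (1 - alpha * lam / m) * W - alpha * dW

W = np.array([[1.0, -2.0]])
dW = np.zeros_like(W)  # zero data gradient isolates the decay effect
W_new = weight_decay_step(W, dW, lam=0.5, m=10, alpha=0.1)

# For small inputs tanh is close to the identity, i.e. nearly linear
small_z = 0.05
tanh_gap = abs(np.tanh(small_z) - small_z)
```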
 ==== Weights constraints ====
@@ Line 104: / Line 111: @@
 When a unit hits its limit, the effective weight penalty on all of its weights is determined by the big gradients: much more effective (Lagrange multipliers).
+===== Dropout =====
+Ways to combine output of multiple models:
+  * MIXTURE: Combine models by averaging their output probabilities.
+  * PRODUCT: Combine models by the geometric mean of their output probabilities (typically sums to less than one, so renormalize): $\sqrt{x \cdot y} / \sum$
+NN with one hidden layer.
+Randomly omit each hidden unit with probability 0.5, for each training sample.
+Randomly sampling from $2^H$ architectures.
+Sampling from $2^H$ models, where each model only gets one training example (extreme bagging).
+Sharing of the weights means that every model is very strongly regularized.
+What to do at test time?
+Use all hidden units, but halve their outgoing weights. This exactly computes the geometric mean of the predictions of all 2^H models.
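The geometric-mean claim can be checked numerically for a small case. The sketch below assumes a softmax output layer on top of one hidden layer (an assumption; the notes do not name the output unit) and treats the hidden activations `h` as fixed for a single input. All names and sizes are illustrative.

```python
import itertools
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
H, K = 4, 3                   # hidden units, output classes (arbitrary)
h = rng.random(H)             # fixed hidden activations for one input
W = rng.normal(size=(K, H))   # hidden-to-output weights
b = rng.normal(size=K)

# Predictions of all 2^H thinned networks, one per dropout mask
log_probs = []
for mask in itertools.product([0.0, 1.0], repeat=H):
    log_probs.append(np.log(softmax(W @ (np.array(mask) * h) + b)))

# Geometric mean of the predictions, renormalized to sum to one
geo_mean = np.exp(np.mean(log_probs, axis=0))
geo_mean /= geo_mean.sum()

# "Mean net": all hidden units active, outgoing weights halved
mean_net = softmax((0.5 * W) @ h + b)
```

Under these assumptions `geo_mean` and `mean_net` agree up to floating-point error, because averaging the logits over all masks halves each hidden unit's contribution.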
+What if we have more hidden Layers?
+  * Use dropout of 0.5 in every layer.
+  * At test time, use the mean net, which has all outgoing weights halved. This is not the same as averaging all separate dropped-out models, but it is an approximation.
+Dropout prevents overfitting.
+For each iteration: for each node, toss a coin (e.g. with probability 0.5) and eliminate the node accordingly.
+==== Inverted dropout ====
+<code>
+import numpy as np
+
+# Inverted dropout for layer l = 3; a3 is assumed to hold that layer's activations
+keep_prob = 0.8  # probability that a unit will be kept
+d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # dropout mask
+a3 = np.multiply(a3, d3)  # zero the dropped activations in layer 3 (a3 *= d3)
+# e.g. 50 units => on average 10 units shut off
+a3 /= keep_prob  # scale up so the expected value of a3 stays the same
+# Without the division, z4 = W4 a3 + b4 would be reduced by about 20%
+</code>
+Making predictions at test time: no dropout.
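A self-contained sketch of the train/test behaviour described here, assuming a helper named `inverted_dropout` and a `training` flag (both hypothetical, introduced only for illustration). It shows that scaling by the keep probability leaves the expected activation unchanged, so nothing needs to be rescaled at test time.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng, training=True):
    """Apply inverted dropout to activations `a` during training only."""
    if not training:
        return a  # test time: no dropout, no rescaling
    mask = rng.random(a.shape) < keep_prob  # True where a unit is kept
    return a * mask / keep_prob             # rescale the kept units

rng = np.random.default_rng(0)
a3 = np.ones((50, 10000))  # dummy activations: 50 units, 10000 examples
a3_train = inverted_dropout(a3, 0.8, rng)
a3_test = inverted_dropout(a3, 0.8, rng, training=False)
```

On this dummy input the mean of `a3_train` stays close to 1, while roughly 20% of its entries are zero.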
+**Why does it work?**
+Intuition: a unit can't rely on any one feature, so it spreads out its weights, which shrinks the weights.
+Different keep probabilities can be set for different layers (e.g. a lower one for layers with a lot of parameters).
+Used in computer vision.
+Downside: $J$ is no longer well-defined, so checking that it decreases is problematic (e.g. one can set the keep probability to one for that check).
 ===== Noise =====