  
====== Overfitting & Parameter tuning ======
  
Problem: Sampling error (selection of training data).

  * Target values unreliable?
  * Sampling errors (accidental regularities of particular training cases)
  
  * Approach 1: Get more data
      * More data can also be created by flipping, transforming or distorting existing images (see the sketch after this list).
  * Approach 2: Right capacity: enough to fit true regularities, not enough to fit spurious regularities
  * Approach 3: Average many different models, or train models on different training data (bagging)
  * Approach 4: Bayesian: single NN, but average predictions made by many different weight vectors.

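As a concrete version of "get more data" by transforming images, here is a minimal numpy sketch; the augment helper and the particular transformations are illustrative, not part of the original notes.

<code>
import numpy as np

def augment(image):
    """Create extra training samples from one image (hypothetical helper)."""
    samples = [image]
    samples.append(np.fliplr(image))                       # horizontal flip
    samples.append(np.rot90(image))                        # simple geometric transformation
    noisy = image + np.random.normal(0.0, 0.01, image.shape)
    samples.append(np.clip(noisy, 0.0, 1.0))               # slight distortion via noise
    return samples

# Usage: expand a batch of images before training
# augmented = [s for img in batch for s in augment(img)]
</code>
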
Regularization methods:
  * Weight decay (small weights, simpler model)
  * Weight sharing (same weights)
  * Early stopping ("fake" test set; when performance on it gets worse, stop training)
  * Model averaging
  * Bayesian fitting (like model averaging)
  * Dropout (randomly omit hidden units)
  * Generative pre-training

Solves variance problems (see [[data_mining:error_analysis|Error Analysis]])
  
====== Capacity control ======
  
===== Early stopping =====
Init with small weights. Plot the training error (or $J$) and the dev-set error. Watch performance on the validation set; if performance gets worse, stop training and go back to the point before things got worse.
  
Small weights: logistic units near zero behave like linear units, so the network is similar to a linear network. It has no more capacity than a linear net in which the inputs are directly connected to the output.
  
Downside of early stopping: orthogonalization is no longer possible (optimizing $J$ and trying not to overfit become coupled).
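
A minimal early-stopping loop as a sketch; train_epoch, dev_error and the weight accessors are hypothetical placeholders, not an existing API.

<code>
best_error, best_weights, patience = float("inf"), None, 0

for epoch in range(max_epochs):
    train_epoch(model)                       # one pass over the training data (placeholder)
    error = dev_error(model)                 # error on the held-out dev set (placeholder)
    if error < best_error:
        best_error, best_weights = error, model.get_weights()
        patience = 0
    else:
        patience += 1
        if patience >= 5:                    # performance kept getting worse: stop
            break

model.set_weights(best_weights)              # go back to the point before things got worse
</code>
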
===== Limiting size of weights =====
  
  
==== Weight penalties ====

Also known as $L_1$ and $L_2$ regularization.
  
$L_2$ weight penalty (weight decay).
  * Many weights will be exactly zero
  * Sometimes better to use a weight penalty that has negligible effect on large weights.

== Example for logistic regression: ==

$L_2$ regularization:

E.g. for logistic regression, add to the cost function $J$: $\dots + \frac{\lambda}{2m} ||w||_2^2$, where $||w||_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$

$L_1$ regularization:

$\frac{\lambda}{2m} ||w||_1$

$w$ will be sparse.

Use a **hold-out** set to tune the hyperparameter $\lambda$.

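A small numpy sketch of the $L_2$-regularized logistic-regression cost and gradients described above (function and variable names are illustrative):

<code>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, y, lam):
    """X: (n_x, m), y: (1, m), w: (n_x, 1). Returns L2-regularized cost and gradients."""
    m = X.shape[1]
    a = sigmoid(np.dot(w.T, X) + b)                        # predictions, shape (1, m)
    cross_entropy = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    J = cross_entropy + (lam / (2 * m)) * np.sum(w ** 2)   # + lambda/(2m) * ||w||_2^2
    dw = np.dot(X, (a - y).T) / m + (lam / m) * w          # regularization adds (lambda/m) * w
    db = np.mean(a - y)
    return J, dw, db
</code>
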
== Neural Network ==

Cost function

$J(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]}) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$

Frobenius norm: $||W^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (W_{ij}^{[l]})^2$

For gradient descent:

$dW^{[l]} = \dots + \frac{\lambda}{m} W^{[l]}$

Called **weight decay**: the update multiplies the weights by an additional factor $(1 - \frac{\alpha \lambda}{m})$.

For large $\lambda$, $W^{[l]} \to 0$.

This results in a **simpler** network / each hidden unit has a **smaller effect**.

When $W$ is small, $z$ stays in a small range, so the resulting activation (e.g. tanh) is nearly linear.
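
A small sketch of the Frobenius penalty and the weight-decay gradient term, assuming the weights are stored in a dict keyed by layer (an illustrative convention, not from the notes):

<code>
import numpy as np

def l2_penalty(W, lam, m):
    """lambda/(2m) * sum over layers of ||W[l]||_F^2"""
    return (lam / (2.0 * m)) * sum(np.sum(np.square(Wl)) for Wl in W.values())

def add_weight_decay(dW, W, lam, m):
    """Add the regularization term (lambda/m) * W[l] to every backprop gradient."""
    return {l: dW[l] + (lam / m) * W[l] for l in W}

# Gradient-descent step: the penalty shrinks each W[l] by a factor (1 - alpha*lam/m)
# W[l] = W[l] - alpha * (dW[l] + (lam / m) * W[l])
</code>
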
  
==== Weight constraints ====
  
When a unit hits the limit, the effective weight penalty on all of its weights is determined by the big gradients: much more effective (Lagrange multipliers).
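
A minimal sketch of such a constraint as a max-norm rescaling of each unit's incoming weight vector (the limit value is illustrative):

<code>
import numpy as np

def apply_max_norm(W, limit=3.0):
    """Rescale each row (one unit's incoming weight vector) whose L2 norm exceeds the limit."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)      # one norm per unit
    scale = np.minimum(1.0, limit / (norms + 1e-12))      # only shrink, never grow
    return W * scale

# Usage: apply after every gradient-descent update
# W = apply_max_norm(W - alpha * dW)
</code>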

===== Dropout =====
Ways to combine the outputs of multiple models:
  * MIXTURE: combine models by taking the arithmetic mean of their output probabilities.
  * PRODUCT: combine models by taking the geometric mean of their output probabilities; the result typically sums to less than one, so renormalize (see the sketch below): $\sqrt{p_1 p_2} \,/ \sum \sqrt{p_1 p_2}$
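
A minimal sketch of both combination rules for two predicted class distributions (the numbers are made up for illustration):

<code>
import numpy as np

p1 = np.array([0.7, 0.2, 0.1])       # output probabilities of model 1
p2 = np.array([0.3, 0.5, 0.2])       # output probabilities of model 2

mixture = (p1 + p2) / 2.0            # MIXTURE: arithmetic mean, already sums to 1

product = np.sqrt(p1 * p2)           # PRODUCT: geometric mean, sums to less than 1 ...
product /= product.sum()             # ... so renormalize
</code>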

NN with one hidden layer of $H$ units:
Randomly omit each hidden unit with probability 0.5, for each training sample.
This randomly samples from $2^H$ architectures.

Sampling from $2^H$ models, where each model only gets one training example (extreme bagging).
Sharing of the weights means that every model is very strongly regularized.

What to do at test time?

Use all hidden units, but halve their outgoing weights. This exactly computes the geometric mean of the predictions of all $2^H$ models.

What if we have more hidden layers?

  * Use dropout of 0.5 in every layer.
  * At test time, use the mean net that has all outgoing weights halved (see the sketch below). This is not the same as averaging all separately dropped-out models, but it is an approximation.
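
A minimal sketch of the test-time "mean net" for classic (non-inverted) dropout, assuming tanh hidden layers and a keep probability of 0.5 (all names are illustrative):

<code>
import numpy as np

def mean_net_forward(x, weights, biases, p_keep=0.5):
    """Forward pass with no dropout; weights are scaled by p_keep to approximate the model average."""
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(np.dot(p_keep * W, a) + b)   # halved weights replace sampling many dropped-out nets
    return a
</code>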

Dropout prevents overfitting.

For each iteration: for each node, toss a coin (e.g. with probability 0.5) and eliminate the dropped nodes.

==== Inverted dropout ====

<code>
# Inverted dropout for layer l = 3 (numpy)
keep_prob = 0.8                                            # probability that a unit will be kept

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean dropout mask
a3 = np.multiply(a3, d3)                                   # activations in layer 3: shut off dropped units
a3 /= keep_prob                                            # e.g. 50 units => ~10 units shut off; scale the rest up

z4 = np.dot(W4, a3) + b4                                   # a3 was reduced by ~20% => dividing by 0.8 keeps its expected value the same
</code>
Making predictions at test time: no dropout (and, thanks to the scaling during training, no extra rescaling is needed).

**Why does it work?**

Intuition: a unit can't rely on any one feature, so it has to spread out its weights => weights shrink.

Different keep_prob values can be set for different layers (e.g. smaller for layers with a lot of parameters).

Used in computer vision.

Downside: $J$ is no longer well-defined, so checking training progress is problematic (workaround: set keep_prob to one and verify that $J$ decreases).

  
===== Noise =====