Regularization

Addresses variance (overfitting) problems (see Error Analysis). Overfitting arises because the training data contains noise, e.g.:

  • Unreliable target values (label noise)
  • Sampling error (accidental regularities of the particular training cases)

Regularization Methods:

  • Weight decay (keep weights small ⇒ simpler model)
  • Weight sharing (constrain groups of weights to be equal)
  • Early stopping (hold out a "fake" test set; stop training when performance on it gets worse)
  • Model averaging
  • Bayesian fitting (similar in effect to model averaging)
  • Dropout (randomly omit hidden units)
  • Generative pre-training

$L_1$ and $L_2$ regularization

Example for logistic regression:

$L_2$ regularization:

E.g. for logistic regression, add to the cost function $J$: $\dots + \frac{\lambda}{2m} ||w||^2_2$, where $||w||^2_2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$

$L_1$ regularization:

$\frac{\lambda}{2m} ||w||_1 = \frac{\lambda}{2m} \sum_{j=1}^{n_x} |w_j|$

$w$ will be sparse (many weights end up exactly zero).

Use a hold-out (validation) set to tune the regularization hyperparameter $\lambda$.
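
A minimal sketch (assuming numpy; the names w, b, X, Y, lambd and the shapes in the comments are hypothetical) of how the $L_2$ penalty enters the logistic-regression cost and gradient:

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def l2_regularized_cost_and_grad(w, b, X, Y, lambd):
      # Hypothetical shapes: X (n_x, m), Y (1, m), w (n_x, 1), b scalar.
      m = X.shape[1]
      A = sigmoid(w.T @ X + b)                                 # predictions
      cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
      J = cross_entropy + (lambd / (2 * m)) * np.sum(w ** 2)   # + lambda/(2m) * ||w||_2^2
      dw = X @ (A - Y).T / m + (lambd / m) * w                 # penalty adds (lambda/m) * w
      db = np.sum(A - Y) / m
      return J, dw, db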

Neural Network

Cost function

$J(\dots) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L ||W^{[l]}||_F^2$

Frobenius norm: $||W^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \left( w_{ij}^{[l]} \right)^2$ (sum of the squares of all entries)
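
A minimal sketch (assuming numpy and a hypothetical parameters dict with keys W1, W2, …) of adding the Frobenius-norm penalty, summed over all layers, to an already computed cross-entropy cost:

  import numpy as np

  def l2_regularized_cost(cross_entropy_cost, parameters, lambd, m):
      # parameters: hypothetical dict {"W1": ..., "b1": ..., "W2": ..., ...}
      L = sum(1 for k in parameters if k.startswith("W"))
      frobenius_sum = sum(np.sum(parameters["W" + str(l)] ** 2) for l in range(1, L + 1))
      return cross_entropy_cost + (lambd / (2 * m)) * frobenius_sum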

For gradient descent:

$dW^{[l]} = \dots + \frac{\lambda}{m} W^{[l]}$

This is called weight decay: each update multiplies the weights by an additional factor slightly less than 1.
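
Plugging this gradient into the gradient-descent update (using the formulas above) shows where the name comes from:

$W^{[l]} := W^{[l]} - \alpha \, dW^{[l]} = \left(1 - \frac{\alpha \lambda}{m}\right) W^{[l]} - \alpha \, (\text{backprop term})$

So every update first shrinks ("decays") the weights by the factor $\left(1 - \frac{\alpha \lambda}{m}\right) < 1$ before taking the usual gradient step.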

Large $\lambda$: weights are pushed toward zero, so $z$ stays in a small range of values; with a tanh activation function that is the roughly linear region, so every layer behaves approximately linearly and the whole network becomes closer to a linear (simpler) model.
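
A quick numerical illustration of the tanh point (assuming numpy): for small $z$, $\tanh(z) \approx z$, so a layer whose $z$ stays in a small range behaves almost linearly.

  import numpy as np

  z = np.array([-0.1, -0.01, 0.01, 0.1])
  print(np.tanh(z))   # ≈ [-0.0997, -0.01, 0.01, 0.0997], i.e. tanh(z) ≈ z for small z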

Ways to combine output of multiple models:

  • MIXTURE: combine models by averaging their output probabilities (arithmetic mean).
  • PRODUCT: combine models by the geometric mean of their output probabilities, e.g. $\sqrt{p_1 \cdot p_2}$ for two models; because the result typically sums to less than one, renormalize by dividing by the sum (see the sketch below).
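
A minimal sketch (assuming numpy; the probability vectors p1, p2 are hypothetical) of both ways of combining two models' predicted class probabilities:

  import numpy as np

  p1 = np.array([0.7, 0.2, 0.1])          # hypothetical predictions of model 1
  p2 = np.array([0.5, 0.4, 0.1])          # hypothetical predictions of model 2

  mixture = (p1 + p2) / 2                 # arithmetic mean, already sums to 1

  product = np.sqrt(p1 * p2)              # geometric mean, sums to < 1
  product /= product.sum()                # renormalize to get a distribution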

Consider an NN with one hidden layer. For each training sample, randomly omit each hidden unit with probability 0.5. This amounts to randomly sampling from $2^H$ architectures (where $H$ is the number of hidden units).

We sample from $2^H$ models, and each model only gets one training example (extreme bagging). The sharing of the weights means that every model is very strongly regularized.

What to do at test time?

Use all hidden units, but halve their outgoing weights. For a single hidden layer with a softmax output, this exactly computes the (renormalized) geometric mean of the predictions of all $2^H$ models (see the sketch below).
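
A minimal numerical check of this claim (assuming numpy; the toy sizes and random weights are hypothetical; tanh hidden units, softmax output): enumerate all $2^H$ dropout masks, take the renormalized geometric mean of their predictions, and compare with the net that uses all hidden units with halved outgoing weights.

  import itertools
  import numpy as np

  rng = np.random.default_rng(0)
  n_in, H, n_out = 3, 4, 2                       # hypothetical toy sizes
  x = rng.normal(size=n_in)
  W1, b1 = rng.normal(size=(H, n_in)), rng.normal(size=H)
  W2, b2 = rng.normal(size=(n_out, H)), rng.normal(size=n_out)

  def softmax(z):
      e = np.exp(z - z.max())
      return e / e.sum()

  h = np.tanh(W1 @ x + b1)                       # hidden activations (dropout only on hidden units)

  # Renormalized geometric mean of the predictions of all 2^H sub-nets.
  log_p = [np.log(softmax(W2 @ (h * np.array(mask)) + b2))
           for mask in itertools.product([0, 1], repeat=H)]
  geo = np.exp(np.mean(log_p, axis=0))
  geo /= geo.sum()                               # geometric mean sums to < 1

  mean_net = softmax(0.5 * W2 @ h + b2)          # all units, outgoing weights halved
  print(np.allclose(geo, mean_net))              # True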

What if we have more hidden layers?

  • Use a dropout probability of 0.5 in every layer.
  • At test time, use the "mean net" that has all outgoing weights halved. This is not exactly the same as averaging all the separately dropped-out models, but it is a good approximation.

Dropout prevents overfitting.

For each training example: toss a coin for each node (e.g. with probability 0.5) and eliminate the dropped nodes for that example.

Example: inverted dropout for layer $l=3$:

  import numpy as np

  keep_prob = 0.8                                              # probability of keeping a unit
  d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob    # dropout mask for layer 3
  a3 = np.multiply(a3, d3)                                     # shut off ~20% of the units
  a3 /= keep_prob                                              # inverted dropout: scale back up

E.g. with 50 units, about 10 units are shut off, so $z = Wa + b$ in the next layer would be reduced by about 20%; dividing by keep_prob = 0.8 compensates, so the expected value of $a^{[3]}$ stays the same.
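
A quick numerical check (assuming numpy; the activations a3 are hypothetical random values) that inverted dropout keeps the expected activation unchanged:

  import numpy as np

  rng = np.random.default_rng(0)
  a3 = rng.random((50, 10000))                  # hypothetical activations: 50 units, many examples
  keep_prob = 0.8
  d3 = rng.random(a3.shape) < keep_prob         # dropout mask
  a3_train = a3 * d3 / keep_prob                # inverted dropout during training
  print(a3.mean(), a3_train.mean())             # nearly equal: no rescaling needed at test time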

Making predictions at test time: no dropout (and no rescaling, since inverted dropout already divided by keep_prob during training, so expected activations match).

Why does it work?

Intuition: a unit can't rely on any single feature (it might be dropped), so it spreads out its weights over its inputs ⇒ this shrinks the weights (an effect similar to $L_2$ regularization).

Different keep_prob values can be set for different layers (e.g. a lower keep_prob for layers with many parameters, which are more prone to overfitting).

Dropout is commonly used in computer vision, where models have many parameters relative to the available training data, so overfitting is frequent.

Downside: the cost function $J$ is no longer well-defined, so checking that it decreases monotonically during training becomes problematic (workaround: first set keep_prob to 1 to verify that $J$ decreases, then turn dropout on).
