Overfitting & Parameter tuning

Problem: Sampling error (selection of training data).

Regularization Methods:

Solve variance problems (see Error Analysis).

Capacity control

⇒ Typically a combination of several methods is used.

Meta-parameters

E.g. the number of hidden units or the size of the weight penalty.

Cross-validation:

N-fold cross-validation (N estimates are not independent).
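A minimal sketch of N-fold cross-validation with NumPy; the `model_factory` with `fit`/`score` methods and the arrays `X`, `y` are assumptions for illustration, not part of the notes:

```python
import numpy as np

def n_fold_cv_scores(model_factory, X, y, n_folds=5, seed=0):
    """Estimate generalization performance by averaging over N held-out folds.
    Note: the N estimates are not independent, since the training sets overlap."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))               # shuffle once, then split into N folds
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = model_factory()                  # fresh model for every fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[val_idx], y[val_idx]))
    return np.mean(scores), np.std(scores)
```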

Early stopping

Initialize with small weights. Plot the training cost $J$ and the dev set error. Watch performance on the validation set; when it starts getting worse, stop training and go back to the parameters from just before performance deteriorated.

Small weights: logistic units near zero behave like linear units, so the network is similar to a linear network. It has no more capacity than a linear net in which the inputs are directly connected to the outputs.

Downside of early stopping: orthogonalization is no longer possible (you are optimizing $J$ and trying not to overfit at the same time).
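A minimal sketch of the early-stopping loop described above; `train_one_epoch`, `evaluate`, and the parameter accessors `get_params`/`set_params` are hypothetical stand-ins for whatever training code is used:

```python
import copy

def train_with_early_stopping(model, train_data, dev_data, max_epochs=200, patience=10):
    """Stop when dev error stops improving and roll back to the best parameters seen."""
    best_dev_error = float("inf")
    best_params = copy.deepcopy(model.get_params())     # hypothetical accessor
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)               # hypothetical training step
        dev_error = evaluate(model, dev_data)            # hypothetical dev-set error
        if dev_error < best_dev_error:
            best_dev_error = dev_error
            best_params = copy.deepcopy(model.get_params())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # performance kept getting worse
                break
    model.set_params(best_params)   # go back to the point before things got worse
    return best_dev_error
```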

Limiting size of weights

Idea: add a penalty to prevent the weights from getting too large.

Weight penalties

Aka $L_1$ and $L_2$ regularization.

L2 weight penalty (weight decay).

$C = E + \frac{\lambda}{2}\sum_i w_i^2$

Prevents the network from using weights that it does not need.

L1 weight penalty

Example for logistic regression:

$L_2$ regularization:

E.g. for logistic regression, add $\frac{\lambda}{2m} ||w||^2_2$ to the cost function $J$, where $||w||^2_2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$.

$L_1$ regularization:

$\frac{\lambda}{2m} ||w||_1$

$w$ will be sparse.

Use a hold-out dev set to set the hyperparameter $\lambda$.
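A small NumPy sketch of the regularized logistic regression cost, following the formulas above; the function name and array conventions are illustrative, not from the notes:

```python
import numpy as np

def logistic_cost(w, b, X, y, lam, reg="l2"):
    """Cross-entropy cost plus an L2 or L1 weight penalty.
    X: (n_x, m) inputs, y: (m,) labels in {0, 1}, w: (n_x,) weights."""
    m = X.shape[1]
    z = w @ X + b
    a = 1.0 / (1.0 + np.exp(-z))                       # sigmoid activations
    eps = 1e-12                                         # avoid log(0)
    cross_entropy = -np.mean(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps))
    if reg == "l2":
        penalty = (lam / (2 * m)) * np.sum(w ** 2)      # (lambda / 2m) * ||w||_2^2
    else:
        penalty = (lam / (2 * m)) * np.sum(np.abs(w))   # (lambda / 2m) * ||w||_1
    return cross_entropy + penalty
```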

Neural Network

Cost function

$J(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]})= \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L || W^{[l]} ||_F^2$

Frobenius norm: $||W^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \big(w_{ij}^{[l]}\big)^2$

For gradient descent:

$dW^{[l]} = \dots + \frac{\lambda}{m} W^{[l]}$

Called weight decay: the update multiplies the weights by an extra factor of $(1 - \frac{\alpha \lambda}{m})$ before the usual gradient step.

For large $\lambda$, $W^{[l]} \to 0$.

This results in a simpler network / each hidden unit has smaller effect.

When $W$ is small, $z$ stays in a smaller range, so the activation (e.g. tanh) operates in its roughly linear region.
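A sketch of how the Frobenius penalty shows up in the cost and in the gradient update for one layer; $\alpha$ (the learning rate) and the unregularized gradient `dW_unreg` from backprop are assumed inputs:

```python
import numpy as np

def l2_cost_term(weights, lam, m):
    """Add (lambda / 2m) * sum_l ||W[l]||_F^2 to the unregularized cost."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def update_with_weight_decay(W, dW_unreg, lam, m, alpha):
    """Gradient step with the extra (lambda/m) * W term, i.e. weight decay:
    W <- (1 - alpha*lambda/m) * W - alpha * dW_unreg."""
    dW = dW_unreg + (lam / m) * W
    return W - alpha * dW
```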

Weight constraints

Constraint on the maximum squared length of the incoming weight vector of each unit (not on individual weights).

Easier to set a sensible value. Prevents hidden units from getting stuck near zero. Prevents weights from exploding.

When a unit hits its limit, the effective penalty on all of its weights is determined by the big gradients: much more effective than a fixed penalty (cf. Lagrange multipliers).
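A sketch of a max-norm constraint applied after each gradient update: the squared length of each unit's incoming weight vector is capped rather than penalizing single weights. The row/column convention is an assumption for illustration:

```python
import numpy as np

def apply_max_norm(W, max_sq_length):
    """Rescale each unit's incoming weight vector if its squared L2 norm exceeds the limit.
    Assumes W has shape (n_out, n_in), so row i holds the incoming weights of unit i."""
    sq_norms = np.sum(W ** 2, axis=1, keepdims=True)            # squared length per unit
    scale = np.sqrt(max_sq_length / np.maximum(sq_norms, max_sq_length))
    return W * scale                                             # scale is 1 where the constraint holds
```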

Dropout

One view: an efficient way to combine the outputs of very many models.

Consider a NN with one hidden layer of $H$ units. For each training sample, randomly omit each hidden unit with probability 0.5. This amounts to randomly sampling from $2^H$ different architectures.

We sample from $2^H$ models, and each sampled model typically gets only one training example (a form of extreme bagging). The sharing of weights between the models means that every model is very strongly regularized.

What to do at test time?

Use all hidden units, but halve their outgoing weights. This exactly computes the geometric mean of the predictions of all $2^H$ models.

What if we have more hidden layers?

* Use dropout of 0.5 in every layer.
* At test time, use the "mean net" that has all outgoing weights halved. This is not exactly the same as averaging all the separate dropped-out models, but it is a good approximation.

Dropout prevents overfitting.

For each iteration: for each node, toss a coin (e.g. with probability 0.5) and eliminate the node if it loses the toss.

Inverted dropout

Layer $l=3$:

keep_prob = 0.8  # probability that a unit will be kept

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # dropout mask for layer 3

a3 = np.multiply(a3, d3)  # zero out dropped activations (equivalently: a3 *= d3)

a3 /= keep_prob  # inverted dropout: e.g. with 50 units, about 10 are shut off

z4 = np.dot(W4, a3) + b4  # a3 was reduced by ~20%; dividing by keep_prob = 0.8 keeps the expected value the same

Making predictions at test time: no dropout (and no extra scaling needed, because of the inverted scaling during training).
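Putting the lines above together: a minimal sketch of one inverted-dropout layer with a train/test switch. The function name `dropout_layer` is illustrative; layer numbering and shapes follow the snippet above:

```python
import numpy as np

def dropout_layer(a, keep_prob, training):
    """Inverted dropout: scale kept activations by 1/keep_prob at training time,
    so that no rescaling is needed at test time."""
    if not training:
        return a                                    # test time: use all units, no dropout
    mask = np.random.rand(*a.shape) < keep_prob     # keep each unit with probability keep_prob
    return a * mask / keep_prob                     # zero out dropped units and rescale

# Training-time usage for layer 3 (W4, b4, a3 assumed to exist):
# a3 = dropout_layer(a3, keep_prob=0.8, training=True)
# z4 = np.dot(W4, a3) + b4
```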

Why does it work?

Intuition: a unit can't rely on any one feature, so it has to spread out its weights ⇒ this shrinks the weights (an effect similar to L2 regularization).

Different keep_prob values can be set for different layers (e.g. a lower keep_prob for layers with many parameters).

Used in computer vision.

Downside: $J$ is no longer well-defined, which makes checking that the cost decreases problematic (workaround: set keep_prob to 1 while debugging).

Noise

Use noise in the activities as a regularizer.

Bayesian Approach

Likelihood term takes into account how probable the observed data is given the parameters of the model.

Favors parameters that make data likely.

Frequentist answer (maximum likelihood): pick the value of $p$ that makes the observation of 53 heads and 47 tails most probable.

This value is $p = 0.53$.

$P(D) = p^{53} (1-p)^{47}$

$\frac{dP(D)}{dp} = 0$ for $p = 0.53$
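Working out the derivative makes the maximum explicit:

$\frac{dP(D)}{dp} = 53\,p^{52}(1-p)^{47} - 47\,p^{53}(1-p)^{46} = p^{52}(1-p)^{46}\big(53(1-p) - 47p\big)$

Setting the bracketed factor to zero gives $53 = 100p$, i.e. $p = 0.53$.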

Bayesian answer: instead of a single value, keep a full distribution over $p$.

Start with a prior distribution over $p$ (e.g. a uniform distribution). Multiply the prior probability of each parameter value by the probability of observing a head given that value. Then rescale all of the probability densities so that their integral comes to 1. This is the posterior distribution.
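A small numerical sketch of the same idea: evaluate prior × likelihood on a grid of $p$ values and renormalize to get the posterior (using the 53 heads / 47 tails data from above; grid resolution is an arbitrary choice):

```python
import numpy as np

heads, tails = 53, 47
p_grid = np.linspace(0.001, 0.999, 999)                    # candidate values of p
prior = np.ones_like(p_grid)                               # uniform prior over p
likelihood = p_grid ** heads * (1 - p_grid) ** tails       # probability of the data given p
posterior = prior * likelihood
posterior /= posterior.sum() * (p_grid[1] - p_grid[0])     # rescale so the density integrates to 1
print("posterior mode:", p_grid[np.argmax(posterior)])     # close to the ML estimate 0.53
```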

Supervised Maximum Likelihood Learning

Finding the weight vector that minimizes the squared residuals ⇔ finding the weight vector that maximizes the log probability density of the correct answer.

Assumption: the answer is generated by adding Gaussian noise to the output of the neural network.

First run the neural net on the input, then add Gaussian noise to its output.

$y_c = f(input_c, W)$

Probability density of the target value given the net's output plus Gaussian noise:

$p(t_c|y_c) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(t_c - y_c)^2}{2\sigma^2}}$

Gaussian distribution centered at the net's output.

$-\log p(t_c|y_c) = k + \frac{(t_c - y_c)^2}{2\sigma^2}$
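Taking the negative log of the Gaussian density above makes the constant explicit:

$-\log p(t_c|y_c) = \log(\sqrt{2\pi}\,\sigma) + \frac{(t_c - y_c)^2}{2\sigma^2}$

so $k = \log(\sqrt{2\pi}\,\sigma)$ does not depend on the weights, and maximizing the log probability is the same as minimizing the squared error $(t_c - y_c)^2$ for fixed $\sigma$.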

Maximum a Posteriori

The full Bayesian approach finds the posterior distribution over all possible weight vectors ⇒ far too many combinations to be computationally feasible.

Instead, MAP tries to find just the most probable weight vector.

$\dots$

MacKay's quick and dirty method of fixing weight costs

Interpret the weight penalty as the ratio of two variances (output noise variance over weight prior variance).

After fitting a model that minimizes the squared error, we can find the best value for the output noise variance.

The best value is the one that maximizes the probability of producing exactly the correct answers after adding Gaussian noise to the output: it is the variance of the residual errors.

Empirical Bayes: use the data itself to decide the prior, i.e. set the variance of the Gaussian prior over the weights to whatever makes the learned weights most likely.

QD-method:

1. Start with guesses for the noise variance and the weight prior variance.
2. Do some learning, trying to improve the weights.
3. Re-estimate the noise variance as the variance of the residual errors, and the weight prior variance as the variance of the actual learned weights.
4. Go back to step 2.
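A toy sketch of that loop for a linear model, where ridge regression stands in for "minimize squared error plus weight penalty" and the ratio of the two re-estimated variances gives the penalty strength; the setup (linear model, closed-form solve) is an illustrative assumption:

```python
import numpy as np

def mackay_quick_and_dirty(X, y, n_iters=20, noise_var=1.0, prior_var=1.0):
    """Alternate between fitting with the current penalty lambda = noise_var / prior_var
    and re-estimating the two variances from the residuals and the learned weights."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        lam = noise_var / prior_var                                  # penalty as a ratio of two variances
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)      # ridge (penalized least squares) solution
        residuals = y - X @ w
        noise_var = np.mean(residuals ** 2)                          # variance of the residual errors
        prior_var = np.mean(w ** 2)                                  # variance of the learned weights
    return w, noise_var, prior_var
```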