Overfitting & Parameter tuning

Problem: Sampling error (selection of training data).

Target values may be unreliable.
Sampling error (accidental regularities of the particular training cases that were chosen)

Approach 1: Get more data
Approach 2: Use the right capacity: enough to fit the true regularities, not enough to fit the spurious ones
Approach 3: Average many different models (e.g. models of different forms). Or train the model on different subsets of the training data (bagging)
Approach 4: Bayesian: Use a single NN but average the predictions made by many different weight vectors.

Regularization Methods:

Weight decay (small weights, simpler model)
Weight-sharing (same weights)
Early-stopping (Hold out a fake test set; when performance on it gets worse, stop training)
Model-Averaging
Bayes fitting (like model averaging)
Dropout (Randomly omit hidden units)
Generative pre-training

Architecture: Number of hidden layers, number of units per layer
Early stopping: Start with small weights and stop learning before it overfits.
Weight-decay: Penalize large weights using penalties/constraints on their squares/absolute values
Noise: Add noise to weights or activities

⇒ Typically combination

Meta parameters: e.g. the number of hidden units or the size of the weight penalty.

Cross-validation:

Training data: Used for learning parameters of the model
Validation data: Not used for learning, but used for deciding what settings of the meta parameters work best.
Test data: Used for a final unbiased estimate of how well the network works

N-fold cross-validation (N estimates are not independent).
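
A minimal sketch of how the N folds could be built, assuming the cases are simply indexed 0..n-1 and split into contiguous blocks (the function name `n_fold_splits` is made up for this illustration). Each case is used for validation exactly once. The training sets of different folds overlap heavily, which is why the N estimates are not independent.

```python
def n_fold_splits(n_cases, n_folds):
    """Yield (train_indices, validation_indices) for each of the N folds."""
    indices = list(range(n_cases))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_cases // n_folds + (1 if i < n_cases % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


splits = list(n_fold_splits(10, 3))
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

The test data stays outside of this procedure; only the training and validation roles rotate.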

Initialize with small weights. Watch the performance on the validation set. If the performance gets worse, stop training and go back to the point before things got worse.
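
The procedure can be sketched as a loop that remembers the best weights seen so far. This is only an illustration: `train_step`, `validation_error` and the `patience` parameter (how many bad epochs to tolerate before stopping) are assumptions, not part of the notes, and the toy "model" below is a single number whose validation error first falls and then rises.

```python
import copy


def train_with_early_stopping(weights, train_step, validation_error,
                              max_epochs=100, patience=3):
    """Train until the validation error stops improving; return the best weights."""
    best_weights = copy.deepcopy(weights)
    best_error = validation_error(weights)
    bad_epochs = 0
    for _ in range(max_epochs):
        weights = train_step(weights)
        error = validation_error(weights)
        if error < best_error:
            # Remember the point before things get worse.
            best_error = error
            best_weights = copy.deepcopy(weights)
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_weights, best_error


# Toy stand-in for a network: each "epoch" grows the weight by 0.1,
# and the validation error is smallest at w = 0.5.
def train_step(w):
    return [w[0] + 0.1]


def validation_error(w):
    return (w[0] - 0.5) ** 2


best_weights, best_error = train_with_early_stopping([0.0], train_step, validation_error)
print(best_weights, best_error)
```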

Small weights: logistic units near zero behave like linear units, so the network is similar to a linear network. It has no more capacity than a linear net in which the inputs are directly connected to the outputs.
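
A quick numerical check of the "near zero, logistic units are linear" claim. The sketch compares the logistic function with its first-order Taylor expansion around 0, which is 0.5 + x/4; the helper names are made up for this example.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear_approx(x):
    # First-order Taylor expansion around 0: logistic(0) = 1/2, slope = 1/4.
    return 0.5 + x / 4.0


small_gap = abs(logistic(0.1) - linear_approx(0.1))  # small total input
large_gap = abs(logistic(3.0) - linear_approx(3.0))  # large total input
print(small_gap, large_gap)
```

With small weights the total input stays in the region where the gap is tiny; as the weights grow, the units leave the linear region and the network gains capacity.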

Idea: Penalty to prevent weights from getting too large.

L2 weight penalty (weight decay).

Keeps weights small unless they have big error derivatives.

$C = E + \frac{\lambda}{2}\sum_i w_i^2$

Prevents network from using weights that it does not need.
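
Differentiating the cost above makes the link between weight size and error derivative explicit:

$\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i$

$\frac{\partial C}{\partial w_i} = 0 \Rightarrow w_i = -\frac{1}{\lambda}\frac{\partial E}{\partial w_i}$

So at a minimum of $C$, a weight can only be large if its error derivative is large as well.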

L1 weight penalty

Many weights will be exactly zero
Sometimes it is better to use a weight penalty that has a negligible effect on large weights.

Constraint on maximum squared length of the incoming weight vector of each unit (not single weight).

Scale down vector of incoming weights to allowed length (if length was exceeded).

Easier to set a sensible value. Prevents hidden units from getting stuck near zero. Prevents weights from exploding.

When a unit hits its limit, the effective weight penalty on all of its weights is determined by the big gradients. This is much more effective than a fixed penalty (the penalties act like Lagrange multipliers).
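
A minimal sketch of the rescaling step, assuming it is applied to the incoming weight vector of one unit after each weight update. The function name and the parameter `max_sq_length` are made up for this illustration.

```python
import math


def clip_incoming_weights(incoming, max_sq_length):
    """Scale the incoming weight vector down if its squared length exceeds the limit."""
    sq_length = sum(w * w for w in incoming)
    if sq_length <= max_sq_length:
        return list(incoming)  # within the limit: leave the weights alone
    scale = math.sqrt(max_sq_length / sq_length)
    return [w * scale for w in incoming]


clipped = clip_incoming_weights([3.0, 4.0], max_sq_length=4.0)    # squared length 25
unchanged = clip_incoming_weights([0.3, 0.4], max_sq_length=4.0)  # squared length 0.25
print(clipped, unchanged)
```

The direction of the weight vector is kept; only its length is reduced.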

Add Gaussian noise to the inputs: the variance of the noise is amplified by the squared weight before going into the next layer.
Linear output: The amplified noise gets added to the output. (Additive contribution to the squared error)
Adding Gaussian noise to the weights is not equivalent to using an L2 weight penalty for non-linear multilayer nets
- But may work better, esp. in recurrent networks.
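
For a single linear output unit the effect of input noise can be written out. A sketch, assuming independent zero-mean Gaussian noise $\epsilon_i$ with variance $\sigma_i^2$ on each input $x_i$ (these symbols are introduced here, they are not in the notes):

$y^{noisy} = \sum_i w_i x_i + \sum_i w_i \epsilon_i$

$E\left[(y^{noisy} - t)^2\right] = (y - t)^2 + \sum_i w_i^2 \sigma_i^2$

The cross term vanishes because the noise has zero mean. In this simple linear case, minimizing the expected squared error under input noise therefore acts like an L2 penalty on $w_i$ with coefficient $\sigma_i^2$.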

Use noise in activities as regularizer.

Assume a prior probability distribution for everything.
After seeing some data, combine the prior distribution with a likelihood term
to get the posterior distribution.

Likelihood term takes into account how probable the observed data is given the parameters of the model.

Favors parameters that make data likely.

Frequentist answer (maximum likelihood): Pick the value of p that makes the observation of 53 heads and 47 tails most probable.

This value is p=0.53.

$P(D) = p^{53} (1-p)^{47}$

$\frac{dP(D)}{dp} = 0 \text{ for } p = 0.53$
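
The maximum can also be checked numerically. A small sketch that evaluates the log of the likelihood above (53 heads, 47 tails) on a grid of candidate values for p; the grid resolution of 0.001 is an arbitrary choice for this example.

```python
import math

heads, tails = 53, 47


def log_likelihood(p):
    # log of p^heads * (1 - p)^tails
    return heads * math.log(p) + tails * math.log(1.0 - p)


# Candidate values 0.001, 0.002, ..., 0.999 (0 and 1 are excluded because of the log).
grid = [i / 1000 for i in range(1, 1000)]
p_ml = max(grid, key=log_likelihood)
print(p_ml)
```

Working with the log does not move the maximum and avoids very small numbers.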

Instead of a single answer, use a full distribution over answers.

Start with a prior distribution over p (e.g. a uniform distribution). Multiply the prior probability of each parameter value by the probability of observing a head given that value. Then scale up all of the probability densities so that their integral comes to 1. This is the posterior distribution.
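
A sketch of this update on a discretized p axis, assuming a grid of 1001 points and a simple Riemann sum for the integral (both are choices made for this illustration). The same step is repeated for every observation, using 1 - p as the likelihood of a tail.

```python
def update(density, grid, saw_head):
    """One Bayesian update: multiply by the likelihood, then renormalize."""
    likelihood = [p if saw_head else 1.0 - p for p in grid]
    unnormalized = [d * l for d, l in zip(density, likelihood)]
    step = grid[1] - grid[0]
    area = sum(unnormalized) * step  # approximate integral over [0, 1]
    return [u / area for u in unnormalized]


n_points = 1001
grid = [i / (n_points - 1) for i in range(n_points)]
density = [1.0] * n_points  # uniform prior over p

for saw_head in [True] * 53 + [False] * 47:
    density = update(density, grid, saw_head)

mode = grid[max(range(n_points), key=lambda i: density[i])]
print(mode)
```

With a uniform prior the peak of the posterior sits at the maximum likelihood value, but the posterior also shows how uncertain that value is.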

Find weight vector that minimizes squared residuals ⇔ Find weight vector that maximizes log probability density of correct answer.

Assumption: The answer is generated by adding Gaussian noise to the output of the neural network.

First run the neural net on some input, then add Gaussian noise to its output.

$y_c = f(input_c, W)$

Probability density of the target value given the net's output plus Gaussian noise:

$p(t_c|y_c) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(t_c - y_c)^2}{2\sigma^2}}$

Gaussian distribution centered at the net's output.

$-\log p(t_c|y_c) = k + \frac{(t_c - y_c)^2}{2\sigma^2}$
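
A numerical check of the last step, assuming the constant is $k = \log(\sqrt{2\pi}\sigma)$, which is what taking the negative log of the density above gives. The function names and the example numbers are made up.

```python
import math


def gaussian_density(t, y, sigma):
    """Density of target t under a Gaussian centered at the net's output y."""
    return math.exp(-(t - y) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def neg_log_density(t, y, sigma):
    k = math.log(math.sqrt(2 * math.pi) * sigma)  # does not depend on y
    return k + (t - y) ** 2 / (2 * sigma ** 2)


t, sigma = 1.0, 0.5
direct = -math.log(gaussian_density(t, 0.8, sigma))
via_formula = neg_log_density(t, 0.8, sigma)
print(direct, via_formula)
```

Because k does not depend on the output, minimizing the negative log density over the weights is the same as minimizing the squared residual.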

Maximum a Posteriori

Full Bayesian approach: Find the full posterior distribution over all possible weight vectors. ⇒ Too many combinations to be feasible for more than a handful of weights.

Instead, try to find most probable weight vector.

Find optimum by starting with a random weight vector, then adjusting it in the direction that improves $p(W|D)$.
- Only local optimum

$\dots$

Interpret weight penalties as the ratio of two variances.
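
One way to see the ratio, sketched under two assumptions that are not spelled out in these notes: the output noise is Gaussian with variance $\sigma_D^2$, and the prior over each weight is a zero-mean Gaussian with variance $\sigma_W^2$. With $E = \frac{1}{2}\sum_c (t_c - y_c)^2$, the negative log posterior is, up to constants:

$-\log p(W|D) = \frac{1}{\sigma_D^2} E + \frac{1}{2\sigma_W^2}\sum_i w_i^2 + \text{const}$

Multiplying through by $\sigma_D^2$ gives a cost of the same form as the L2 weight penalty:

$C = E + \frac{\sigma_D^2}{2\sigma_W^2}\sum_i w_i^2$

Comparing with $C = E + \frac{\lambda}{2}\sum_i w_i^2$ suggests $\lambda = \sigma_D^2 / \sigma_W^2$, the ratio of the noise variance to the prior variance.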

After learning a model that minimizes the squared error, find the best value for the output noise.

The best value is the one that maximizes the probability of producing exactly the correct answers after adding Gaussian noise to the output. It is the variance of the residual errors.

Empirical Bayes:

Set the variance of the Gaussian prior so that it makes the weights that the model learned most likely.
Done by fitting a zero-mean Gaussian to the one-dimensional distribution of the learned weight values.

QD-method:

Guess noise variance and weight prior variance
Repeat:
- Learn, using the ratio of the variances as the weight penalty coefficient.
- Reset the noise variance to be the variance of the residual errors.
- Reset the weight prior variance to be the variance of the distribution of the actual learned weights.
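
The loop above, sketched on a toy problem. Everything specific here is an assumption made for illustration: a linear model with three weights stands in for the network, the data is synthetic, "Learn" is plain gradient descent on $E + \frac{\lambda}{2}\sum_i w_i^2$ with $E$ as half the sum of squared residuals, and the number of repeats is fixed at 5.

```python
import random
import statistics

rng = random.Random(0)
true_w = [2.0, -1.0, 0.5]  # hypothetical weights used only to generate toy data
inputs = [[rng.uniform(-1, 1) for _ in true_w] for _ in range(50)]
targets = [sum(w * x for w, x in zip(true_w, xs)) + rng.gauss(0, 0.1)
           for xs in inputs]


def predict(w, xs):
    return sum(wi * xi for wi, xi in zip(w, xs))


def learn(lam, steps=500, lr=0.01):
    """Gradient descent on E + lam/2 * sum(w_i^2), E = 1/2 * sum of squared residuals."""
    w = [0.0] * len(true_w)
    for _ in range(steps):
        grad = [lam * wi for wi in w]  # gradient of the weight penalty
        for xs, t in zip(inputs, targets):
            err = predict(w, xs) - t
            for i, xi in enumerate(xs):
                grad[i] += err * xi    # gradient of the squared error
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


noise_var, weight_var = 1.0, 1.0  # initial guesses
for _ in range(5):
    lam = noise_var / weight_var  # ratio of variances as penalty coefficient
    w = learn(lam)
    residuals = [t - predict(w, xs) for xs, t in zip(inputs, targets)]
    noise_var = statistics.pvariance(residuals)
    # Zero-mean Gaussian fit to the weights: variance = mean of squared weights.
    weight_var = sum(wi * wi for wi in w) / len(w)

print(lam, noise_var, weight_var, w)
```

No validation set is needed for this, so different weight penalties (e.g. per layer) could be set the same way.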