Normalize training sets

Center and standardize

$\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$

$x = x-\mu$

Standardize:

$x /= \sigma$, with $\sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)})^2$ computed after centering (divide by the standard deviation, not the variance).

Use the same $\mu$ and $\sigma$ for all data partitions (train/test/…).
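A minimal numpy sketch of the two steps above (the data values are made up for illustration); note that $\mu$ and $\sigma$ come from the training set only and are reused on the test set:

```python
import numpy as np

# Hypothetical data: rows are examples, columns are features.
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.0, 250.0]])

# Compute mu and sigma on the TRAINING set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same mu and sigma to every partition.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma

print(X_train_norm.mean(axis=0))  # ~[0, 0]
print(X_train_norm.std(axis=0))   # ~[1, 1]
```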

Why normalize?

When features are on very different scales, the cost function can become an elongated bowl ⇒ a small learning rate may be needed.

Exploding / vanishing gradients

When weights are larger or smaller than 1, activations and gradients can grow or vanish exponentially with depth. A small learning rate must then be used ⇒ slow training.
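A toy illustration of the exponential effect: in a deep linear chain, the output scale is roughly a product of per-layer weight magnitudes, so a factor slightly above or below 1, raised to the depth, explodes or vanishes:

```python
# 50-layer chain: per-layer factor 1.5 explodes, 0.5 vanishes.
depth = 50
print(1.5 ** depth)  # ~6.4e8
print(0.5 ** depth)  # ~8.9e-16
```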

Weight init for deep network

$W^{[l]} = \text{np.random.randn(shape)} * \sqrt{2/n^{[l-1]}}$

Large $n$ ⇒ smaller $W_i$.

The factor $2/n^{[l-1]}$ is for ReLUs; for tanh use Xavier initialization, $\sqrt{1/n^{[l-1]}}$.
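A sketch of both variants (the function name and shapes are illustrative):

```python
import numpy as np

def init_layer(n_out, n_in, activation="relu"):
    # He initialization (ReLU): variance 2 / n_in.
    # Xavier initialization (tanh): variance 1 / n_in.
    scale = np.sqrt(2.0 / n_in) if activation == "relu" else np.sqrt(1.0 / n_in)
    return np.random.randn(n_out, n_in) * scale

W = init_layer(100, 1000, "relu")
print(W.std())  # ≈ sqrt(2/1000) ≈ 0.045
```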

Convergence speed (quadratic bowl)

Error surface of a linear neuron with squared error ⇒ quadratic bowl. For multi-layer non-linear nets the error surface is more complicated, but locally a piece of a quadratic bowl is usually a good approximation.

If the cross-section is not a circle but an ellipse:

Learning rate: if too big, the oscillation diverges.

Stochastic gradient descent

If the dataset is highly redundant, the gradient on the first half is very similar to the gradient on the second half.

⇒ Update the weights with the gradient from the first half, then compute the gradient for the new weights on the second half. Extreme version: update the weights after each case (“online”).

Mini-batches are usually better than online updates:

Mini-batches need to be balanced across classes.

Basic mini-batch gradient descent algorithm

Guess an initial learning rate:

Towards the end of mini-batch learning, turn down the learning rate ⇒ removes fluctuations in the final weights.

Turn down the learning rate when the error stops decreasing; use the error on a separate validation set.
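The basic loop above can be sketched on a toy 1-D linear regression (data, batch size, and decay factor are made-up illustrations; the decay here is per-epoch rather than triggered by validation error):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x + noise.
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

w, lr, batch = 0.0, 0.1, 32
for epoch in range(20):
    perm = rng.permutation(len(X))          # shuffle each epoch
    for i in range(0, len(X), batch):
        idx = perm[i:i + batch]
        # Gradient of mean squared error on this mini-batch.
        grad = 2 * np.mean((w * X[idx, 0] - y[idx]) * X[idx, 0])
        w -= lr * grad
    lr *= 0.9  # turn down the learning rate towards the end

print(w)  # close to 3.0
```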

Weight initialization

Shifting and scaling inputs

Shift the input so that each component of the input vector has zero mean over the whole training set. Scale the input so that each component has unit variance over the whole training set. The hyperbolic tangent ($2 \cdot \text{logistic} - 1$) produces hidden activations that are roughly zero-mean.

⇒ circular error surface.

More thorough: Decorrelate input components

Big win for linear neuron.

With PCA:
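A minimal sketch of decorrelating the input components via PCA (rotating centered data onto the eigenvectors of its covariance matrix; the synthetic data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D inputs (mixing matrix chosen arbitrarily).
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0], [0.0, 1.0]])
X = X - X.mean(axis=0)  # center first

# Eigendecomposition of the covariance matrix.
cov = X.T @ X / len(X)
eigval, eigvec = np.linalg.eigh(cov)

# Rotate onto the principal axes: components become uncorrelated.
X_pca = X @ eigvec
cov_pca = X_pca.T @ X_pca / len(X_pca)
print(np.round(cov_pca, 6))  # off-diagonal entries ~0
```

Dividing each component additionally by $\sqrt{\text{eigval}}$ would give unit variance in every direction (whitening), i.e. the circular error surface for a linear neuron.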

Common problems in multi-layer networks

Starting with a big learning rate leads to big weights in the hidden units. The error derivatives for the hidden units then become tiny, and the error stops decreasing. ⇒ Usually ends up on a plateau, not in a local minimum.

In nets with squared error or cross-entropy error, the best strategy is then to make the output equal to the proportion of the time it should be a 1. ⇒ Also ends up on a plateau, not in a local minimum.

Ways to speed up mini-batch learning

Momentum method

If a “ball” on the error surface has enough velocity, it doesn't follow the steepest-descent direction; its momentum carries it along.

$v(t) = \alpha v(t-1) - \epsilon \frac{\partial E}{\partial w}(t)$

The velocity also decays by $\alpha$ (close to 1).

At the beginning of learning there may be large gradients ⇒ use a small momentum (e.g. 0.5) at first, then e.g. 0.9 or 0.99.

Allows using larger learning rates than without momentum.
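The velocity update above, applied to a toy quadratic $E(w) = w^2$ (the step sizes are illustrative):

```python
def momentum_step(w, v, grad, alpha=0.9, eps=0.01):
    # v(t) = alpha * v(t-1) - eps * dE/dw
    v = alpha * v - eps * grad
    return w + v, v

# Minimize E(w) = w^2 (gradient 2w), starting from w = 5.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, 2 * w)
print(w)  # close to 0
```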

Nesterov

Adaptive Learning rates for each connection

Multi-layer net: appropriate learning rates can vary widely between weights.

Start with a local gain of 1 for every weight: $\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$

Big gains will decay rapidly if oscillations start (the multiplicative decrease outweighs the additive increase).
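A sketch of one common gain-adaptation rule (the specific constants +0.05 and ×0.95 are illustrative): increase the gain additively while the gradient keeps its sign, decrease it multiplicatively when the sign flips.

```python
import numpy as np

def update_gain(gain, grad, last_grad):
    # Additive increase on sign agreement, multiplicative decrease on a
    # sign flip, so big gains decay rapidly once oscillation starts.
    same_sign = grad * last_grad > 0
    return np.where(same_sign, gain + 0.05, gain * 0.95)

gain = np.ones(2)
# Weight 0 oscillates, weight 1 has a consistent gradient sign.
grads = [np.array([1.0, 1.0]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])]
for g_prev, g in zip(grads, grads[1:]):
    gain = update_gain(gain, g, g_prev)
print(gain)  # oscillating weight: 0.95^2 = 0.9025; consistent: 1.10
```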

Rmsprop

rprop

Magnitude of gradients can be very different for different weights and can change during learning

⇒ Hard to choose global learning rate.

Full-batch learning can deal with this by using only the sign of the gradient: weight updates are all of the same magnitude, which escapes quickly from plateaus with tiny gradients.

Rprop: use the sign of the gradient and adapt the step size for each weight separately.

Rprop does not work with (small to medium) mini-batches: with a small learning rate, SGD relies on gradients averaging out over successive mini-batches. Sign-only updates don't average, so rprop can e.g. make a weight grow even though the averaged gradient is zero.
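A tiny numeric illustration of that failure, assuming one weight that sees nine mini-batch gradients of +0.1 and one of −0.9:

```python
import numpy as np

# Ten mini-batch gradients for one weight.
grads = [0.1] * 9 + [-0.9]

# Averaging-based SGD: the contributions cancel.
avg = sum(grads)
print(avg)  # ~0

# Sign-only (rprop-style) updates with a fixed step: the weight grows.
step = 0.1
sign_sum = sum(step * np.sign(g) for g in grads)
print(round(sign_sum, 10))  # 0.8
```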

rmsprop

Rprop is equivalent to using the gradient but dividing by its size.

Problem: this divides by a different number for each mini-batch ⇒ force the division number to be similar for adjacent mini-batches.

rmsprop: Keep moving average of the squared gradient for each weight.

$\text{MeanSquare}(w,t) = 0.9 \, \text{MeanSquare}(w,t-1) + 0.1 \left(\frac{\partial E}{\partial w}(t)\right)^2$

Divide gradient by $\sqrt{\text{MeanSquare}(w,t)}$ ⇒ improved learning.
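The two formulas above as a sketch, again on the toy quadratic $E(w) = w^2$ (learning rate and iteration count are illustrative):

```python
import numpy as np

def rmsprop_step(w, ms, grad, lr=0.001, decay=0.9, eps=1e-8):
    # MeanSquare(w, t) = 0.9 * MeanSquare(w, t-1) + 0.1 * grad^2
    ms = decay * ms + (1 - decay) * grad ** 2
    # Divide the gradient by sqrt(MeanSquare); eps avoids division by 0.
    w = w - lr * grad / (np.sqrt(ms) + eps)
    return w, ms

# Minimize E(w) = w^2 (gradient 2w), starting from w = 5.
w, ms = 5.0, 0.0
for _ in range(10000):
    w, ms = rmsprop_step(w, ms, 2 * w)
print(w)  # close to 0
```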

Summary

Hyperparameter tuning

Andrew Ng's importance ranking (same number = same tier):

1. $\alpha$

2. $\beta \approx 0.9$ (momentum)

2. Number of hidden units

2. Mini-batch size

3. Number of layers

3. Learning rate decay

4. Adam's $\beta_1$, $\beta_2$, $\epsilon$ parameters

How to find good values

Scales
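For scale-sensitive hyperparameters like $\alpha$, a common recipe (from Ng's course) is to sample uniformly in the exponent rather than in the value itself. A sketch, with the range $[10^{-4}, 10^0]$ chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the learning rate on a log scale between 1e-4 and 1e0:
# uniform in the exponent, not in the value.
r = rng.uniform(-4, 0, size=1000)
alphas = 10.0 ** r

# Each decade gets roughly a quarter of the samples.
print(np.mean(alphas < 1e-3))  # ≈ 0.25
```

Uniform sampling of the value itself would spend ~90% of the trials between 0.1 and 1.0 and almost never try the small learning rates.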

Batch normalization

Can the values be normalized within the layers as well? $Z^{[l](i)}...$

First mini-batch $X^{\{1\}}$: \[X^{\{1\}} \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow[\text{BN}]{\beta^{[1]}, \gamma^{[1]}} \tilde{Z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{Z}^{[1]}) \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]} \rightarrow \dots \]

\[X^{\{2\}} \rightarrow \dots \]

$Z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$

But batch norm zeroes the mini-batch mean, so no $b^{[l]}$ is needed; $\beta^{[l]}$ takes its role: $\tilde{Z}^{[l]} = \gamma^{[l]} Z^{[l]}_{\text{norm}} + \beta^{[l]}$

Algo
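A minimal sketch of the forward step for one layer's pre-activations (shapes follow the one-column-per-example convention; the data is illustrative):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Normalize over the mini-batch (axis 1: one column per example),
    # then scale and shift with the learnable gamma and beta.
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 64))  # 4 units, 64 examples
Z_tilde = batch_norm_forward(Z, gamma=np.ones((4, 1)), beta=np.zeros((4, 1)))
print(Z_tilde.mean(axis=1))  # ~0 per unit
print(Z_tilde.std(axis=1))   # ~1 per unit
```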

Why does it work

Covariate shift (shifting input distribution)

Batch norm at test time

At test time there is no mini-batch, only one sample at a time.

Estimate $\sigma^2, \mu$ using exponentially weighted average across mini-batches
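A sketch of those running estimates for a single scalar feature (the momentum value and data are illustrative):

```python
import numpy as np

# Exponentially weighted averages of mu and sigma^2 across mini-batches,
# maintained during training and reused at test time.
momentum = 0.9
running_mu, running_var = 0.0, 1.0

rng = np.random.default_rng(0)
for _ in range(200):  # training: one mini-batch per iteration
    batch = rng.normal(loc=5.0, scale=2.0, size=64)
    running_mu = momentum * running_mu + (1 - momentum) * batch.mean()
    running_var = momentum * running_var + (1 - momentum) * batch.var()

# Test time: normalize a single sample with the running estimates.
x = 6.0
x_norm = (x - running_mu) / np.sqrt(running_var + 1e-8)
print(running_mu, running_var)  # ≈ 5 and ≈ 4
```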