Normalize training sets
Center and standardize
$\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$
$x = x-\mu$
Standardize:
$x /= \sigma$, with $\sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)})^2$ computed after centering (divide by the standard deviation $\sigma$, not the variance).
Use the same $\mu$ and $\sigma$ for all data partitions (train/test/…).
Why normalize?
When features are on very different scales, the cost function can be an elongated bowl, so gradient descent may need a very small learning rate and many steps; with normalized inputs the bowl is more symmetric.
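A minimal numpy sketch of the above (toy data; array names are mine): compute $\mu$ and $\sigma$ on the training set only and reuse them for every partition.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # toy training data
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))    # toy test data

# Statistics come from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the SAME mu and sigma to every partition
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma
```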
Exploding / vanishing gradients
When weights are consistently larger or smaller than 1, activations and gradients can grow or vanish exponentially with the depth of the network. A small learning rate must then be used ⇒ slow.
Weight init for deep network
$W^{[l]} = \text{np.random.randn(shape)} \cdot \sqrt{2/n^{[l-1]}}$ (randn, i.e. Gaussian, not rand)
Large $n^{[l-1]}$ ⇒ smaller $W_i$.
The factor $\sqrt{2/n^{[l-1]}}$ (He initialization) is for ReLUs.
For tanh use Xavier initialization, $\sqrt{1/n^{[l-1]}}$.
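A sketch of this initialization for one layer (the helper name `init_layer` is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, activation="relu"):
    """He initialization for ReLU, Xavier for tanh (sketch)."""
    scale = np.sqrt(2.0 / n_in) if activation == "relu" else np.sqrt(1.0 / n_in)
    W = rng.standard_normal((n_out, n_in)) * scale  # Gaussian (randn), then scaled
    b = np.zeros((n_out, 1))                        # biases can start at zero
    return W, b
```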
Convergence speed (quadratic bowl)
Error surface of a linear neuron with squared error ⇒ quadratic bowl. For multi-layer non-linear nets the error surface is more complicated, but locally a piece of a quadratic bowl is usually a good approximation.
If the surface is not a circle but an ellipse:
- The gradient is big in the direction in which we only want to travel a small distance.
- The gradient is small in the direction in which we want to travel a large distance.
Learning rate: If too big, oscillation diverges.
- Want to move quickly in directions with small, but consistent gradients.
- Want to move slowly in directions with big but inconsistent gradients.
Stochastic gradient descent
If the dataset is highly redundant, the gradient on the first half is very similar to the gradient on the second half.
⇒ Update the weights with the gradient from the first half, then get a gradient for the new weights on the second half. Extreme version: update the weights after each case (“online” learning).
Mini-batches are usually better than online:
- Less computation per weight update
- Gradients for many cases can be computed at once with efficient matrix multiplications
Mini-batches need to be balanced for classes.
Basic mini-batch gradient descent algorithm
Guess initial learning rate:
- If error keeps getting worse, or oscillates ⇒ reduce learning rate
- If error is falling consistently, but slowly ⇒ increase learning rate
Towards the end of mini-batch learning, turn down the learning rate ⇒ removes the fluctuations in the final weights caused by the noisy mini-batch gradients.
Turn the learning rate down when the error stops decreasing; measure the error on a separate validation set.
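A sketch of this loop (everything here is hypothetical scaffolding: `grad_fn` returns per-parameter gradients, `val_error_fn` evaluates validation error):

```python
import numpy as np

def minibatch_gd(params, grad_fn, data, val_error_fn,
                 lr=0.1, batch_size=32, epochs=50, decay=0.5):
    """Mini-batch gradient descent; turn the learning rate down
    when the validation error stops decreasing."""
    best_val = np.inf
    for _ in range(epochs):
        np.random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grads = grad_fn(params, batch)
            for k in params:
                params[k] -= lr * grads[k]
        val = val_error_fn(params)
        if val >= best_val:   # error stopped falling => reduce learning rate
            lr *= decay
        best_val = min(best_val, val)
    return params
```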
Weight initialization
- Small random values (not identical).
- If a hidden unit has a big fan-in, small changes in its incoming weights can cause learning to overshoot ⇒ we want smaller incoming weights when the fan-in is big. Initialize weights proportional to $1/\sqrt{\text{fan-in}}$ (the learning rate can be scaled the same way).
Shifting and scaling inputs
Shift the input so that each component of the input vector has zero mean over the whole training set. Scale the input so that each component has unit variance over the whole training set. The hyperbolic tangent ($2 \cdot \text{logistic} - 1$) produces hidden activations that are roughly zero-mean.
⇒ more circular error surface.
More thorough: Decorrelate input components
Big win for linear neuron.
With PCA:
- Drop the principal components with the smallest eigenvalues.
- Divide the remaining principal components by the square roots of their eigenvalues (for a linear neuron, this converts an axis-aligned elliptical error surface into a circular one).
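A numpy sketch of this PCA whitening (function name mine; assumes `X` is already centered):

```python
import numpy as np

def pca_whiten(X, k):
    """Decorrelate centered inputs X (m, n) with PCA, keep the k
    components with the largest eigenvalues, and whiten them."""
    cov = X.T @ X / X.shape[0]                 # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    V, lam = eigvecs[:, idx], eigvals[idx]
    return (X @ V) / np.sqrt(lam)              # unit variance along each axis
```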
Common problems in multi-layer networks
Starting with a big learning rate drives the weights of the hidden units very large; the error derivatives for the hidden units become tiny and the error stops decreasing. ⇒ Usually ends up on a plateau, not in a local minimum.
In nets with squared error or cross-entropy error, the best quick strategy is to make the output equal to the proportion of the time it should be a 1. The net finds this strategy fast and can then appear stuck. ⇒ Also ends up on a plateau, not in a local minimum.
Ways to speed up mini-batch learning
- Use momentum: use the gradient to change the velocity of the weight particle, not its position.
- Use separate adaptive learning rates for each parameter (based on the consistency of the gradient for that parameter).
- rmsprop: divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight (the mini-batch version of using just the sign of the gradient).
- Fancy methods from the optimization literature that make use of curvature information.
Momentum method
If a “ball” on the error surface has enough velocity, it doesn't follow the direction of steepest descent; its momentum carries it along.
- Damps oscillations by combining gradients with opposite signs.
- Builds up speed in directions with a gentle but consistent gradient.
$v(t) = \alpha\, v(t-1) - \epsilon \frac{\partial E}{\partial w}(t)$
The velocity also decays by the factor $\alpha$ (close to 1).
At beginning of learning, there might be large gradients ⇒ Small momentum (e.g. 0.5) at beginning, then e.g. 0.9 or 0.99.
Allows to use larger learning rates than without momentum.
Nesterov
- First make a big jump in the direction of the previously accumulated gradient.
- Then measure the gradient where you end up and make a correction.
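A sketch of both updates for a single parameter vector (`grad` is a hypothetical function returning $\partial E/\partial w$):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    # v(t) = alpha * v(t-1) - eps * dE/dw(t)
    v = alpha * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, alpha=0.9):
    w_ahead = w + alpha * v              # big jump along accumulated gradient
    v = alpha * v - lr * grad(w_ahead)   # correction measured at the new point
    return w + v, v
```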
Adaptive learning rates for each connection
Multi-layer net: appropriate learning rates can vary widely between weights.
- Global learning rate
- Local gain for each weight
Start with a local gain of 1 for every weight: $\Delta w_{ij} = -\epsilon\, g_{ij} \frac{\partial E}{\partial w_{ij}}$
- If the gradient for that weight does not change sign
- increase local gain (small additive increase: $+ \delta$)
- Else
- decrease gain (multiplicative decreases: $* 1-\delta$)
Big gains will decay rapidly if oscillations start (for a gain above 1, the multiplicative decrease is bigger than the additive increase).
- Limit gains [0.1, 10]
- Use full batch learning or very big mini-batches.
- Adaptive learning rates can be combined with momentum
- Adaptive learning rates only deal with axis-aligned effects.
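A sketch of the local-gain update (names mine; `g` and `g_prev` are the current and previous gradients for the weight matrix):

```python
import numpy as np

def update_gains(g, g_prev, gain, delta=0.05):
    """Additive increase while the gradient keeps its sign,
    multiplicative decrease when it flips; gains limited to [0.1, 10]."""
    same_sign = g * g_prev > 0
    gain = np.where(same_sign, gain + delta, gain * (1 - delta))
    return np.clip(gain, 0.1, 10.0)

# weight update: dW = -eps * gain * g   (can be combined with momentum)
```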
Rmsprop
rprop
The magnitude of the gradient can be very different for different weights and can change during learning
⇒ hard to choose a single global learning rate.
Full-batch learning: deal with this by using only the sign of the gradient. Weight updates are then all of the same magnitude, which escapes from plateaus with tiny gradients quickly.
Rprop: Use sign of gradient, adapt step size for each weight.
- Increase the step size multiplicatively (e.g. ×1.2) if the signs of the last two gradients agree.
- Otherwise decrease the step size multiplicatively (e.g. ×0.5).
- Limit the step sizes to be less than 50 and more than a millionth.
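A simplified numpy sketch of that update rule (names mine; classic rprop has a few extra details):

```python
import numpy as np

def rprop_step(w, g, g_prev, step):
    """Use only the sign of the gradient; adapt a per-weight step size."""
    step = np.where(g * g_prev > 0, step * 1.2, step * 0.5)
    step = np.clip(step, 1e-6, 50.0)   # a millionth .. 50
    return w - np.sign(g) * step, step
```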
Rprop does not work with small or medium mini-batches: with a small learning rate, gradient descent effectively averages the gradients over successive mini-batches (e.g. +0.1 on nine batches and −0.9 on one roughly cancel), but rprop would increment the weight nine times and decrement it only once by similar amounts ⇒ the weight grows large.
rmsprop
Rprop is equivalent to using the gradient while dividing by its magnitude.
Problem: we divide by a different number for each mini-batch ⇒ force the divisor to be similar for adjacent mini-batches.
rmsprop: Keep moving average of the squared gradient for each weight.
$\text{MeanSquare}(w,t) = 0.9\,\text{MeanSquare}(w,t-1) + 0.1\left(\frac{\partial E}{\partial w}(t)\right)^2$
Divide gradient by $\sqrt{\text{MeanSquare}(w,t)}$ ⇒ improved learning.
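A sketch of the rmsprop update (names mine; the small `eps` in the denominator is a standard numerical-stability addition, not from the lecture):

```python
import numpy as np

def rmsprop_step(w, g, mean_sq, lr=0.001, eps=1e-8):
    """Divide the gradient by a running RMS of its recent magnitudes."""
    mean_sq = 0.9 * mean_sq + 0.1 * g**2
    return w - lr * g / (np.sqrt(mean_sq) + eps), mean_sq
```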
Summary
- Small datasets / bigger datasets without redundancy
- Full-batch
- Conjugate gradient, L-BFGS, adaptive learning rates, rprop, …
- Big, redundant datasets
- Mini-batches
- Gradient descent with momentum
- rmsprop
- LeCun's latest recipe
- No generic recipe:
- Structure: deep nets (with bottlenecks), recurrent nets, wide shallow nets
- Tasks: some require very accurate weights, some have many rare cases (e.g. words).
Hyperparameter tuning
Andrew Ng's importance ranking:
1. $\alpha$
2. $\beta \approx 0.9$ (momentum)
2. Number of hidden units
2. Mini-batch size
3. Number of layers
3. Learning rate decay
4. Adam $\beta$ parameters
How to find good values
- Traditional method: grid of hyperparameter combinations
- Better: sample random combinations
- Coarse to fine: when a region of values works well, zoom in and sample more densely there
Scales
- Sample uniformly at random on a linear scale: number of layers, number of hidden units
- $\alpha$: sample on a log scale between 0.0001 and 0.1, i.e. $\alpha = 10^r$ with $r \in [-4, -1]$
- Exponentially weighted averages: $\beta$ between 0.9 and 0.999 (averaging over ~10 vs. ~1000 values), so sample $1-\beta = 10^r$ on a log scale, i.e. $\beta = 1 - 10^r$ with $r \in [-3, -1]$
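A sketch of this sampling (the structural ranges are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear scale: discrete structural choices
n_layers = rng.integers(2, 6)       # e.g. 2..5 layers
n_hidden = rng.integers(50, 301)    # e.g. 50..300 units

# Log scale for the learning rate: alpha = 10^r, r in [-4, -1]
alpha = 10 ** rng.uniform(-4, -1)

# Log scale for beta via 1 - beta: beta = 1 - 10^r, r in [-3, -1]
beta = 1 - 10 ** rng.uniform(-3, -1)
```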
Batch normalization
- Normally applied per mini-batch
Can we normalize values inside the network, i.e. the pre-activations $Z^{[l](i)}$, and not just the inputs?
First mini-batch $X^{\{1\}}$: \[X^{\{1\}} \xrightarrow{W^{[1]},\, b^{[1]}} Z^{[1]} \xrightarrow[\text{BN}]{\beta^{[1]},\, \gamma^{[1]}} \tilde{Z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{Z}^{[1]}) \xrightarrow{W^{[2]},\, b^{[2]}} Z^{[2]} \rightarrow \dots \]
\[X^{\{2\}} \rightarrow \dots \]
$Z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
But BN subtracts the mini-batch mean, so $b^{[l]}$ is zeroed out and not needed; $\beta^{[l]}$ takes over its role: $\tilde{Z}^{[l]} = \gamma^{[l]} Z^{[l]}_{\text{norm}} + \beta^{[l]}$
Algorithm
- For minibatch t:
- Compute forward prop for $X^{\{t\}}$
- In each hidden layer use BN to replace $Z^l$ with $\tilde{Z}^l$
- Use backprop to compute $dW^{l}, d\beta^{l}, d\gamma^{l}$
- Update parameters …
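A sketch of the BN transform inside one layer's forward prop (function name mine; `eps` avoids division by zero):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Replace Z with Z-tilde for one layer; Z is (n_units, batch_size),
    gamma/beta are learned (n_units, 1) parameters."""
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mini-batch mean
    var = Z.var(axis=1, keepdims=True)        # per-unit mini-batch variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * Z_norm + beta, mu, var     # learned scale and shift
```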
Why does it work
Covariate shift (the input distribution shifts)
- Batch norm reduces the amount by which the hidden unit values shift around; the inputs to later layers become more stable.
- Slight regularization effect: the mean and variance are computed on each mini-batch, which adds some noise to the $\tilde{Z}$ values.
Batch norm at test time
At test time there is no mini-batch; examples may arrive one at a time.
Estimate $\mu$ and $\sigma^2$ with an exponentially weighted average across the training mini-batches.
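Continuing the sketch above (the decay 0.9 is an assumed value; `mu`/`var` are the per-batch statistics returned by `batchnorm_forward`):

```python
import numpy as np

def batchnorm_inference(z, running_mu, running_var, gamma, beta, eps=1e-8):
    """Test time: use the running averages accumulated during training
    instead of mini-batch statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

# During training, after each mini-batch:
#   running_mu  = 0.9 * running_mu  + 0.1 * mu
#   running_var = 0.9 * running_var + 0.1 * var
```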