Center and standardize
$\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$
Center: $x := x - \mu$
Standardize: $\sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)})^2$ (element-wise, after centering), then $x := x / \sigma$
Use the same $\mu$ and $\sigma$ for all data partitions (train/test/…).
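A minimal NumPy sketch (function names are illustrative): compute $\mu$ and $\sigma$ on the training set only and reuse them unchanged for every other partition.

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and std, computed on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8  # avoid division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply the training-set statistics to any partition (train/dev/test)."""
    return (X - mu) / sigma
```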
When features are on very different scales ⇒ cost function can be an elongated bowl (small learning rate might be needed).
When weights are larger or smaller than 1, they (and the gradients) can grow or vanish exponentially in a deep network. A small learning rate must then be used ⇒ slow.
$W^{[l]} = \text{np.random.randn(shape)} \cdot \sqrt{2/n^{[l-1]}}$
Large $n^{[l-1]}$ ⇒ smaller $W_i$
The factor $2/n^{[l-1]}$ (He initialization) is for ReLUs.
For tanh use Xavier initialization, $\sqrt{1/n^{[l-1]}}$ (or the variant $\sqrt{2/(n^{[l-1]} + n^{[l]})}$).
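A sketch of this initialization rule in NumPy (the function name and layer-size arguments are illustrative):

```python
import numpy as np

def init_layer(n_prev, n_curr, activation="relu"):
    """Scaled random initialization for one fully connected layer (a sketch)."""
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)   # He initialization
    else:
        scale = np.sqrt(1.0 / n_prev)   # Xavier initialization (tanh)
    W = np.random.randn(n_curr, n_prev) * scale
    b = np.zeros((n_curr, 1))
    return W, b
```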
Error surface of a linear neuron with squared error ⇒ quadratic bowl. For multi-layer non-linear nets the error surface is more complicated, but locally a piece of a quadratic bowl is usually a good approximation.
If the error surface is not a circle but an elongated ellipse, the gradient is nearly perpendicular to the direction towards the minimum:
Learning rate: if too big, the oscillation across the ellipse diverges.
Dataset is highly redundant: Gradient in first half very similar to gradient in second half.
⇒ Update weights with the gradient in the first half, then get gradient for the new weights on the second half. Extreme version: Update weights after each case (“online”)
Mini-batches usually better than online: less computation per weight update, and matrix-matrix multiplies are efficient.
Mini-batches need to be balanced for classes.
Guess an initial learning rate: if the error gets worse or oscillates wildly, reduce it; if the error falls fairly consistently but slowly, increase it.
Towards the end of mini-batch learning, turn down the learning rate ⇒ removes fluctuations in the final weights.
Turn down the learning rate when the error stops decreasing; use the error on a separate validation set.
Shift the inputs so that each component of the input vector has zero mean over the whole training set. Scale the inputs so that each component has unit variance over the whole training set. The hyperbolic tangent ($\tanh(x) = 2 \cdot \mathrm{logistic}(2x) - 1$) produces hidden activations that are roughly zero mean.
⇒ more circular error surface.
Decorrelating the input components is a big win for a linear neuron.
With PCA: drop the smallest principal components and divide the remaining components by the square roots of their eigenvalues ⇒ circular error surface.
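A minimal NumPy sketch of this whitening step, assuming the inputs are already centered; `n_components` (how many leading components to keep) is an illustrative parameter:

```python
import numpy as np

def pca_whiten(X, n_components, eps=1e-8):
    """Decorrelate inputs with PCA and scale to unit variance (a sketch).

    X: (m, n) training matrix, assumed already zero-mean per component.
    """
    cov = X.T @ X / X.shape[0]                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    V, lam = eigvecs[:, order], eigvals[order]
    X_rot = X @ V                               # project onto principal components
    return X_rot / np.sqrt(lam + eps)           # divide by sqrt of eigenvalues
```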
Starting with a big learning rate drives the weights of the hidden units to large values, so their activations saturate. The error derivatives for the hidden units then become tiny and the error stops decreasing. ⇒ Usually ends up on a plateau, not in a local minimum.
In classification nets with squared error or cross-entropy error, the best guessing strategy is to make each output equal to the proportion of the time it should be a 1. The net finds this strategy quickly ⇒ also ends up on a plateau, not in a local minimum.
Momentum: if a “ball” on the error surface has enough velocity, it does not follow the direction of steepest descent but keeps moving in its previous direction.
$v(t) = \alpha\, v(t-1) - \epsilon\, \frac{\partial E}{\partial w}(t)$
The velocity also decays by the factor $\alpha$ (close to 1).
At the beginning of learning there might be large gradients ⇒ small momentum (e.g. 0.5) at first, later e.g. 0.9 or 0.99.
Allows using larger learning rates than without momentum.
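A minimal sketch of this update in NumPy (the learning rate and `alpha` defaults are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    """One SGD-with-momentum update:
    v(t) = alpha * v(t-1) - lr * dE/dw(t);  w(t) = w(t-1) + v(t)
    """
    v = alpha * v - lr * grad
    w = w + v
    return w, v
```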
Multi-layer net: appropriate learning rates can vary widely between weights.
Start with a local gain of 1 for every weight: $\Delta w_{ij} = -\epsilon\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}$. Increase $g_{ij}$ additively while the gradient sign for that weight stays the same; decrease it multiplicatively when the sign flips.
Big gains decay rapidly if oscillations start (for gains above 1, a multiplicative decrease removes more than an additive increase adds).
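A sketch of the per-weight gains in NumPy; the +0.05 / ×0.95 constants are one common choice, not fixed by these notes:

```python
import numpy as np

def gain_step(w, gains, grad, prev_grad, lr=0.01):
    """Per-weight adaptive gains: additive increase while the gradient sign
    is stable, multiplicative decrease when the sign flips."""
    same_sign = grad * prev_grad > 0
    gains = np.where(same_sign, gains + 0.05, gains * 0.95)
    w = w - lr * gains * grad
    return w, gains
```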
Magnitude of gradients can be very different for different weights and can change during learning
⇒ Hard to choose global learning rate.
Full-batch learning: deal with this by using only the sign of the gradient. All weight updates are then of the same magnitude; this escapes plateaus with tiny gradients quickly.
Rprop: Use sign of gradient, adapt step size for each weight.
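A minimal full-batch Rprop-style sketch in NumPy (the 1.2/0.5 step-size factors and the bounds are conventional defaults, not from these notes):

```python
import numpy as np

def rprop_step(w, step, grad, prev_grad,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One full-batch Rprop update: adapt a per-weight step size,
    then move using only the sign of the gradient."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    w = w - np.sign(grad) * step
    return w, step
```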
Rprop does not work with (small to medium) mini-batches: when the learning rate is small, the gradient is effectively averaged over successive mini-batches, but rprop only uses the sign. A weight that gets e.g. +0.1 on nine mini-batches and −0.9 on the tenth should stay roughly where it is, yet rprop would make it grow large.
Rprop is equivalent to using the gradient but dividing by the size of the gradient.
Problem: we divide by a different number for each mini-batch ⇒ force the number we divide by to be similar for adjacent mini-batches.
rmsprop: Keep moving average of the squared gradient for each weight.
$\text{MeanSquare}(w,t) = 0.9\, \text{MeanSquare}(w,t-1) + 0.1 \left(\frac{\partial E}{\partial w}(t)\right)^2$
Divide gradient by $\sqrt{\text{MeanSquare}(w,t)}$ ⇒ improved learning.
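A minimal rmsprop step in NumPy (the learning rate and `eps` defaults are illustrative additions):

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, lr=0.001, decay=0.9, eps=1e-8):
    """Keep a moving average of the squared gradient per weight and
    divide the gradient by its square root."""
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```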
Andrew Ng's importance ranking:
1. $\alpha$
2. $\beta \approx 0.9$ (momentum)
2. Number of hidden units
2. Mini-batch size
3. Number of layers
3. Learning rate decay
4. Adam $\beta$ parameters
Can the values in the hidden layers, $Z^{[l](i)}$, be normalized as well (batch normalization)?
First mini-batch $X^{\{1\}}$: \[X^{\{1\}} \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow[\text{BN}]{\beta^{[1]}, \gamma^{[1]}} \tilde{Z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{Z}^{[1]}) \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]} \rightarrow \dots \]
Second mini-batch: \[X^{\{2\}} \rightarrow \dots \]
$Z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
But batch norm zeroes the mean of $Z^{[l]}$ within each mini-batch, so $b^{[l]}$ has no effect and can be dropped; $\beta^{[l]}$ takes over its role: $Z^{[l]}_{\text{norm}} = \frac{Z^{[l]} - \mu}{\sqrt{\sigma^2 + \epsilon}}$, $\tilde{Z}^{[l]} = \gamma^{[l]} Z^{[l]}_{\text{norm}} + \beta^{[l]}$.
Covariate shift (shifting input distribution): batch norm limits how much the distribution of each hidden layer's inputs shifts around.
At test time: no mini-batch, but one sample at a time.
⇒ Estimate $\mu$ and $\sigma^2$ with an exponentially weighted average across mini-batches during training and use these estimates at test time.
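A minimal NumPy sketch of the batch norm forward pass for one layer, assuming $Z$ has shape (features, batch); the `momentum` and `eps` defaults and the running-average bookkeeping are illustrative:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, running_mu, running_var,
                      momentum=0.9, eps=1e-8, training=True):
    """Batch norm for one layer's pre-activations Z (features x batch)."""
    if training:
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        # exponentially weighted averages, reused at test time
        running_mu = momentum * running_mu + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var   # one sample at a time: use the estimates
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, running_mu, running_var
```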