Normalize training sets

Center and standardize

$\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$

$x = x-\mu$


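$x = x / \sigma$, with $\sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)})^2$ computed element-wise on the centered data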

Use the same $\mu$ and $\sigma$ for all data partitions (train/test/…).
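A minimal NumPy sketch of the normalization above. The function names `fit_normalizer` and `normalize`, the `eps` guard and the synthetic data are illustrative assumptions, and samples are assumed to be stored as rows.

```python
import numpy as np

def fit_normalizer(x_train, eps=1e-8):
    """Return per-feature mean and standard deviation of the training set."""
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0) + eps  # eps guards against division by zero
    return mu, sigma

def normalize(x, mu, sigma):
    """Center and scale x with statistics that were fitted elsewhere."""
    return (x - mu) / sigma

rng = np.random.default_rng(0)
# Two features on very different scales (synthetic data, for illustration only)
x_train = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(1000, 2))
x_test = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(200, 2))

mu, sigma = fit_normalizer(x_train)
x_train_n = normalize(x_train, mu, sigma)
x_test_n = normalize(x_test, mu, sigma)  # reuse the training statistics
```

The test set is deliberately normalized with the training statistics, so both partitions go through the same transformation.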

When features are on very different scales ⇒ cost function can be an elongated bowl (small learning rate might be needed).

Exploding / vanishing gradients

When weights are consistently larger or smaller than 1, activations and gradients can grow or vanish exponentially with depth in a deep network. A small learning rate must then be used ⇒ slow.

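Countermeasure: scale the random initial weights by the number of inputs $n^{[l-1]}$ of the layer.

$W^{[l]} = \texttt{np.random.randn(shape)} \cdot \sqrt{1/n^{[l-1]}}$

Large $n$ ⇒ smaller $W_i$

For ReLUs, a variance of $2/n^{[l-1]}$ usually works better.

For tanh, the scaling $\sqrt{1/n^{[l-1]}}$ is known as Xavier initialization.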
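A short sketch of variance-scaled initialization, assuming the commonly used variance of $2/n$ for ReLU layers and $1/n$ for tanh layers. The helper name `init_weights` and the layer sizes are hypothetical.

```python
import numpy as np

def init_weights(n_in, n_out, activation="relu", rng=None):
    """Draw Gaussian weights whose variance shrinks with the fan-in n_in."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: variance 2/n_in for ReLU, 1/n_in otherwise (e.g. tanh)
    scale = np.sqrt(2.0 / n_in) if activation == "relu" else np.sqrt(1.0 / n_in)
    return rng.standard_normal((n_out, n_in)) * scale

rng = np.random.default_rng(0)
W_relu = init_weights(1000, 500, "relu", rng)
W_tanh = init_weights(1000, 500, "tanh", rng)
```

Note the use of a standard normal draw (`standard_normal`); a uniform draw on [0, 1) would give weights that are all positive.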

Convergence speed (quadratic bowl)

Error surface of a linear neuron with squared error ⇒ quadratic bowl. For multi-layer non-linear nets the error surface is more complicated, but locally a piece of a quadratic bowl is usually a good approximation.

If the contours are not circles, but ellipses:

  • The gradient is big in the direction in which we only want to travel a small distance.
  • The gradient is small in the direction in which we want to travel a large distance.

Learning rate: If too big, oscillation diverges.

  • Want to move quickly in directions with small, but consistent gradients.
  • Want to move slowly in directions with big but inconsistent gradients.

Stochastic gradient descent

Dataset is highly redundant: Gradient in first half very similar to gradient in second half.

⇒ Update weights with the gradient in the first half, then get gradient for the new weights on the second half. Extreme version: Update weights after each case (“online”)

Mini-batches usually better than online:

  • Less computation is spent on updating the weights
  • Gradients for many cases can be computed at once with efficient matrix multiplications

Mini-batches need to be balanced for classes.
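As an illustration of mini-batch learning, here is a small sketch on a noise-free linear regression problem. The generator `minibatches`, the batch size and the learning rate are assumptions chosen for the example. It only shuffles the data and does not balance classes, which would matter for classification.

```python
import numpy as np

def minibatches(x, y, batch_size, rng):
    """Yield shuffled mini-batches that together cover the data once."""
    idx = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        yield x[batch], y[batch]

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w  # noise-free targets

w = np.zeros(3)
lr = 0.1
for epoch in range(20):
    for xb, yb in minibatches(x, y, batch_size=64, rng=rng):
        # Gradient of the mean squared error on this mini-batch only
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(xb)
        w -= lr * grad
```

Each weight update uses 64 cases instead of all 1024, so the weights are updated 16 times per pass over the data.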

Guess initial learning rate:

  • If error keeps getting worse, or oscillates ⇒ reduce learning rate
  • If error is falling consistently, but slowly ⇒ increase learning rate

Towards end of mini-batch learning, turn down learning rate ⇒ removes fluctuations in the final weights

Turn down learning rate, when the error stops decreasing; Use error on a separate validation set.
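One possible way to turn this rule into code is sketched below. The function `reduce_on_plateau` and its parameters `patience`, `factor` and `min_delta` are hypothetical choices, not part of the notes.

```python
def reduce_on_plateau(val_errors, lr, patience=3, factor=0.5, min_delta=1e-4):
    """Return a reduced learning rate if the validation error has stalled.

    val_errors: validation errors per epoch, oldest first.
    The rate is reduced when the best of the last `patience` epochs is not
    at least `min_delta` better than the best error seen before them.
    """
    if len(val_errors) <= patience:
        return lr
    best_before = min(val_errors[:-patience])
    recent_best = min(val_errors[-patience:])
    if recent_best > best_before - min_delta:
        return lr * factor
    return lr

lr_stalled = reduce_on_plateau([1.0, 0.5, 0.4, 0.4, 0.4, 0.4], lr=0.1)
lr_improving = reduce_on_plateau([1.0, 0.8, 0.6, 0.5, 0.4, 0.3], lr=0.1)
```

In the first call the error has stalled at 0.4, so the rate is halved; in the second call the error is still falling, so the rate is returned unchanged.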

  • Initialize weights with small random values (not identical).
  • If a hidden unit has a big fan-in, small changes in incoming weights can cause learning to overshoot ⇒ Want smaller incoming weights when the fan-in is big. Weight init proportional to $1/\sqrt{\text{fan-in}}$ (the learning rate can be scaled in the same way).

Shift the input so that each component of the input vector has zero mean over the whole training set. Scale the input so that each component of the input vector has unit variance over the whole training set. The hyperbolic tangent ($2 \cdot \text{logistic} - 1$) produces hidden activations that are roughly zero mean.

⇒ circle error surface.

Big win for linear neuron.

With PCA:

  • Drop principal components with the smallest eigenvalues
  • Divide the remaining principal components by the square root of their eigenvalues (for a linear neuron, this converts an axis-aligned elliptical error surface into a circular one).
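The two PCA steps above can be sketched with NumPy as follows. The function name `pca_whiten`, the `eps` guard and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_whiten(x, n_components, eps=1e-8):
    """Project x onto its top principal components and scale to unit variance."""
    x_centered = x - x.mean(axis=0)
    cov = x_centered.T @ x_centered / len(x)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    # Keep the components with the largest eigenvalues, drop the rest
    order = np.argsort(eigvals)[::-1][:n_components]
    projected = x_centered @ eigvecs[:, order]
    # Divide each remaining component by the square root of its eigenvalue
    return projected / np.sqrt(eigvals[order] + eps)

rng = np.random.default_rng(0)
# Three features with very different variances (synthetic)
x = rng.standard_normal((2000, 3)) * np.array([10.0, 1.0, 0.01])
x_white = pca_whiten(x, n_components=2)
```

After this transformation the covariance of `x_white` is approximately the identity matrix.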

Starting with a big learning rate leads to big weights of the hidden units. Error derivatives for the hidden units will become tiny and the error will not decrease. ⇒ Usually ends up in a plateau, not a local minimum.

In nets with squared error or cross-entropy error, the best guessing strategy is to make each output equal to the proportion of time it should be a 1. ⇒ Also ends up in a plateau, not a local minimum.

  • Use momentum: Use the gradient to change the velocity of the weight particle (not its position).
  • Use separate adaptive learning rates for each parameter (based on the consistency of the gradient of that parameter).
  • rmsprop: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight (mini-batch version of using the sign of the gradient).
  • Fancy methods from the optimization literature that make use of curvature information

If a “ball” on the error surface has enough velocity, it doesn't follow the steepest descent (instead momentum).

  • Damps oscillations by combining gradients with opposite signs.
  • Builds speed in directions with gentle but consistent gradient.

$v(t) = \alpha v(t-1) - \epsilon \frac{\partial E}{\partial w} (t)$

Velocity also decays by $\alpha$ (close to 1).

At beginning of learning, there might be large gradients ⇒ Small momentum (e.g. 0.5) at beginning, then e.g. 0.9 or 0.99.

Allows larger learning rates than without momentum.
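A small sketch of the momentum update on an elongated quadratic bowl. The error function, the step count and the values of the learning rate `lr` (the $\epsilon$ above) and `alpha` are assumptions picked for the example.

```python
import numpy as np

def grad(w):
    # Gradient of E(w) = 0.5 * (100 * w[0]**2 + w[1]**2), an elongated bowl
    return np.array([100.0, 1.0]) * w

def momentum_descent(w0, lr, alpha, steps):
    """Gradient descent where the gradient changes the velocity, not the position."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = alpha * v - lr * grad(w)  # v(t) = alpha * v(t-1) - lr * dE/dw(t)
        w = w + v
    return w

w_plain = momentum_descent([1.0, 1.0], lr=0.01, alpha=0.0, steps=200)
w_mom = momentum_descent([1.0, 1.0], lr=0.01, alpha=0.9, steps=200)
```

With `alpha=0.0` this reduces to plain gradient descent, which is still far from the minimum along the shallow direction after 200 steps, while the momentum run ends up much closer.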


Improved variant of momentum (Nesterov):

  • First make a big jump in the direction of the previously accumulated gradient.
  • Then measure the gradient where you end up and make a correction.
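The two steps above correspond to what is commonly called Nesterov momentum. A minimal sketch, assuming the same kind of elongated quadratic error as a test problem; the function names and constants are illustrative.

```python
import numpy as np

def grad(w):
    # Gradient of E(w) = 0.5 * (100 * w[0]**2 + w[1]**2)
    return np.array([100.0, 1.0]) * w

def nesterov_descent(w0, lr, alpha, steps):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + alpha * v             # jump along the accumulated velocity
        v = alpha * v - lr * grad(lookahead)  # correct with the gradient measured there
        w = w + v
    return w

w_nesterov = nesterov_descent([1.0, 1.0], lr=0.01, alpha=0.9, steps=200)
```

The only difference from standard momentum is where the gradient is evaluated: at the look-ahead point instead of at the current weights.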

Multi-layer net: appropriate learning rates can vary widely between weights.

  • Global learning rate
  • Local gain for each weight

Start with a local gain of 1 for every weight. $\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$

  • If the gradient for that weight does not change sign
    • increase local gain (small additive increase: $+ \delta$)
  • Else
    • decrease gain (multiplicative decrease: $\times (1-\delta)$)

Big gains will decay rapidly if oscillations start (for gains above 1, the multiplicative decrease outweighs the additive increase).

  • Limit gains [0.1, 10]
  • Use full batch learning or very big mini-batches.
  • Adaptive learning rates can be combined with momentum
  • Adaptive learning rates only deal with axis-aligned effects.
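The gain rule can be sketched as below. The value `delta=0.05` and the function name `update_gains` are assumptions; the limits follow the range given in the list above.

```python
import numpy as np

def update_gains(gains, grad, prev_grad, delta=0.05, g_min=0.1, g_max=10.0):
    """Additive increase when the gradient sign is unchanged, multiplicative decrease otherwise."""
    same_sign = grad * prev_grad > 0
    gains = np.where(same_sign, gains + delta, gains * (1.0 - delta))
    return np.clip(gains, g_min, g_max)

gains = np.ones(2)                  # start with a local gain of 1 for every weight
prev_grad = np.array([0.5, 0.5])
grad = np.array([0.4, -0.3])        # second weight: gradient changed sign
gains = update_gains(gains, grad, prev_grad)

lr = 0.01                           # global learning rate (epsilon)
delta_w = -lr * gains * grad        # per-weight update
```

The first weight keeps its gradient sign and gets a slightly larger gain, the second one flips sign and gets a smaller gain.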


Magnitude of gradients can be very different for different weights and can change during learning

⇒ Hard to choose global learning rate.

Full batch learning: Deal with it by using only the sign of the gradient. Weight updates are then all of the same magnitude. This escapes quickly from plateaus with tiny gradients.

Rprop: Use sign of gradient, adapt step size for each weight.

  • Increase step size multiplicatively (1.2) if signs of two gradients agree.
  • Decrease step size multiplicatively (0.5) otherwise.
  • Limit step sizes to be less than 50 and more than a millionth ($10^{-6}$).
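A simplified full-batch rprop sketch using the constants from the list above. Leaving the step size unchanged when the gradient product is exactly zero (e.g. on the first iteration) is an assumption of this sketch, as are the function name and the test problem.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, inc=1.2, dec=0.5,
               step_min=1e-6, step_max=50.0):
    """One rprop update: only the sign of the gradient is used."""
    product = grad * prev_grad
    # Grow the step if signs agree, shrink it if they disagree, else keep it
    step = np.where(product > 0, step * inc,
                    np.where(product < 0, step * dec, step))
    step = np.clip(step, step_min, step_max)
    return w - np.sign(grad) * step, step

# Quadratic error with very different curvature per weight
c = np.array([100.0, 0.01])
w = np.array([1.0, 1.0])
step = np.full(2, 0.1)
prev_grad = np.zeros(2)
for _ in range(100):
    grad = c * w  # full-batch gradient of 0.5 * sum(c * w**2)
    w, step = rprop_step(w, grad, prev_grad, step)
    prev_grad = grad
```

Because only the sign is used, both weights follow the same trajectory even though their gradient magnitudes differ by a factor of 10,000.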

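Rprop does not work with (small to medium) mini-batches: with a small learning rate, stochastic gradient descent effectively averages the gradients over successive mini-batches, and rprop breaks this. E.g. a weight with gradient +0.1 on nine mini-batches and −0.9 on the tenth should stay roughly where it is, but rprop would increment it nine times and decrement it only once by about the same amount, so the weight grows a lot.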


Rprop: Equivalent of using gradient, but dividing by gradient size.

Problem: Divide by different number for each mini-batch ⇒ Force division number to be similar for adjacent mini-batches.

rmsprop: Keep moving average of the squared gradient for each weight.

$\text{MeanSquare}(w,t) = 0.9 \, \text{MeanSquare}(w,t-1) + 0.1 \left(\frac{\partial E}{\partial w}(t)\right)^2$

Divide gradient by $\sqrt{\text{MeanSquare}(w,t)}$ ⇒ improved learning.
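A minimal rmsprop sketch based on the moving average above. The learning rate, the `eps` term that avoids division by zero and the quadratic test problem are assumptions of this example.

```python
import numpy as np

def rmsprop_step(w, grad, mean_square, lr=0.01, decay=0.9, eps=1e-8):
    """Divide the gradient by the root of a running average of its square."""
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square

# Quadratic error with very different curvature per weight
c = np.array([100.0, 0.01])
w = np.array([1.0, 1.0])
mean_square = np.zeros(2)
for _ in range(500):
    grad = c * w  # gradient of 0.5 * sum(c * w**2)
    w, mean_square = rmsprop_step(w, grad, mean_square)
```

Both weights approach the minimum at a similar pace despite very different gradient magnitudes, since each gradient is scaled by its own running magnitude.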

  • Small datasets / bigger datasets without redundancy
    • Full-batch
      • Conjugate gradient, L-BFGS, adaptive learning rates, rprop, …
  • Big, redundant datasets
    • Mini-batches
      • Gradient descent with momentum
      • rmsprop
      • LeCun's latest recipe
  • No generic recipe:
    • Structure: Deep nets (with bottle necks), Recurrent nets, Wide shallow nets
    • Tasks: Require accurate weights, have many rare cases (words).

Hyperparameter tuning

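Andrew Ng's importance ranking (items with the same number are of similar importance):

1. Learning rate $\alpha$

2. Momentum $\beta \approx 0.9$

2. Number of hidden units

2. Mini-batch size

3. Number of layers

3. Learning rate decay

4. Adam parameters $\beta_1, \beta_2, \epsilon$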

  • Traditional method: Grid of Hyperparameter combinations
  • Better: Random values of combinations
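  • Coarse to fine (parameters which work well ⇒ focus on a smaller square around them)
  • Sampling uniformly at random: number of layers, number of units
  • $\alpha$: Sample on a log scale between 0.0001 and 0.1: $\alpha = 10^r$ with $r$ uniform in $[-4, -1]$
  • Exponentially weighted averages: $\beta$ between 0.9 and 0.999 (averaging over roughly 10 vs. 1000 values): sample $1-\beta = 10^r$ with $r$ uniform in $[-3, -1]$, i.e. $\beta = 1 - 10^r$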
  • Batch normalization is normally applied with mini-batches
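A sketch of random hyperparameter sampling, assuming a learning rate range of 0.0001 to 0.1 and a $\beta$ range of 0.9 to 0.999. The ranges for layers and hidden units and the function name are hypothetical.

```python
import numpy as np

def sample_hyperparameters(rng):
    """Draw one random hyperparameter combination."""
    r_alpha = rng.uniform(-4, -1)  # exponent for the learning rate
    r_beta = rng.uniform(-3, -1)   # exponent for 1 - beta
    return {
        "alpha": 10.0 ** r_alpha,                # log scale: 0.0001 .. 0.1
        "beta": 1.0 - 10.0 ** r_beta,            # log scale in 1 - beta: 0.9 .. 0.999
        "n_layers": int(rng.integers(2, 6)),     # uniform over 2..5 (assumed range)
        "n_hidden": int(rng.integers(50, 101)),  # uniform over 50..100 (assumed range)
    }

rng = np.random.default_rng(0)
trials = [sample_hyperparameters(rng) for _ in range(20)]
```

Sampling the exponent uniformly spends as many trials between 0.0001 and 0.001 as between 0.01 and 0.1, which a uniform sample of $\alpha$ itself would not.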

Can values be normalized in layers? $Z^{[l]}(i)...$

First mini-batch $X^{\{1\}}$: \[X^{\{1\}} \xrightarrow{W^{1}, b^{1}} Z^{1} \xrightarrow[\text{BN}]{\beta^{1}, \gamma^{1}} \tilde{Z}^{1} \rightarrow a^{1} = g^{1}(\tilde{Z}^{1}) \xrightarrow{W^{2}, b^{2}} Z^{2} \rightarrow \dots \]

\[X^{\{2\}} \mathrel{\mathop{\rightarrow}} \dots \]

$Z^l = W^l a^{l-1} + b^l$

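But batch norm subtracts the mean of $Z^l$ within each mini-batch, so a constant $b^l$ cancels out and is not needed; $\beta^l$ takes over its role: $Z^l = W^l a^{l-1}$, $Z^l_{\text{norm}} = \frac{Z^l - \mu}{\sqrt{\sigma^2 + \epsilon}}$, $\tilde{Z}^l = \gamma^l Z^l_{\text{norm}} + \beta^l$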

  • For mini-batch t:
    • Compute forward prop for $X^{\{t\}}$
      • In each hidden layer use BN to replace $Z^l$ with $\tilde{Z}^l$
    • Use backprop to compute $dW^{l}, d\beta^{l}, d\gamma^{l}$
    • Update parameters …
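The forward step of batch norm for one layer can be sketched as follows, assuming activations are stored with shape (units, batch size). The function name, the `eps` value and the layer sizes are illustrative.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Normalize z over the mini-batch, then scale by gamma and shift by beta.

    z has shape (n_units, batch_size); gamma and beta have shape (n_units, 1).
    """
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta, mu, var

rng = np.random.default_rng(0)
a_prev = rng.standard_normal((4, 32))  # activations of the previous layer
W = rng.standard_normal((3, 4))
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))

z = W @ a_prev  # no bias term b
z_tilde, mu, var = batchnorm_forward(z, gamma, beta)
# Adding a constant bias changes nothing: the mean subtraction removes it
z_tilde_with_bias, _, _ = batchnorm_forward(z + 5.0, gamma, beta)
```

The last line illustrates why $b$ is not needed: any constant added to $Z^l$ is removed again by the mean subtraction.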

Covariate shift (shifting input distribution)

  • Batch norm reduces the amount by which the distribution of hidden unit values shifts around, so the inputs to later layers become more stable
  • Slight regularization effect: Adds some noise, because it's normed on the mini batch

At test time there may be no mini-batch, only one sample at a time.

Estimate $\sigma^2, \mu$ using exponentially weighted average across mini-batches
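A sketch of how such running estimates might be kept during training and used for a single sample afterwards. The decay of 0.9 (called `momentum` here), the synthetic pre-activations and the function name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9                  # decay of the exponentially weighted average
running_mu = np.zeros((3, 1))
running_var = np.ones((3, 1))

# Training: update the running estimates with the statistics of each mini-batch
for _ in range(200):
    z = 2.0 + 3.0 * rng.standard_normal((3, 64))  # synthetic pre-activations
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

def batchnorm_inference(z, gamma, beta, eps=1e-5):
    """Normalize with the running estimates instead of mini-batch statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

z_single = np.full((3, 1), 2.0)  # a single sample
out = batchnorm_inference(z_single, gamma=np.ones((3, 1)), beta=np.zeros((3, 1)))
```

The single sample is normalized with statistics collected across mini-batches, so no batch is needed at prediction time.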

  • data_mining/neural_network/tuning.txt
  • Last modified: 2018/05/20 15:21
  • by phreazer