Gradient descent
Mini-batch gradient descent
For t = 1, …, number_of_batches:
  Vectorized forward prop on the mini-batch $X^{\{t\}}$:
    $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$, $A^{[1]} = g^{[1]}(Z^{[1]})$, …, $A^{[L]} = g^{[L]}(Z^{[L]})$
  Compute the cost $J^{\{t\}} = \frac{1}{1000} \cdot$ … (here 1000 is the mini-batch size)
  Backprop to compute the gradients of $J^{\{t\}}$
  Update the weights: $W^{[l]} = W^{[l]} - \alpha\, dW^{[l]}$, $b^{[l]} = b^{[l]} - \alpha\, db^{[l]}$
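The loop above can be sketched in NumPy. This is a minimal illustration, not the notes' full L-layer network: it assumes a single linear layer with a squared-error cost, and the synthetic data, batch size, and learning rate are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, m, batch_size, alpha = 3, 1000, 100, 0.1

X = rng.standard_normal((n_features, m))   # columns are training examples
true_W = rng.standard_normal((1, n_features))
Y = true_W @ X                             # targets from a known linear map

W = np.zeros((1, n_features))
b = np.zeros((1, 1))

for epoch in range(50):
    for t in range(m // batch_size):              # t = 1, ..., number_of_batches
        Xt = X[:, t*batch_size:(t+1)*batch_size]  # mini-batch X^{t}
        Yt = Y[:, t*batch_size:(t+1)*batch_size]
        Z = W @ Xt + b                            # vectorized forward prop
        dZ = Z - Yt
        J = np.mean(dZ ** 2) / 2                  # cost J^{t} over this batch
        dW = dZ @ Xt.T / batch_size               # backprop for the linear layer
        db = dZ.mean(axis=1, keepdims=True)
        W -= alpha * dW                           # gradient step on the weights
        b -= alpha * db
```

Because each parameter update uses only one mini-batch, the cost per batch $J^{\{t\}}$ fluctuates from batch to batch even while the overall trend decreases.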
Exponentially weighted averages
$V_t = \beta V_{t-1} + (1-\beta) \Theta_t$
$\beta = 0.98$ gives a smoother curve than a smaller value such as $\beta = 0.9$, since it averages over roughly $1/(1-\beta) = 50$ previous values, at the cost of lagging further behind the data.
Gradient Descent with Momentum
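The notes give only the heading here. As a sketch, the usual momentum update applies the exponentially weighted average above to the gradients: $V_{dW} = \beta V_{dW} + (1-\beta)\, dW$, then $W = W - \alpha V_{dW}$. The quadratic toy cost and hyperparameters below are illustrative assumptions.

```python
import numpy as np

# Gradient descent with momentum on an ill-conditioned quadratic
# J(w) = 0.5 * w^T A w, whose gradient is A @ w.
A = np.diag([1.0, 25.0])
w = np.array([1.0, 1.0])
v = np.zeros_like(w)
alpha, beta = 0.05, 0.9

for _ in range(300):
    dw = A @ w                        # gradient of the quadratic cost
    v = beta * v + (1 - beta) * dw    # exponentially weighted average of gradients
    w = w - alpha * v                 # step along the smoothed direction
```

Averaging the gradients damps the oscillation along the steep axis (eigenvalue 25) while the consistent component along the shallow axis accumulates, which is the usual motivation for momentum.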
RMSprop
Root mean square propagation: keep an exponentially weighted average of the *squared* gradients and divide each gradient by its square root before updating, so parameters with large, oscillating gradients take smaller steps.
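A sketch of the update, $S_{dW} = \beta S_{dW} + (1-\beta)\, dW^2$ followed by $W = W - \alpha\, dW / (\sqrt{S_{dW}} + \varepsilon)$, on the same kind of ill-conditioned quadratic. The cost, hyperparameters, and $\varepsilon$ are illustrative assumptions.

```python
import numpy as np

# RMSprop on the quadratic J(w) = 0.5 * w^T A w with gradient A @ w.
A = np.diag([1.0, 25.0])
w = np.array([1.0, 1.0])
s = np.zeros_like(w)
alpha, beta, eps = 0.01, 0.9, 1e-8

for _ in range(500):
    dw = A @ w                           # gradient
    s = beta * s + (1 - beta) * dw ** 2  # running average of squared gradients
    w = w - alpha * dw / (np.sqrt(s) + eps)  # per-parameter rescaled step
```

Dividing by $\sqrt{S_{dW}}$ makes the effective step size roughly uniform across parameters, so the steep direction no longer dominates; near the optimum the iterates hover in a small band of width on the order of $\alpha$, which is why RMSprop is often combined with a decaying learning rate.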