Gradient descent
Mini batch gradient descent
For t=1, …, number_of_batches:
Vectorized forward prop on $X^{t}$:
$Z^{[1]} = W^{[1]} X^{t} + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
...
$A^{[L]} = g^{[L]}(Z^{[L]})$
Compute cost $J^{[t]}$ = 1/1000 * ...
Backprop to compute gradients for $J^{[t]}$
Update weights: $W^{[l]} = W^{[l]} - \alpha d W^{[l]}$; $b^{[l]} = b^{[l]} - \alpha d b^{[l]}$
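The loop above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the notes: a one-parameter linear model $y \approx w x + b$ stands in for the network, the cost is the mean squared error over the batch, and the function name `minibatch_gd`, the batch size, the learning rate and the epoch count are all made up for the example.

```python
import random

def minibatch_gd(xs, ys, batch_size=2, alpha=0.05, epochs=500, seed=0):
    """Fit y ~ w*x + b with mini-batch gradient descent (illustrative sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # regroup the examples into new batches each epoch
        # For t = 1, ..., number_of_batches
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            m = len(batch)
            # Forward prop on the batch: prediction error per example
            errors = [(w * x + b) - y for x, y in batch]
            # Gradients of the assumed cost J = 1/(2m) * sum(error^2)
            dw = sum(e * x for e, (x, _) in zip(errors, batch)) / m
            db = sum(errors) / m
            # Update weights
            w -= alpha * dw
            b -= alpha * db
    return w, b

# Toy data generated from y = 2x + 1 (assumed for the example)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_gd(xs, ys)
print(w, b)
```

Each pass of the inner loop performs one weight update per batch, so a single sweep over the data gives several updates instead of one.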
Exponentially weighted averages
$V_t = \beta V_{t-1} + (1-\beta) \Theta_t$
$\beta = 0.98$ gives a smoother curve than $\beta = 0.5$: $V_t$ averages over roughly $1/(1-\beta)$ days, about 50 days for $\beta = 0.98$ but only 2 days for $\beta = 0.5$
$V_\Theta = 0$
$V_\Theta = \beta V + (1-\beta) \Theta_1$
$V_\Theta = \beta V + (1-\beta) \Theta_2$
…
Cheap way to compute a running average over long periods: only the latest $V$ is stored and updated in place
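A minimal sketch of the update $V_t = \beta V_{t-1} + (1-\beta) \Theta_t$, starting from $V = 0$. The function name `ewa` and the sample values are assumptions made for the example.

```python
def ewa(thetas, beta):
    """Exponentially weighted average of a sequence, starting from V = 0."""
    v = 0.0
    history = []
    for theta in thetas:
        # V = beta * V + (1 - beta) * Theta_t, overwriting the single stored value
        v = beta * v + (1 - beta) * theta
        history.append(v)
    return history

result = ewa([10.0, 10.0, 10.0], beta=0.5)
print(result)  # [5.0, 7.5, 8.75]
```

Even with a constant input of 10, the first values sit well below 10 because the average starts from zero.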
Bias correction
$V_0 = 0$
$V_1 = \beta V_0 + 0.02 \Theta_1$ (with $\beta = 0.98$, so $1-\beta = 0.02$)
$V_2 = \beta V_1 + 0.02 \Theta_2$
…
Corrected:
$\frac{V_t}{1-\beta^t}$
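The correction can be sketched as follows, dividing each running value by $1-\beta^t$ with $t$ counted from 1. The function name `ewa_bias_corrected` and the constant input are assumptions made for the example.

```python
def ewa_bias_corrected(thetas, beta):
    """Exponentially weighted average with bias correction V_t / (1 - beta^t)."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta       # uncorrected running average
        corrected.append(v / (1 - beta ** t))   # rescale to undo the zero start
    return corrected

values = ewa_bias_corrected([10.0] * 5, beta=0.98)
print(values)
```

For a constant input the corrected average matches the input from the first step on (up to floating-point rounding), whereas the uncorrected $V_1$ would be only $0.02 \cdot 10 = 0.2$.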
Gradient Descent with Momentum
RMSprop
Root mean squared