Gradient descent

For t=1, …, number_of_batches:

Vectorized forward prop on mini-batch $X^{\{t\}}$
  $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
  $A^{[1]} = g^{[1]}(Z^{[1]})$
  ...
  $A^{[L]} = g^{[L]}(Z^{[L]})$
Compute cost $J^{\{t\}}$ = 1/1000 * ... (the factor 1/1000 assumes 1000 examples per mini-batch)
Backprop to compute gradients of $J^{\{t\}}$
Update weights for every layer $l$: $W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$; $b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$
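
The loop above can be sketched in plain Python. This is only a minimal illustration under loud assumptions: the network is reduced to a single linear layer ($L = 1$, identity activation) with a squared-error cost, and the function name `minibatch_gradient_descent`, the toy data, and all hyperparameter values are hypothetical.

```python
import random

def minibatch_gradient_descent(xs, ys, batch_size=4, alpha=0.1, epochs=300, seed=0):
    """Fit y = w*x + b with mini-batch gradient descent (assumed toy model)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new partition into mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            m = len(batch)
            # Forward prop on the mini-batch: A = w * X + b
            preds = [w * xs[i] + b for i in batch]
            # Backprop for the cost J = 1/(2m) * sum((a - y)^2)
            dw = sum((a - ys[i]) * xs[i] for a, i in zip(preds, batch)) / m
            db = sum(a - ys[i] for a, i in zip(preds, batch)) / m
            # Update step: W = W - alpha * dW, b = b - alpha * db
            w -= alpha * dw
            b -= alpha * db
    return w, b

# Noise-free toy data generated from y = 2x + 1
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_gradient_descent(xs, ys)
print(w, b)
```

Each pass of the inner loop is one value of t: the parameters are updated after every mini-batch rather than once per pass over the full training set.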

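Exponentially weighted average of a sequence $\Theta_1, \Theta_2, \ldots$:

$V_t = \beta V_{t-1} + (1-\beta) \Theta_t$

$\beta = 0.98$ gives a smoother curve than $\beta = 0.5$: $V_t$ averages over roughly $1/(1-\beta)$ values, i.e. about 50 days in the former case and 2 days in the latter.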

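Implementation with a single variable that is overwritten at every step:

$V_\Theta = 0$

$V_\Theta = \beta V_\Theta + (1-\beta) \Theta_1$

$V_\Theta = \beta V_\Theta + (1-\beta) \Theta_2$

This is a cheap way to compute averages over longer periods: only one value has to be kept in memory.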
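
A minimal sketch of the running average with a single overwritten accumulator. The function names `exponentially_weighted_average` and `effective_window` and the sample values in `temps` are hypothetical; the $1/(1-\beta)$ window is the usual rule of thumb, not an exact cutoff.

```python
def exponentially_weighted_average(values, beta=0.9):
    """Return the running averages V_1..V_n, starting from V_0 = 0."""
    v = 0.0  # V_0 = 0
    averages = []
    for theta in values:
        # Overwrite the single accumulator: V = beta * V + (1 - beta) * theta
        v = beta * v + (1.0 - beta) * theta
        averages.append(v)
    return averages

def effective_window(beta):
    """Rule of thumb: the average spans about 1 / (1 - beta) values."""
    return 1.0 / (1.0 - beta)

temps = [10.0, 12.0, 11.0, 13.0]  # hypothetical daily measurements
print(exponentially_weighted_average(temps, beta=0.5))  # [5.0, 8.5, 9.75, 11.375]
print(effective_window(0.5))                            # 2.0
```

The first averages lie well below the data because the accumulator starts at zero.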

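Bias correction: because $V_0 = 0$, the first averages are biased towards zero (example with $\beta = 0.98$):

$V_0 = 0$

$V_1 = \beta V_0 + 0.02 \Theta_1 = 0.02 \Theta_1$

$V_2 = \beta V_1 + 0.02 \Theta_2 = 0.0196 \Theta_1 + 0.02 \Theta_2$

Bias-corrected estimate: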

$\frac{V_t}{1-\beta^t}$
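
The correction can be sketched as follows. The function name `bias_corrected_average` and the constant test sequence are hypothetical; the code only applies the division by $1-\beta^t$ to the running average.

```python
def bias_corrected_average(values, beta=0.98):
    """Return V_t / (1 - beta**t) for t = 1..n, with V_0 = 0."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * theta      # uncorrected average V_t
        corrected.append(v / (1.0 - beta ** t))  # bias-corrected estimate
    return corrected

constant = [40.0] * 5  # hypothetical constant signal
print(bias_corrected_average(constant))
```

For a constant signal the corrected value matches the signal from the first step on, whereas the uncorrected $V_1$ would only be 2% of it. For large t the factor $1-\beta^t$ approaches 1 and the correction fades out.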

Momentum: compute an exponentially weighted average of the gradients and use it, instead of the raw gradient, to update the weights.

$V_{dW} = \beta V_{dW} + (1-\beta) dW$

$V_{db} = \beta V_{db} + (1-\beta) db$

$W = W - \alpha V_{dW}$

$b = b - \alpha V_{db}$

Analogy (ball rolling down a bowl): $\beta$ acts as friction, $V_{dW}$ and $V_{db}$ as velocity, $dW$ and $db$ as acceleration.

Often $\beta = 0.9$ is used.
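
A minimal sketch of these update rules for scalar parameters. The helper `momentum_update`, the elongated quadratic cost, the starting point, and the value of `alpha` are assumptions chosen only for illustration.

```python
def momentum_update(param, grad, velocity, alpha=0.05, beta=0.9):
    """One momentum step for a single scalar parameter."""
    velocity = beta * velocity + (1.0 - beta) * grad  # V = beta*V + (1-beta)*d
    param = param - alpha * velocity                  # param = param - alpha*V
    return param, velocity

# Hypothetical elongated bowl: J(w, b) = 0.5 * (w**2 + 25 * b**2)
w, b = 5.0, 1.0
v_dw, v_db = 0.0, 0.0
for _ in range(500):
    dw, db = w, 25.0 * b  # gradients of J at the current point
    w, v_dw = momentum_update(w, dw, v_dw)
    b, v_db = momentum_update(b, db, v_db)
print(w, b)
```

Both gradients are computed before either parameter is updated, so each step uses the gradient at the same point.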

RMSprop (root mean square prop)

Goal: slow down movement in the vertical direction (b direction) and speed it up in the horizontal direction (w direction), because the bowl is wider than it is high.

Compute dW, db on minibatch

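$s_{dW} = \beta s_{dW} + (1-\beta) dW^2$ (squared element-wise); relatively small, so the update of W is divided by a small number

$s_{db} = \beta s_{db} + (1-\beta) db^2$ (squared element-wise); relatively large, so the update of b is damped

$W = W - \alpha \, dW/\sqrt{s_{dW}}$; $b = b - \alpha \, db/\sqrt{s_{db}}$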
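
A minimal sketch of one such update for a scalar parameter. The helper name `rmsprop_update` and the hyperparameter values are hypothetical. The small constant `eps` in the denominator is an assumption that is not part of the formula above; it is commonly added to avoid division by zero.

```python
import math

def rmsprop_update(param, grad, s, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop-style step for a single scalar parameter."""
    s = beta * s + (1.0 - beta) * grad ** 2              # running average of squared gradient
    param = param - alpha * grad / (math.sqrt(s) + eps)  # eps is an added assumption
    return param, s

# A steep direction (large gradient) and a flat one (small gradient)
b_new, s_db = rmsprop_update(1.0, 100.0, 0.0)
w_new, s_dw = rmsprop_update(1.0, 0.01, 0.0)
print(1.0 - b_new, 1.0 - w_new)
```

Starting from $s = 0$, both directions move by almost the same amount ($\alpha / \sqrt{1-\beta}$) on the first step, even though the gradients differ by four orders of magnitude: the steep direction is damped and the flat one is sped up relative to plain gradient descent.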

  • Last modified: 2018/05/12 21:56
  • by phreazer