====== Gradient descent ======

===== Mini-batch gradient descent =====

For t = 1, ..., number_of_batches:
  * Vectorized forward prop on $X^{\{t\}}$:
    * $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
    * $A^{[1]} = g^{[1]}(Z^{[1]})$
    * ...
    * $A^{[L]} = g^{[L]}(Z^{[L]})$
  * Compute the cost $J^{\{t\}} = \frac{1}{1000} \cdot ...$ (1000 = mini-batch size in this example)
  * Backprop to compute the gradients of $J^{\{t\}}$
  * Update the weights: $W^{[l]} = W^{[l]} - \alpha\, dW^{[l]}$, $b^{[l]} = b^{[l]} - \alpha\, db^{[l]}$

===== Exponentially weighted averages =====

$V_t = \beta V_{t-1} + (1-\beta) \Theta_t$

$\beta = 0.98$ gives a smoother curve than $\beta = 0.5$: the average effectively runs over $\approx \frac{1}{1-\beta}$ values, i.e. about 50 days versus only 2 days in the latter case (temperature example).

Iteratively:

$V_0 = 0$
$V_1 = \beta V_0 + (1-\beta) \Theta_1$
$V_2 = \beta V_1 + (1-\beta) \Theta_2$
...

Cheap method (only one number of state) to compute averages over longer periods.

==== Bias correction ====

With $V_0 = 0$ and $\beta = 0.98$, the first values come out far too small:

$V_1 = \beta V_0 + 0.02\, \Theta_1 = 0.02\, \Theta_1$
$V_2 = \beta V_1 + 0.02\, \Theta_2$
...

Corrected estimate: $\frac{V_t}{1-\beta^t}$ (the correction only matters for small $t$, since $\beta^t \to 0$).

===== Gradient Descent with Momentum =====

Idea: compute an exponentially weighted average of the gradients and use it to update the weights.

$V_{dW} = \beta V_{dW} + (1-\beta)\, dW$
$V_{db} = \beta V_{db} + (1-\beta)\, db$
$W = W - \alpha V_{dW}$
$b = b - \alpha V_{db}$

Physical analogy: $\beta$ acts like friction, $V_{dW}$/$V_{db}$ like velocity, $dW$/$db$ like acceleration. Often $\beta = 0.9$ is used.

===== RMSprop =====

Root mean square prop.

Goal: slow down movement in the vertical direction (the $b$ direction in the example) and speed it up in the horizontal direction (the $W$ direction); the bowl-shaped cost surface is much wider than it is high.

Compute $dW$, $db$ on the mini-batch:

$s_{dW} = \beta s_{dW} + (1-\beta)\, dW^2$ (element-wise square; $s_{dW}$ is small in the example)
$s_{db} = \beta s_{db} + (1-\beta)\, db^2$ ($s_{db}$ is large)
$W = W - \alpha \frac{dW}{\sqrt{s_{dW}}}$ (same for $b$; in practice a small $\epsilon$ is added to the denominator for numerical stability)

===== Adam =====

Adaptive moment estimation: momentum + RMSprop + bias correction.

  * $\alpha$: to be tuned
  * $\beta_1$: 0.9
  * $\beta_2$: 0.999
  * $\epsilon$: $10^{-8}$

===== Learning rate decay =====

$\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \alpha_0$ or $\alpha = 0.95^{\text{epoch\_num}} \alpha_0$

===== Saddle points =====

In high-dimensional spaces it is far more likely to end up at a saddle point than in a local optimum: with e.g. 20,000 parameters it is highly unlikely that the cost curves upwards in every direction at once, which is what getting stuck in a local minimum would require. Plateaus (regions where the gradient stays close to zero for a long time) make learning slow.
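
A minimal NumPy sketch of the mini-batch loop from the first section, for a single sigmoid layer (the data, shapes, and helper names are invented for illustration, not part of the notes):

<code python>
# One epoch of mini-batch gradient descent for a one-layer sigmoid network.
import numpy as np

rng = np.random.default_rng(0)
n_x, m, batch_size, alpha = 10, 5000, 1000, 0.01
X = rng.standard_normal((n_x, m))             # features, shape (n_x, m)
Y = (rng.random((1, m)) > 0.5).astype(float)  # labels,   shape (1, m)
W, b = rng.standard_normal((1, n_x)) * 0.01, np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(m // batch_size):              # t = 1, ..., number_of_batches
    Xt = X[:, t * batch_size:(t + 1) * batch_size]
    Yt = Y[:, t * batch_size:(t + 1) * batch_size]
    # Vectorized forward prop on the mini-batch X{t}
    Z = W @ Xt + b
    A = sigmoid(Z)
    # Cost J{t}: average cross-entropy over the 1000 examples of this batch
    J = -np.mean(Yt * np.log(A + 1e-8) + (1 - Yt) * np.log(1 - A + 1e-8))
    # Backprop for this mini-batch
    dZ = A - Yt
    dW = dZ @ Xt.T / batch_size
    db = np.mean(dZ, axis=1, keepdims=True)
    # Gradient step
    W -= alpha * dW
    b -= alpha * db
</code>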
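
A small sketch of the exponentially weighted average with and without bias correction, mirroring the formulas above (the noisy series is made up for illustration):

<code python>
# Exponentially weighted average of a series theta, with bias correction.
import numpy as np

theta = np.sin(np.linspace(0, 6, 100)) + 0.3 * np.random.default_rng(0).standard_normal(100)
beta = 0.98                                # averages over roughly 1 / (1 - beta) = 50 values

v = 0.0
raw, corrected = [], []
for t, theta_t in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * theta_t    # V_t = beta * V_{t-1} + (1 - beta) * theta_t
    raw.append(v)                          # biased towards 0 for small t
    corrected.append(v / (1 - beta ** t))  # bias-corrected estimate V_t / (1 - beta^t)
</code>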
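
The momentum, RMSprop, and Adam update rules from the sections above, written as plain NumPy update steps; `dW` is the mini-batch gradient from a loop like the one sketched earlier, the state arrays start at zero, and the same updates apply to `b`. Default values follow the hyperparameters listed in the notes.

<code python>
# Parameter updates for momentum, RMSprop, and Adam (sketch).
import numpy as np

def momentum_step(W, dW, v, alpha=0.01, beta=0.9):
    v = beta * v + (1 - beta) * dW             # exponentially weighted average of gradients
    return W - alpha * v, v

def rmsprop_step(W, dW, s, alpha=0.01, beta=0.999, eps=1e-8):
    s = beta * s + (1 - beta) * dW ** 2        # element-wise squared gradients
    return W - alpha * dW / (np.sqrt(s) + eps), s

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dW           # momentum term (first moment)
    s = beta2 * s + (1 - beta2) * dW ** 2      # RMSprop term (second moment)
    v_hat = v / (1 - beta1 ** t)               # bias correction, t = 1, 2, ...
    s_hat = s / (1 - beta2 ** t)
    return W - alpha * v_hat / (np.sqrt(s_hat) + eps), v, s
</code>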
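
The two learning-rate decay schedules above as small helper functions (names are illustrative):

<code python>
def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    return base ** epoch_num * alpha0
</code>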