===== Exponentially weighted averages =====

...

Corrected:

$\frac{V_t}{1-\beta^t}$
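For example, with $\beta = 0.9$ the correction factor at $t = 2$ is $1 - 0.9^2 = 0.19$, which scales up the early averages that start from $V_0 = 0$; as $t$ grows, the factor approaches 1 and the correction fades out.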
===== Gradient Descent with Momentum =====

Idea: Compute an exponentially weighted average of the gradients and use it to update the weights.

$V_{dW} = \beta V_{dW} + (1-\beta) dW$

$V_{db} = \beta V_{db} + (1-\beta) db$

$W = W - \alpha V_{dW}$

$b = b - \alpha V_{db}$

Physical analogy: $\beta$ acts like friction, $V_{dW}$ like velocity, and $dW$ like acceleration.

Often $\beta = 0.9$ is used.
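
A minimal NumPy sketch of one momentum update for a single layer; the function name, the parameter shapes, and the $\alpha$ default are illustrative assumptions:

<code python>
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    # Exponentially weighted average of the gradients ("velocity")
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    # Step in the direction of the smoothed gradient
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db

# Velocities start at zero with the same shapes as the parameters
W, b = np.random.randn(3, 2), np.zeros(3)
v_dW, v_db = np.zeros_like(W), np.zeros_like(b)
</code>
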
===== RMSprop =====

Root mean square propagation.

Goal: Slow down movement in the vertical direction (the $b$ direction) and speed it up in the horizontal direction (the $W$ direction), since the cost surface is a bowl that is wider than it is high.

Compute $dW$, $db$ on the current mini-batch:

$s_{dW} = \beta s_{dW} + (1-\beta) dW^2$ (element-wise square; $s_{dW}$ is small)

$s_{db} = \beta s_{db} + (1-\beta) db^2$ ($s_{db}$ is large)

$W = W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}$

$b = b - \alpha \frac{db}{\sqrt{s_{db}} + \epsilon}$

Dividing by the large $\sqrt{s_{db}}$ damps the vertical oscillations, while dividing by the small $\sqrt{s_{dW}}$ speeds up horizontal progress; $\epsilon$ (e.g. $10^{-8}$) avoids division by zero.
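
A sketch of one RMSprop step mirroring the updates above; the function name and hyperparameter defaults are assumptions:

<code python>
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, alpha=0.001, beta=0.9, eps=1e-8):
    # Exponentially weighted average of the element-wise squared gradients
    s_dW = beta * s_dW + (1 - beta) * dW ** 2
    s_db = beta * s_db + (1 - beta) * db ** 2
    # Large s -> small effective step (vertical), small s -> large step (horizontal)
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
</code>
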
===== Adam =====

Adaptive moment estimation.

Momentum + RMSprop + bias correction.

Typical hyperparameters:

  * $\alpha$: to be tuned
  * $\beta_1 = 0.9$
  * $\beta_2 = 0.999$
  * $\epsilon = 10^{-8}$
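A sketch of one Adam step for a single parameter tensor, combining the momentum-style and RMSprop-style moments with bias correction; defaults follow the list above, the rest is an illustrative assumption:

<code python>
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum-style first moment and RMSprop-style second moment
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * dW ** 2
    # Bias correction; t is the 1-based update counter
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Combined update
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
</code>
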
===== Learning rate decay =====

$\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \alpha_0$

or exponential decay:

$\alpha = 0.95^{\text{epoch\_num}} \cdot \alpha_0$
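Both schedules as plain Python helpers; the function names are made up for illustration:

<code python>
def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    # alpha = alpha0 / (1 + decay_rate * epoch_num)
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    # alpha = base**epoch_num * alpha0
    return alpha0 * base ** epoch_num

# e.g. alpha0 = 0.2, decay_rate = 1: epoch 1 -> 0.1, epoch 3 -> 0.05
</code>
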
===== Saddle points =====

In high-dimensional spaces it is more likely to end up at a saddle point than in a local optimum: with e.g. 20,000 parameters it is highly unlikely that the loss curves upwards in every direction at once, so getting stuck in a local minimum is rare. Plateaus, however, make learning slow.