===== Gradient Descent with Momentum =====
$W = W - \alpha V_{dW}$

$b = b - \alpha V_{db}$

$\beta$ friction; $V_{db}$ velocity; $db$ acceleration.
+ | |||
+ | Often $\beta = 0.9$ is used. | ||
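A minimal NumPy-style sketch of one momentum step, assuming the standard exponentially weighted velocity update $V_{dW} = \beta V_{dW} + (1-\beta) dW$ (function name and defaults are illustrative, not from this page):

<code python>
def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    # Velocities: exponentially weighted averages of the gradients.
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    # Step along the smoothed gradient instead of the raw one.
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db
</code>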
===== RMSprop =====
Root mean square propagation.

Goal: slow down updates in the vertical direction (the $b$ direction) and speed them up in the horizontal direction (the $W$ direction); the bowl-shaped loss surface is wider than it is high.

On each mini-batch, compute $dW$, $db$, then:

$s_{dW} = \beta s_{dW} + (1-\beta) \, dW^2$ (element-wise square; small in the $W$ direction)

$s_{db} = \beta s_{db} + (1-\beta) \, db^2$ (large in the $b$ direction)

$W = W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}$

$b = b - \alpha \frac{db}{\sqrt{s_{db}} + \epsilon}$

A small $\epsilon$ (e.g. $10^{-8}$) in the denominator guards against division by zero.
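A minimal NumPy sketch of one RMSprop step (function name and defaults are illustrative):

<code python>
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, alpha=0.001, beta=0.9, eps=1e-8):
    # Exponentially weighted averages of the element-wise squared gradients.
    s_dW = beta * s_dW + (1 - beta) * dW ** 2
    s_db = beta * s_db + (1 - beta) * db ** 2
    # Divide by the root mean square: large s_db damps the oscillating
    # b direction, small s_dW speeds up the flat W direction.
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
</code>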
===== Adam =====

Adaptive moment estimation.

Momentum + RMSprop + bias correction.
+ | |||
+ | * $\alpha$: to be tuned | ||
+ | * $\beta_1$: 0.9 | ||
+ | * $\beta_2$: 0.999 | ||
+ | * $\sigma$: $10^{-8}$ | ||
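A minimal NumPy sketch of one Adam step for a single parameter, combining the momentum and RMSprop terms with bias correction; $t$ is the time step starting at 1 (names are illustrative):

<code python>
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (momentum term) and second moment (RMSprop term).
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * dW ** 2
    # Bias correction compensates for the zero initialization of v and s.
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
</code>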
+ | |||
+ | ===== Learning rate decay ===== | ||
+ | |||
+ | $\alpha = \frac{1}{1+ \text{decay_rate} * \text{epoch_num}} \alpha_0$ | ||
+ | |||
+ | or | ||
+ | |||
+ | $\alpha = 0, | ||
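A minimal sketch of both schedules (pure Python; the constants mirror the formulas above and are illustrative):

<code python>
def decayed_lr(alpha0, epoch_num, decay_rate=1.0):
    # 1/t-style decay: alpha shrinks as the epoch number grows.
    return alpha0 / (1 + decay_rate * epoch_num)

def exp_decayed_lr(alpha0, epoch_num, base=0.95):
    # Exponential decay: alpha is multiplied by `base` once per epoch.
    return base ** epoch_num * alpha0

# e.g. alpha0 = 0.2, decay_rate = 1: epoch 1 -> 0.1, epoch 2 -> ~0.067
</code>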
+ | |||
+ | |||
+ | ===== Saddle points ===== | ||
+ | |||
+ | In high-dimensional spaces it's more likely to end up at a saddle point (than in local optima). E.g. 20000 parameter, highly unlikely that it's a local minimum you get stuck. Plateus make learning slow. |