$W = W - \alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon}$
===== Adam =====

Adam (adaptive moment estimation) combines Momentum + RMSprop + bias correction.

Recommended default hyperparameters (a one-step update sketch follows the list):

  * $\alpha$: to be tuned
  * $\beta_1$: 0.9
  * $\beta_2$: 0.999
  * $\epsilon$: $10^{-8}$
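
A minimal NumPy sketch of a single Adam update for one parameter array; the function name ''adam_update'' and the state variables ''v'', ''s'', ''t'' are illustrative, not from the original notes:

<code python>
import numpy as np

def adam_update(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameters w with gradient dw.

    v: exponentially weighted average of gradients (momentum part)
    s: exponentially weighted average of squared gradients (RMSprop part)
    t: 1-based step counter, used for bias correction
    """
    v = beta1 * v + (1 - beta1) * dw          # momentum term
    s = beta2 * s + (1 - beta2) * dw ** 2     # RMSprop term
    v_corr = v / (1 - beta1 ** t)             # bias correction
    s_corr = s / (1 - beta2 ** t)
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return w, v, s
</code>

In use, ''v'' and ''s'' start as zero arrays of the same shape as ''w'', and ''t'' is incremented on every mini-batch step.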
===== Learning rate decay =====

$\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \alpha_0$

or

$\alpha = 0.95^{\text{epoch\_num}} \alpha_0$
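
A small sketch of both schedules; the function names and the defaults for ''decay_rate'' and ''base'' are illustrative assumptions:

<code python>
def lr_inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    # 1/t-style decay: alpha shrinks as the epoch number grows
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    # exponential decay variant
    return base ** epoch_num * alpha0
</code>

For example, ''lr_inverse_decay(0.2, 3)'' returns 0.05, i.e. the learning rate has been quartered after three epochs.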

===== Saddle points =====

In high-dimensional spaces it is much more likely to end up at a saddle point than in a local optimum: with e.g. 20,000 parameters it is highly unlikely that a zero-gradient point is a minimum in every single dimension, so getting stuck in a local minimum is rare. Plateaus, however, do make learning slow.
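
A standard two-dimensional illustration (not from the original notes): $f(x, y) = x^2 - y^2$ has $\nabla f = 0$ at the origin, but the origin is a minimum along $x$ and a maximum along $y$, i.e. a saddle point. For a zero-gradient point of a 20,000-parameter function to be a local minimum, the curvature would have to be positive in all 20,000 directions at once, which is why saddle points and plateaus dominate over local minima.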