  * $\beta_1$: 0.9
  * $\beta_2$: 0.999
  * $\sigma$: $10^{-8}$
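The defaults above can be put into a minimal sketch of one Adam update step. The function and variable names (`adam_step`, `v`, `s`, `t`) are assumptions for illustration; only the hyperparameter values come from the list above.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, sigma=1e-8):
    """One Adam update for parameter w with gradient dw.

    v: running first-moment estimate (momentum-style average)
    s: running second-moment estimate (RMSprop-style average)
    t: 1-based iteration counter, used for bias correction
    """
    v = beta1 * v + (1 - beta1) * dw          # exponentially weighted gradient
    s = beta2 * s + (1 - beta2) * dw ** 2     # exponentially weighted squared gradient
    v_corr = v / (1 - beta1 ** t)             # bias correction for early iterations
    s_corr = s / (1 - beta2 ** t)
    w = w - alpha * v_corr / (np.sqrt(s_corr) + sigma)  # sigma avoids division by zero
    return w, v, s
```

Running a few hundred steps on a simple quadratic (gradient of $w^2$ is $2w$) drives `w` toward the minimum at 0.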

===== Learning rate decay =====

$\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \alpha_0$

or

$\alpha = 0.95^{\text{epoch\_num}} \alpha_0$ (exponential decay)

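The two schedules can be sketched directly from the formulas. The function names are assumptions; `alpha0` is the initial learning rate, and 0.95 is taken as an example base for the exponential schedule.

```python
def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = 1 / (1 + decay_rate * epoch_num) * alpha0"""
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base^epoch_num * alpha0"""
    return base ** epoch_num * alpha0

# Both schedules start at alpha0 and shrink as epoch_num grows:
for epoch in range(4):
    print(epoch, lr_inverse_decay(0.2, 1.0, epoch), lr_exponential_decay(0.2, epoch))
```

Inverse decay falls off quickly at first (halved after one epoch with `decay_rate = 1`), while exponential decay shrinks by a constant factor per epoch.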

===== Saddle points =====

In high-dimensional spaces it is more likely to end up at a saddle point than in a local optimum: with e.g. 20,000 parameters, it is highly unlikely that the loss curves upward in every single direction at once, which is what a local minimum would require. Plateaus, however, make learning slow.