Differences
This shows you the differences between two versions of the page.
data_mining:neural_network:gradient_descent [2018/05/12 20:12] – [Adam] phreazer | data_mining:neural_network:gradient_descent [2018/05/12 20:21] (current) – [Learning rate decay] phreazer | ||
---|---|---|---|
Line 80: | Line 80: | ||
* $\beta_2$: 0.999 | * $\beta_2$: 0.999 | ||
* $\epsilon$: $10^{-8}$ | * $\epsilon$: $10^{-8}$ | ||
+ | |||
+ | ===== Learning rate decay ===== | ||
+ | |||
+ | $\alpha = \frac{1}{1 + \text{decay_rate} \cdot \text{epoch_num}} \alpha_0$ | ||
+ | |||
+ | or | ||
+ | |||
+ | $\alpha = 0, | ||
+ | |||
+ | |||
+ | ===== Saddle points ===== | ||
+ | |||
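+ | In high-dimensional spaces, a point where the gradient is zero is much more likely to be a saddle point than a local optimum. E.g. with 20000 parameters, it is highly unlikely that the point where you get stuck is a local minimum, because the cost would have to curve upward in every one of those directions. Plateaus (regions where the gradient stays close to zero) make learning slow. |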
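
The first formula in the Learning rate decay section can be sketched in a few lines of Python. This is only a minimal illustration: the function name `decayed_learning_rate` and the example values ($\alpha_0 = 0.2$, decay rate $1.0$) are assumptions for demonstration and do not come from the page.

```python
def decayed_learning_rate(alpha_0, decay_rate, epoch_num):
    """Return alpha = alpha_0 / (1 + decay_rate * epoch_num)."""
    return alpha_0 / (1.0 + decay_rate * epoch_num)


# Assumed example values, chosen only for illustration.
alpha_0 = 0.2
decay_rate = 1.0

# Learning rate for the first four epochs (epoch_num = 0, 1, 2, 3).
schedule = [decayed_learning_rate(alpha_0, decay_rate, epoch) for epoch in range(4)]

for epoch, alpha in enumerate(schedule):
    print(f"epoch {epoch}: alpha = {alpha:.4f}")
```

With these values the learning rate starts at $\alpha_0$ in epoch 0 and shrinks every epoch; a larger decay rate makes it shrink faster.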