===== Learning rate decay =====

$\alpha = \frac{1}{1 + \text{decay_rate} \cdot \text{epoch_num}} \alpha_0$

or

$\alpha = 0.95^{\text{epoch_num}} \alpha_0$
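A minimal sketch of both schedules, assuming illustrative values $\alpha_0 = 0.1$ and decay_rate $= 1.0$ (the function names are hypothetical, not from any library):

<code python>
def inverse_decay(alpha_0, decay_rate, epoch_num):
    # alpha = 1 / (1 + decay_rate * epoch_num) * alpha_0
    return alpha_0 / (1.0 + decay_rate * epoch_num)

def exponential_decay(alpha_0, epoch_num, base=0.95):
    # alpha = 0.95^epoch_num * alpha_0
    return (base ** epoch_num) * alpha_0

# Both schedules shrink the learning rate as epoch_num grows.
for epoch in range(5):
    print(epoch,
          round(inverse_decay(0.1, 1.0, epoch), 4),
          round(exponential_decay(0.1, epoch), 4))
</code>

Inverse decay falls off quickly in early epochs and then flattens, while exponential decay shrinks by a constant factor per epoch; which works better depends on the problem.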
===== Saddle points =====

In high-dimensional spaces it is more likely to end up at a saddle point than at a local optimum: with e.g. 20,000 parameters it is highly unlikely that the loss curves upward in every direction at once, so getting stuck in a local minimum is rare. Plateaus, however, make learning slow.