Differences

This shows you the differences between two versions of the page.

--- data_mining:neural_network:gradient_descent [2018/05/12 21:55] – [RMSprop] phreazer
+++ data_mining:neural_network:gradient_descent [2018/05/12 22:21] (current) – [Learning rate decay] phreazer
@@ Line 64: / Line 64: @@
 Compute dW, db on minibatch
 $s_{dW} = \beta s_{dW} + (1-\beta) d W^2$ (element-wise squared), small
 $s_{db} = \beta s_{db} + (1-\beta) d b^2$ large
@@ Line 70: / Line 72: @@
 ===== Adam =====
+Adaptive moment estimation
+Momentum + RMSprop + Bias correction
+  * $\alpha$: to be tuned
+  * $\beta_1$: 0.9
+  * $\beta_2$: 0.999
+  * $\epsilon$: $10^{-8}$ (small constant that avoids division by zero)
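The combination of momentum, RMSprop and bias correction can be sketched as below. This is a minimal illustration and not taken from the page itself: the function name `adam_step`, the toy objective and the learning rate 0.01 are assumptions, and the small constant from the list above is called `eps` in the code.

```python
import math


def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters w with gradients dw.

    v and s hold the running first and second moment estimates,
    t is the 1-based step counter. Returns the updated (w, v, s).
    """
    new_w, new_v, new_s = [], [], []
    for wi, gi, vi, si in zip(w, dw, v, s):
        vi = beta1 * vi + (1 - beta1) * gi        # momentum term
        si = beta2 * si + (1 - beta2) * gi * gi   # RMSprop term
        v_hat = vi / (1 - beta1 ** t)             # bias correction
        s_hat = si / (1 - beta2 ** t)
        new_w.append(wi - alpha * v_hat / (math.sqrt(s_hat) + eps))
        new_v.append(vi)
        new_s.append(si)
    return new_w, new_v, new_s


# Toy objective (assumed for illustration): f(w) = w0^2 + 10 * w1^2
w, v, s = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 2001):
    grad = [2 * w[0], 20 * w[1]]
    w, v, s = adam_step(w, grad, v, s, t, alpha=0.01)
```

Because of the bias correction, the very first step has a magnitude of roughly $\alpha$ regardless of the gradient scale, which is one reason the default $\beta$ values rarely need tuning.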
+===== Learning rate decay =====
+$\alpha = \frac{1}{1 + \text{decay_rate} \cdot \text{epoch_num}} \alpha_0$
+or
+$\alpha = 0.95^{\text{epoch_num}} \alpha_0$
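Both schedules can be written as small helper functions. This is only a sketch: the function names and the example values for $\alpha_0$ and the decay rate are assumptions chosen for illustration.

```python
def inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)


def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base^epoch_num * alpha0"""
    return (base ** epoch_num) * alpha0


# Example values (assumed): alpha0 = 0.2, decay_rate = 1.0
for epoch in range(4):
    print(epoch, inverse_decay(0.2, 1.0, epoch), exponential_decay(0.2, epoch))
```

With these example values the first schedule halves the learning rate after one epoch, while the exponential schedule shrinks it by 5% per epoch.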
+===== Saddle points =====
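+In high-dimensional spaces, a point with zero gradient is far more likely to be a saddle point than a local optimum. E.g. with 20,000 parameters, it is highly unlikely that the point where you get stuck is a local minimum, because the loss would have to curve upward in every one of those directions. Plateaus (regions where the gradient stays close to zero) make learning slow.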
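A two-dimensional toy case makes the plateau effect visible. The function $f(x, y) = x^2 - y^2$, the starting point and the learning rate below are assumptions picked for illustration. The origin is a saddle point: the gradient is zero there, but the curvature is positive along $x$ and negative along $y$.

```python
import math


def grad(x, y):
    """Gradient of f(x, y) = x^2 - y^2."""
    return 2 * x, -2 * y


def descend(x, y, lr, steps):
    """Plain gradient descent for a fixed number of steps."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Start almost exactly on the ridge that leads into the saddle.
x20, y20 = descend(1.0, 1e-6, lr=0.1, steps=20)
norm20 = math.hypot(*grad(x20, y20))   # gradient norm near the saddle

x100, y100 = descend(1.0, 1e-6, lr=0.1, steps=100)

print(norm20)        # small: progress has slowed down near the saddle
print(abs(y100))     # large: the iterate eventually escapes along y
```

Plain gradient descent is not stuck forever here, but it spends many steps with a tiny gradient before the negative-curvature direction takes over.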