===== Gradient Descent with Momentum =====

$V_{db} = \beta V_{db} + (1-\beta) db$

$b = b - \alpha V_{db}$

$\beta$ friction; $V_{db}$ velocity; $db$ acceleration.

Often $\beta = 0.9$ is used.
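
A minimal NumPy sketch of one momentum step, assuming the gradients $dW$ and $db$ have already been computed on the current mini-batch; the function name and default values are illustrative, not from this page:

<code python>
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    # Exponentially weighted average of past gradients ("velocity")
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    # Move the parameters along the smoothed gradient
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db

# Usage: initialize the velocities to zeros of the same shape as the parameters,
# e.g. v_dW, v_db = np.zeros_like(W), np.zeros_like(b)
</code>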
===== RMSprop =====

Root mean square propagation.

Goal: slow down movement in the vertical direction (the $b$ direction) and speed it up in the horizontal direction (the $w$ direction); the loss surface is a bowl that is wider than it is high.
Compute $dW$, $db$ on the current mini-batch.

$s_{dW} = \beta s_{dW} + (1-\beta) dW^2$ (element-wise square); $s_{dW}$ is relatively small

$s_{db} = \beta s_{db} + (1-\beta) db^2$; $s_{db}$ is relatively large

$W = W - \alpha \frac{dW}{\sqrt{s_{dW}}}$ (analogously for $b$)
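
A matching sketch of one RMSprop step; the small constant ''eps'' added to the denominator is a common stability trick and is an assumption beyond the formulas above:

<code python>
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, alpha=0.001, beta=0.9, eps=1e-8):
    # Exponentially weighted average of the element-wise squared gradients
    s_dW = beta * s_dW + (1 - beta) * dW ** 2
    s_db = beta * s_db + (1 - beta) * db ** 2
    # Dividing by the root of the running average damps directions with large
    # gradients (the b direction) and speeds up directions with small ones (w)
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
</code>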
===== Adam =====

Adaptive moment estimation.

Momentum + RMSprop + bias correction.

Hyperparameters:

  * $\alpha$: needs to be tuned
  * $\beta_1$: 0.9
  * $\beta_2$: 0.999
  * $\epsilon$: $10^{-8}$
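A sketch of one Adam step for $W$ (the $b$ update is analogous), combining the momentum and RMSprop moments with bias correction; ''t'' is the 1-based step counter and ''eps'' stands for $\epsilon$ above, with the listed defaults:

<code python>
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum-style first moment and RMSprop-style second moment
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2
    # Bias correction compensates for the zero initialization of v and s
    v_corr = v_dW / (1 - beta1 ** t)
    s_corr = s_dW / (1 - beta2 ** t)
    # Combined update
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, v_dW, s_dW
</code>
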
===== Learning rate decay =====

$\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \alpha_0$

or

$\alpha = 0.95^{\text{epoch\_num}} \alpha_0$

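A small sketch evaluating both schedules; ''alpha0'' and ''decay_rate'' are hypothetical values chosen only for illustration:

<code python>
def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    # alpha = alpha0 / (1 + decay_rate * epoch_num)
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    # alpha = 0.95^epoch_num * alpha0
    return (base ** epoch_num) * alpha0

# e.g. alpha0 = 0.2, decay_rate = 1.0:
# epoch 1 -> 0.1, epoch 2 -> ~0.067, epoch 3 -> 0.05, ...
</code>
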
===== Saddle points =====

In high-dimensional spaces a point of zero gradient is much more likely to be a saddle point than a local optimum. With e.g. 20,000 parameters it is highly unlikely that the loss curves upwards in every dimension, so getting stuck in a local minimum is rare. Plateaus, however, make learning slow.