data_mining:neural_network:tuning

===== How to find good values =====
  
  * Traditional method: grid of hyperparameter combinations
  * Better: random combinations of values (see the sketch after this list)
  * Coarse to fine: zoom in on the smaller region where parameters work well and sample more densely there
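
A minimal sketch of random search with a coarse-to-fine follow-up. The ranges, the number of trials and the evaluate callback are illustrative assumptions, not values from the notes.

<code python>
import random

def random_search(evaluate, trials=25):
    """Randomly sample hyperparameter combinations; return the best trial."""
    best = None
    for _ in range(trials):
        alpha = 10 ** random.uniform(-4, -1)   # learning rate on a log scale (see "Scales")
        n_hidden = random.randint(50, 200)     # unit counts can be sampled uniformly
        score = evaluate(alpha, n_hidden)      # e.g. dev-set accuracy of a trained model
        if best is None or score > best[0]:
            best = (score, alpha, n_hidden)
    return best

# Toy stand-in for "train the network and report a dev-set score":
best_score, best_alpha, best_n = random_search(
    lambda a, n: -(a - 0.01) ** 2 - (n - 120) ** 2 / 1e4)

# Coarse to fine: repeat the search with the ranges narrowed around
# (best_alpha, best_n) to spend the budget where results were good.
print(best_alpha, best_n)
</code>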

==== Scales ====

  * Sample uniformly at random: number of layers, number of units
  * Learning rate $\alpha$: sample on a log scale between 0.0001 and 0.1, i.e. $\alpha = 10^r$ with $r \in [-4, -1]$
  * Exponentially weighted averages: $\beta$ between 0.9 and 0.999 spans averaging over roughly 10 vs. 1000 values, so sample $1-\beta = 10^r$ rather than $\beta$ itself (see the sketch after this list)
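
A sketch of both log-scale samples, using the ranges given above; the exponent bounds are the only assumption.

<code python>
import random

# Learning rate: sample the exponent uniformly, so every decade in
# [0.0001, 0.1] is equally likely.
r = random.uniform(-4, -1)
alpha = 10 ** r

# EWA parameter: sample 1 - beta on a log scale, because behaviour is far
# more sensitive to changes of beta close to 1.
r = random.uniform(-3, -1)      # 1 - beta in [0.001, 0.1]
beta = 1 - 10 ** r              # beta in [0.9, 0.999]

print(alpha, beta)
</code>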

===== Batch normalization =====

  * Normally applied with mini-batches

Can the values inside the hidden layers be normalized as well, i.e. the $Z^{[l](i)}$?

First mini-batch $X^{\{1\}}$:

\[X^{\{1\}} \mathrel{\mathop{\rightarrow}^{W^{[1]}, b^{[1]}}} Z^{[1]} \mathrel{\mathop{\rightarrow}^{\beta^{[1]}, \gamma^{[1]}}_{\mathrm{BN}}} \tilde{Z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{Z}^{[1]}) \mathrel{\mathop{\rightarrow}^{W^{[2]}, b^{[2]}}} Z^{[2]} \rightarrow \dots \]

\[X^{\{2\}} \rightarrow \dots \]

$Z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$

But because batch norm subtracts the mini-batch mean, $b^{[l]}$ is zeroed out and can be dropped; its role is taken over by $\beta^{[l]}$:

$\tilde{Z}^{[l]} = \gamma^{[l]} Z^{[l]}_{\text{norm}} + \beta^{[l]}$

where $Z^{[l]}_{\text{norm}}$ is $Z^{[l]}$ normalized to zero mean and unit variance over the mini-batch.
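
A minimal NumPy sketch of this forward step for one layer, assuming one column per example (shape: units x batch size) and ReLU as the activation; the layer sizes and epsilon are illustrative.

<code python>
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z over the mini-batch, then scale and shift with gamma, beta."""
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta           # learnable scale and shift
    cache = (Z_norm, mu, var, gamma, eps)     # kept for the backward pass
    return Z_tilde, cache

# One layer of forward prop with batch norm; note there is no b^[l],
# since the mean subtraction would cancel it anyway.
rng = np.random.default_rng(0)
m, n_prev, n = 64, 10, 5
A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n, n_prev)) * 0.01
gamma, beta = np.ones((n, 1)), np.zeros((n, 1))

Z = W @ A_prev                     # Z^[l] = W^[l] a^[l-1]
Z_tilde, cache = batchnorm_forward(Z, gamma, beta)
A = np.maximum(0, Z_tilde)         # a^[l] = g^[l](Z~^[l]) with g = ReLU
</code>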

==== Algorithm ====

  * For mini-batch $t$:
    * Compute forward prop on $X^{\{t\}}$
      * In each hidden layer, use batch norm to replace $Z^{[l]}$ with $\tilde{Z}^{[l]}$
    * Use backprop to compute $dW^{[l]}, d\beta^{[l]}, d\gamma^{[l]}$ (see the sketch after this list)
    * Update the parameters $W^{[l]}, \beta^{[l]}, \gamma^{[l]}$
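
A sketch of the backward step through batch norm, matching the cache produced by the forward sketch above. The closed-form gradient for the normalization step is standard; the toy data at the end only checks that the shapes work out.

<code python>
import numpy as np

def batchnorm_backward(dZ_tilde, cache):
    """Gradients of the loss w.r.t. Z, gamma and beta for one batch-normed layer."""
    Z_norm, mu, var, gamma, eps = cache
    m = dZ_tilde.shape[1]
    dgamma = np.sum(dZ_tilde * Z_norm, axis=1, keepdims=True)
    dbeta = np.sum(dZ_tilde, axis=1, keepdims=True)
    dZ_norm = dZ_tilde * gamma
    # Closed form for backprop through the mean/variance normalization:
    dZ = (1.0 / (m * np.sqrt(var + eps))) * (
        m * dZ_norm
        - np.sum(dZ_norm, axis=1, keepdims=True)
        - Z_norm * np.sum(dZ_norm * Z_norm, axis=1, keepdims=True)
    )
    return dZ, dgamma, dbeta

# Toy check with a random mini-batch of 64 examples and 5 units:
rng = np.random.default_rng(1)
Z = rng.standard_normal((5, 64))
gamma, beta, eps = np.ones((5, 1)), np.zeros((5, 1)), 1e-8
mu, var = Z.mean(axis=1, keepdims=True), Z.var(axis=1, keepdims=True)
Z_norm = (Z - mu) / np.sqrt(var + eps)
dZ, dgamma, dbeta = batchnorm_backward(rng.standard_normal((5, 64)),
                                       (Z_norm, mu, var, gamma, eps))

# Update step for mini-batch t (plain gradient descent as one option):
lr = 0.01
gamma -= lr * dgamma
beta -= lr * dbeta
</code>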

==== Why does it work ====

Covariate shift (a shifting input distribution):

  * Batch norm reduces the amount by which the hidden unit values shift around, so the inputs to later layers become more stable
  * Slight regularization effect: the mean and variance are computed on each mini-batch, which adds some noise to the hidden unit values

==== Batch norm at test time ====

At test time there is no mini-batch, only one example at a time, so $\mu$ and $\sigma^2$ are estimated instead with an exponentially weighted average over the mini-batches seen during training.
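
A sketch of that estimate, reusing the forward pass above; the momentum value 0.9 is an assumption, any exponentially weighted (or simple running) average of the batch statistics works.

<code python>
import numpy as np

def update_running_stats(mu_run, var_run, mu_batch, var_batch, momentum=0.9):
    """Exponentially weighted average of the batch statistics, updated once per mini-batch."""
    mu_run = momentum * mu_run + (1 - momentum) * mu_batch
    var_run = momentum * var_run + (1 - momentum) * var_batch
    return mu_run, var_run

def batchnorm_test(z, mu_run, var_run, gamma, beta, eps=1e-8):
    """Normalize a single example with the running estimates instead of batch statistics."""
    z_norm = (z - mu_run) / np.sqrt(var_run + eps)
    return gamma * z_norm + beta

# During training, after each mini-batch:
mu_run, var_run = np.zeros((5, 1)), np.ones((5, 1))
Z = np.random.default_rng(2).standard_normal((5, 64))
mu_run, var_run = update_running_stats(mu_run, var_run,
                                       Z.mean(axis=1, keepdims=True),
                                       Z.var(axis=1, keepdims=True))

# At test time, for a single example z of shape (5, 1):
z = Z[:, :1]
gamma, beta = np.ones((5, 1)), np.zeros((5, 1))
z_tilde = batchnorm_test(z, mu_run, var_run, gamma, beta)
</code>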
  