data_mining:neural_network:tuning

===== How to find good values =====
  
  * Traditional method: Grid of hyperparameter combinations
  * Coarse to fine: focus on a smaller region of the grid around parameter values that work well
  
==== Scales ====

  * Sampling uniformly at random: number of layers, number of units
  * $\alpha$: sample on a log scale between 0.0001 and 0.1, i.e. pick $r \in [-4, -1]$ uniformly and set $\alpha = 10^r$
  * Exponentially weighted averages: $\beta$ between 0.9 and 0.999 (averaging over roughly 10 vs. 1000 values); sample $r \in [-3, -1]$ uniformly and set $\beta = 1 - 10^r$, as sketched below
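
A minimal numpy sketch of sampling on these scales; the $\alpha$ and $\beta$ ranges follow the bullets above, while the layer/unit ranges and variable names are made-up for illustration:

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Learning rate alpha: sample the exponent r uniformly, then set alpha = 10^r.
# r in [-4, -1] gives alpha between 0.0001 and 0.1.
r = rng.uniform(-4, -1)
alpha = 10 ** r

# EWA parameter beta: sample 1 - beta on a log scale instead of beta itself,
# so values close to 1 (where small changes matter most) are covered evenly.
# r in [-3, -1] gives beta between 0.9 and 0.999.
r = rng.uniform(-3, -1)
beta = 1 - 10 ** r

# Number of layers / units: plain uniform sampling is fine.
n_layers = rng.integers(2, 6)      # 2..5 hidden layers (illustrative range)
n_units = rng.integers(50, 101)    # 50..100 units per layer (illustrative range)

print(alpha, beta, n_layers, n_units)
</code>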

===== Batch normalization =====

  * Normally applied with mini-batches

Can the values $Z^{[l](i)}$ be normalized in the hidden layers?

First mini-batch $X^{\{1\}}$:
\[X^{\{1\}} \xrightarrow{W^{1}, b^{1}} Z^{1} \xrightarrow[\text{BN}]{\beta^{1}, \gamma^{1}} \tilde{Z}^{1} \rightarrow a^{1} = g^{1}(\tilde{Z}^{1}) \xrightarrow{W^{2}, b^{2}} Z^{2} \rightarrow \dots \]

\[X^{\{2\}} \rightarrow \dots \]

$Z^{l} = W^{l} a^{l-1} + b^{l}$

But batch norm subtracts the mini-batch mean, which cancels any constant $b^{l}$; the bias is therefore dropped and $\beta^{l}$ takes over its role.
$Z^{l}_{\text{norm}} = \frac{Z^{l} - \mu}{\sqrt{\sigma^{2} + \epsilon}}$, $\tilde{Z}^{l} = \gamma^{l} Z^{l}_{\text{norm}} + \beta^{l}$
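
A minimal numpy sketch of the per-layer transform (function name, array shapes and $\epsilon$ are assumptions, not from the source): the mini-batch mean and variance are computed per unit, $Z^{l}$ is normalized, and only then scaled and shifted by the learnable $\gamma^{l}$ and $\beta^{l}$; no bias $b^{l}$ appears.

<code python>
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-normalize one layer's pre-activations Z (shape: units x batch size)."""
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta           # learnable scale and shift
    return Z_tilde, (Z_norm, mu, var)         # cache is needed later for backprop

# Toy usage: a layer with 4 units and a mini-batch of 32 examples
rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 32))
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
Z_tilde, cache = batch_norm_forward(Z, gamma, beta)
</code>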

==== Algo ====

  * For mini-batch $t$:
    * Compute forward prop for $X^{\{t\}}$
      * In each hidden layer use BN to replace $Z^{l}$ with $\tilde{Z}^{l}$
    * Use backprop to compute $dW^{l}, d\beta^{l}, d\gamma^{l}$
    * Update parameters $W^{l}, \beta^{l}, \gamma^{l}$ (works with gradient descent, momentum, RMSprop, Adam)

==== Why does it work ====

Covariate shift (the input distribution shifts over time)

  * Batch norm reduces the amount by which the hidden unit values shift around, so the inputs to later layers become more stable
  * Slight regularization effect: it adds some noise, because the mean and variance are estimated on each mini-batch

==== Batch norm at test time ====

At test time there is no mini-batch; predictions are often made one example at a time.

Estimate $\mu$ and $\sigma^{2}$ with an exponentially weighted average across the mini-batches seen during training, and use these estimates to normalize at test time.
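
A minimal numpy sketch of this idea (the function names and the momentum value 0.9 are assumptions): during training the per-layer batch statistics are folded into running averages, and at test time a single example is normalized with those running estimates instead of mini-batch statistics.

<code python>
import numpy as np

def update_running_stats(run_mu, run_var, mu, var, momentum=0.9):
    """Exponentially weighted average of the batch statistics (one layer)."""
    run_mu = momentum * run_mu + (1 - momentum) * mu
    run_var = momentum * run_var + (1 - momentum) * var
    return run_mu, run_var

def batch_norm_inference(z, run_mu, run_var, gamma, beta, eps=1e-8):
    """Normalize a single example with the running estimates, not batch stats."""
    z_norm = (z - run_mu) / np.sqrt(run_var + eps)
    return gamma * z_norm + beta

# Toy usage: accumulate statistics over "training" mini-batches, then score one example
rng = np.random.default_rng(0)
run_mu, run_var = np.zeros((4, 1)), np.ones((4, 1))
for _ in range(100):
    Z = rng.normal(loc=2.0, scale=3.0, size=(4, 32))   # stand-in for a layer's Z^l
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    run_mu, run_var = update_running_stats(run_mu, run_var, mu, var)

x = rng.normal(loc=2.0, scale=3.0, size=(4, 1))        # one test example
out = batch_norm_inference(x, run_mu, run_var, np.ones((4, 1)), np.zeros((4, 1)))
</code>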