data_mining:neural_network:tuning, revisions 2018/05/20 14:59 to 2018/05/20 15:21 (current), by phreazer
Can values be normalized in layers? $Z^{[l]}(i)...$
First mini-batch $X^{\{1\}}$:
\[X^{\{1\}} \mathrel{\mathop{\rightarrow}^{\mathrm{W^{1}, b^{1}}}} Z^{1} \mathrel{\mathop{\rightarrow}^{\mathrm{\beta^{1}, \gamma^{1}}}} \tilde{Z}^{1} \rightarrow \dots\]
\[X^{\{2\}} \mathrel{\mathop{\rightarrow}^{\mathrm{W^{1}, b^{1}}}} Z^{1} \mathrel{\mathop{\rightarrow}^{\mathrm{\beta^{1}, \gamma^{1}}}} \tilde{Z}^{1} \rightarrow \dots\]
$Z^l = W^l a^{l-1} + b^l$

But since batch norm subtracts the mini-batch mean, the constant $b^l$ is cancelled out; it can be dropped, and $\beta^l$ takes over its role:

$\tilde{Z}^l = \gamma^l Z^l_{norm} + \beta^l$

where $Z^l_{norm}$ is $Z^l$ normalized to zero mean and unit variance over the mini-batch.
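The transform above can be sketched in NumPy. This is a minimal sketch, not the source's implementation; the function name, the layout (rows = hidden units, columns = mini-batch examples), and the `eps` value are assumptions:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Hypothetical helper: normalize each hidden unit over the mini-batch
    # (rows = units, columns = examples), then rescale with the learnable
    # scale gamma and shift beta.
    mu = Z.mean(axis=1, keepdims=True)       # per-unit mini-batch mean
    var = Z.var(axis=1, keepdims=True)       # per-unit mini-batch variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta          # tilde(Z) = gamma * Z_norm + beta
    return Z_tilde, Z_norm, mu, var
```

Because the mean is subtracted here, any bias $b^l$ added before this step cancels out, which is why $\beta^l$ replaces it.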
+ | |||
+ | ==== Algo ==== | ||
+ | |||
+ | * For minibatch t: | ||
+ | * Compute forward prop for $X^{\{t\}}$ | ||
+ | * In each hidden layer use BN to replace $Z^l$ with $\tilde{Z}^l$ | ||
+ | * Use backprop to compute $dW^{l}, d\beta^{l}, d\gamma^{l}$ | ||
+ | * Update parameters ... | ||
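The backprop step in the list above (computing $d\gamma^l, d\beta^l$ and the gradient flowing back through the normalization) can be sketched as follows. This is the generic batch-norm backward derivation, not taken from the source; the function name and the layout (rows = units, columns = examples) are assumptions:

```python
import numpy as np

def batch_norm_backward(dZ_tilde, Z_norm, gamma, var, eps=1e-8):
    # Gradients for the transform tilde(Z) = gamma * Z_norm + beta.
    m = dZ_tilde.shape[1]  # mini-batch size
    dgamma = (dZ_tilde * Z_norm).sum(axis=1, keepdims=True)
    dbeta = dZ_tilde.sum(axis=1, keepdims=True)
    dZ_norm = dZ_tilde * gamma
    # Backprop through the mean/variance normalization itself
    dZ = (1.0 / (m * np.sqrt(var + eps))) * (
        m * dZ_norm
        - dZ_norm.sum(axis=1, keepdims=True)
        - Z_norm * (dZ_norm * Z_norm).sum(axis=1, keepdims=True)
    )
    return dZ, dgamma, dbeta
```

Note that $d\beta^l$ is just the upstream gradient summed over the mini-batch, mirroring how $\beta^l$ enters additively.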
+ | |||
+ | ==== Why does it work ==== | ||
+ | |||
+ | Covariance shift (shifting input distribution) | ||
+ | |||
+ | * Batch norm reduces amount in which hidden units shifts around, become more stable (input to later layers) | ||
+ | * Slight regularization effect: Adds some noise, because it's normed on the mini batch | ||
+ | |||
+ | ==== Batch norm at test time ==== | ||
+ | |||
+ | Here no mini-batch, but one sample at a time | ||
+ | |||
+ | Estimate $\sigma^2, \mu$ using exponentially weighted average across mini-batches |
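A minimal sketch of this test-time procedure, assuming a momentum-style exponentially weighted average; the function names and the momentum value 0.9 are assumptions, not from the source:

```python
import numpy as np

def update_running_stats(running_mu, running_var, mu_batch, var_batch, momentum=0.9):
    # Called once per mini-batch during training: exponentially weighted
    # average of the per-batch mean and variance (momentum is assumed).
    running_mu = momentum * running_mu + (1 - momentum) * mu_batch
    running_var = momentum * running_var + (1 - momentum) * var_batch
    return running_mu, running_var

def batch_norm_test(z, running_mu, running_var, gamma, beta, eps=1e-8):
    # A single test example: normalize with the running estimates,
    # not with batch statistics (there is no batch here).
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta
```

The running estimates converge toward the statistics of the training distribution, so a lone test example is normalized consistently with how training batches were.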