--- data_mining:neural_network:tuning [2018/05/12 22:40] – [Scales] phreazer
+++ data_mining:neural_network:tuning [2018/05/20 15:21] (current) – [Batch norm at test time] phreazer
@@ Line 228: / Line 228: @@
 ===== Batch normalization =====
+  * Normally applied with mini-batches
+Can the values inside hidden layers be normalized as well? $Z^{[l]}(i)...$
+First mini-batch $X^{\{1\}}$:
+\[X^{\{1\}} \mathrel{\mathop{\rightarrow}^{\mathrm{W^{1}, b^{1}}}} Z^{1} \mathrel{\mathop{\rightarrow}^{\mathrm{\beta^{1}, \gamma^{1}}}_{\mathrm{BN}}} \tilde{Z}^{1} \mathrel{\mathop{\rightarrow}} a^{1} = g^{1}(\tilde{Z}^{1}) \mathrel{\mathop{\rightarrow}^{\mathrm{W^{2}, b^{2}}}} Z^{2} \mathrel{\mathop{\rightarrow}} \dots \]
+\[X^{\{2\}}  \mathrel{\mathop{\rightarrow}} \dots \]
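+$Z^l = W^l a^{l-1} + b^l$
+But batch norm subtracts the mini-batch mean, which cancels any constant offset, so $b^l$ is not needed; $\beta^l$ takes over the role of the shift:
+$Z^l = W^l a^{l-1}$
+$Z^l_{\text{norm}} = \frac{Z^l - \mu}{\sqrt{\sigma^2 + \epsilon}}$
+$\tilde{Z}^l = \gamma^l Z^{l}_{\text{norm}} + \beta^l$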
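+The normalization step can be illustrated with a minimal sketch for a single hidden unit over one mini-batch. The function name ''batch_norm_forward'' and the value of ''eps'' are assumptions made for this illustration only; the example also checks that a constant bias added to $Z$ has no effect on $\tilde{Z}$.

```python
import math

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Batch norm for the pre-activations z of one hidden unit over a mini-batch."""
    m = len(z)
    mu = sum(z) / m                                   # mini-batch mean
    var = sum((v - mu) ** 2 for v in z) / m           # mini-batch variance
    z_norm = [(v - mu) / math.sqrt(var + eps) for v in z]
    z_tilde = [gamma * v + beta for v in z_norm]      # learnable scale and shift
    return z_tilde, mu, var

# A constant bias shifts z and its mean by the same amount, so it cancels out.
z = [1.0, 2.0, 3.0, 4.0]
out_no_bias, _, _ = batch_norm_forward(z, gamma=2.0, beta=0.5)
out_bias, _, _ = batch_norm_forward([v + 10.0 for v in z], gamma=2.0, beta=0.5)
```

+Both outputs agree up to rounding, which is why $b^l$ can be dropped when batch norm follows the linear step.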
+==== Algo ====
+  * For mini-batch $t$:
+    * Compute forward prop for $X^{\{t\}}$
+      * In each hidden layer use BN to replace $Z^l$ with $\tilde{Z}^l$
+    * Use backprop to compute $dW^{l}, d\beta^{l}, d\gamma^{l}$
+    * Update parameters ...
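+The loop above can be sketched for a toy model with one linear unit followed by batch norm. Everything specific here is an assumption for illustration: the identity activation, the squared-error loss, plain gradient descent as the update rule, the toy data, and all function names.

```python
import math

def forward(w, gamma, beta, X, eps=1e-8):
    z = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]   # Z = W a, no bias b
    m = len(z)
    mu = sum(z) / m
    std = math.sqrt(sum((v - mu) ** 2 for v in z) / m + eps)
    z_norm = [(v - mu) / std for v in z]
    z_tilde = [gamma * v + beta for v in z_norm]            # BN output, used as prediction
    return z_tilde, z_norm, std

def loss(y_hat, y):
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / (2 * len(y))

def backward(gamma, X, y, y_hat, z_norm, std):
    m = len(y)
    d_tilde = [(a - b) / m for a, b in zip(y_hat, y)]       # dL/dZ_tilde
    d_gamma = sum(d * n for d, n in zip(d_tilde, z_norm))
    d_beta = sum(d_tilde)
    d_norm = [d * gamma for d in d_tilde]                   # dL/dZ_norm
    s1 = sum(d_norm)
    s2 = sum(d * n for d, n in zip(d_norm, z_norm))
    # Backprop through the normalization (mean and variance depend on every z)
    d_z = [(m * d - s1 - n * s2) / (m * std) for d, n in zip(d_norm, z_norm)]
    d_w = [sum(dz * x[j] for dz, x in zip(d_z, X)) for j in range(len(X[0]))]
    return d_w, d_gamma, d_beta

def train(batches, w, gamma, beta, lr=0.1, epochs=50):
    for _ in range(epochs):
        for X, y in batches:                                # for mini-batch t
            y_hat, z_norm, std = forward(w, gamma, beta, X)
            d_w, d_gamma, d_beta = backward(gamma, X, y, y_hat, z_norm, std)
            w = [wj - lr * dj for wj, dj in zip(w, d_w)]    # gradient descent update
            gamma -= lr * d_gamma
            beta -= lr * d_beta
    return w, gamma, beta

# Toy data: the target is the first input feature.
batches = [
    ([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]], [1.0, 2.0, 3.0, 4.0]),
    ([[2.0, 2.0], [1.0, 3.0], [4.0, 1.0], [3.0, 3.0]], [2.0, 1.0, 4.0, 3.0]),
]
w, gamma, beta = train(batches, w=[0.5, 0.5], gamma=1.0, beta=0.0)
```

+Note that there is no $db^l$: only $W$, $\gamma$ and $\beta$ are updated. Other optimizers (momentum, RMSprop, Adam) could replace the plain gradient step.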
+==== Why does it work ====
+Covariate shift (the input distribution shifts)
+  * Batch norm reduces how much the distribution of hidden unit values shifts around, so the inputs to later layers become more stable
+  * Slight regularization effect: adds some noise, because mean and variance are computed on each mini-batch only
+==== Batch norm at test time ====
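+At test time there is no mini-batch; samples may be processed one at a time.
+Estimate $\mu, \sigma^2$ with an exponentially weighted average across the mini-batches seen during training, and use these estimates for normalization at test time.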
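+A minimal sketch of this idea for a single hidden unit follows. The function names, the ''momentum'' value of 0.9, the initial running values and the toy batches are assumptions for illustration, not part of the original notes.

```python
import math

def update_running_stats(running_mu, running_var, z, momentum=0.9):
    """Exponentially weighted average of mini-batch mean and variance (training)."""
    m = len(z)
    mu = sum(z) / m
    var = sum((v - mu) ** 2 for v in z) / m
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batch_norm_inference(z_single, running_mu, running_var, gamma, beta, eps=1e-8):
    """Normalize one sample with the stored estimates instead of batch statistics."""
    z_norm = (z_single - running_mu) / math.sqrt(running_var + eps)
    return gamma * z_norm + beta

# During training: update the estimates after every mini-batch.
running_mu, running_var = 0.0, 1.0
for z_batch in ([1.0, 3.0], [2.0, 4.0], [1.0, 5.0]):
    running_mu, running_var = update_running_stats(running_mu, running_var, z_batch)

# At test time: a single sample, no mini-batch statistics available.
prediction = batch_norm_inference(2.5, running_mu, running_var, gamma=1.0, beta=0.0)
```

+With only a few updates the estimates are still biased toward their initial values; in practice they are accumulated over many mini-batches.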