data_mining:neural_network:model_combination

There are more complicated and effective methods than the MCMC method: they don't need to wander the space for as long.
  
If we compute the gradient of the cost function on a **random mini-batch**, we get an unbiased estimate with sampling noise.

====== Dropout ======
Ways to combine the outputs of multiple models (see the sketch below):
  * MIXTURE: Combine models by averaging their output probabilities.
  * PRODUCT: Combine models by the geometric mean of their output probabilities; the unnormalized values are typically less than one, so renormalize, e.g. for two models with outputs $x$ and $y$: $\frac{\sqrt{x_i y_i}}{\sum_j \sqrt{x_j y_j}}$.
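
A minimal numpy sketch of both rules for two models' predicted class probabilities (the concrete values are just illustrative):

<code python>
import numpy as np

# Predicted class probabilities of two models for one example (illustrative values)
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.4, 0.5, 0.1])

# MIXTURE: arithmetic mean of the output probabilities
mixture = (p1 + p2) / 2

# PRODUCT: geometric mean per class, renormalized so the result sums to one
geo = np.sqrt(p1 * p2)          # each entry is < 1, and the vector typically sums to less than 1
product = geo / geo.sum()

print(mixture, product)
</code>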

NN with one hidden layer of $H$ units: randomly omit each hidden unit with probability 0.5, for each training sample. This means randomly sampling from $2^H$ architectures (see the sketch below).

We are sampling from $2^H$ models, and each sampled model only gets one training example (extreme bagging). Sharing of the weights means that every model is very strongly regularized.
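
A sketch of the training-time sampling for a one-hidden-layer net (the sizes, ReLU nonlinearity and weight initialization are my own illustrative choices):

<code python>
import numpy as np

rng = np.random.default_rng(0)

D, H, C = 20, 100, 3                     # input, hidden and output sizes (illustrative)
W1, b1 = 0.1 * rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = 0.1 * rng.standard_normal((C, H)), np.zeros(C)

def forward_train(x):
    h = np.maximum(0, W1 @ x + b1)       # hidden activations (ReLU assumed)
    mask = rng.random(H) < 0.5           # toss a coin per hidden unit, for this training example
    return W2 @ (h * mask) + b2          # omitted units contribute nothing

# Every training example sees one of the 2^H possible architectures,
# but all of these "models" share the same weights W1 and W2.
y = forward_train(rng.standard_normal(D))
</code>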

What to do at test time?

Use all hidden units, but halve their outgoing weights. This exactly computes the geometric mean of the predictions of all $2^H$ models (see the check below).
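
A small numerical check of this claim (my own sketch; it assumes a softmax output layer, for which the renormalized geometric mean over all masks matches the halved-weight net):

<code python>
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
H, D, C = 6, 4, 3                        # H kept small so all 2^H masks can be enumerated
W1, b1 = rng.standard_normal((H, D)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((C, H)), rng.standard_normal(C)
x = rng.standard_normal(D)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.maximum(0, W1 @ x + b1)

# Renormalized geometric mean of the predictions of all 2^H dropped-out models
log_p = np.zeros(C)
for mask in product([0.0, 1.0], repeat=H):
    log_p += np.log(softmax(W2 @ (h * np.array(mask)) + b2))
geo_mean = softmax(log_p / 2**H)

# "Mean net": use all hidden units but halve their outgoing weights
mean_net = softmax(0.5 * W2 @ h + b2)

print(np.allclose(geo_mean, mean_net))   # True
</code>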

What if we have more hidden layers?

  * Use dropout of 0.5 in every layer.
  * At test time, use the "mean net" that has all outgoing weights halved. This is not the same as averaging all the separate dropped-out models, but it is a good approximation.

Dropout prevents overfitting.

For each training example: for each node, toss a coin (e.g. with probability 0.5) and eliminate the dropped nodes.

===== Inverted dropout =====

Example for layer $l=3$ with $keep\_prob = 0.8$ (a runnable version follows below):

$d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep\_prob$ // dropout mask

$a3 = np.multiply(a3, d3)$ // e.g. with 50 units, about 10 units are shut off

$a3 /= keep\_prob$ // a3 was reduced by about 20% => dividing by 0.8 keeps its expected value the same

$Z = Wa + b$ // the next layer sees activations with an unchanged expected value
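
The same steps as a self-contained numpy snippet (the shape of the activation matrix a3 is illustrative):

<code python>
import numpy as np

keep_prob = 0.8
a3 = np.random.rand(50, 32)          # activations of layer 3: 50 units, 32 examples (illustrative shape)

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # boolean mask, ~20% of entries are False
a3 = np.multiply(a3, d3)             # shut off the dropped units (on average 10 of the 50 per example)
a3 /= keep_prob                      # invert: scale up so the expected value of a3 is unchanged

# Z4 = np.dot(W4, a3) + b4           # next layer sees activations with the same expected value (W4, b4 not defined here)
</code>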

Making predictions at test time: no dropout (and no scaling, since the inverted scaling was already applied during training).

**Why does it work?**

Intuition: a unit can't rely on any one feature, so it has to spread out its weights => this shrinks the weights.

Different keep_prob values can be set for different layers, e.g. a lower keep_prob for layers with a lot of parameters (see the example below).
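
For example, per-layer keep probabilities could be configured like this (the layer indices and values are purely illustrative):

<code python>
# Per-layer keep probabilities: lower keep_prob where a layer has many parameters
# (higher risk of overfitting), keep_prob = 1.0 where no dropout is wanted.
keep_prob = {
    1: 1.0,   # input layer: usually little or no dropout
    2: 0.7,   # large hidden layers: more aggressive dropout
    3: 0.7,
    4: 0.9,   # smaller layer: mild dropout
    5: 1.0,   # output layer: no dropout
}
</code>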

Dropout is widely used in computer vision.

Downside: the cost function $J$ is no longer well-defined, which makes performance checks (e.g. monitoring that $J$ decreases per iteration) problematic. A practical check is to first set keep_prob to one, verify that $J$ decreases, and then turn dropout on.