data_mining:neural_network:model_combination

More complicated and effective methods than the MCMC method exist: we don't need to wander the space for long.
  
If we compute the gradient of the cost function on a **random mini-batch**, we will get an unbiased estimate with sampling noise.

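A quick sketch of that claim (my own illustration, using a made-up linear-regression cost and batch size 32): mini-batch gradients are noisy, but they average out to the full-batch gradient.

<code python>
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))                       # synthetic inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(1000)
w = np.zeros(3)                                          # current parameters

def grad(Xb, yb, w):
    # gradient of the squared-error cost 0.5 * mean((Xb @ w - yb)**2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad(X, y, w)                                # full-batch gradient
mini_grads = [grad(X[i], y[i], w)
              for i in (rng.choice(1000, size=32) for _ in range(5000))]

print(full_grad)
print(np.mean(mini_grads, axis=0))   # averages out to the full-batch gradient; each single draw is noisy
</code>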
====== Dropout ======
Ways to combine the output of multiple models (sketched below):
  * MIXTURE: combine models by averaging their output probabilities.
  * PRODUCT: combine models by taking the geometric mean of their output probabilities and renormalizing, e.g. for two models $\frac{\sqrt{x \cdot y}}{\sum \sqrt{x \cdot y}}$ (the unnormalized values are typically less than one, so they do not sum to one).
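A minimal numpy sketch of the two combination rules, using two made-up output distributions p and q:

<code python>
import numpy as np

p = np.array([0.7, 0.2, 0.1])        # output probabilities of model 1 (made up)
q = np.array([0.4, 0.5, 0.1])        # output probabilities of model 2 (made up)

mixture = (p + q) / 2                # MIXTURE: arithmetic mean, already sums to 1
geo = np.sqrt(p * q)                 # PRODUCT: geometric mean, sums to less than 1
product = geo / geo.sum()            # renormalize to get a distribution

print(mixture)                       # [0.55 0.35 0.1 ]
print(geo.sum())                     # ~0.95 (< 1, hence the renormalization)
print(product)                       # ~[0.56 0.33 0.11]
</code>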

Consider a NN with one hidden layer.
Randomly omit each hidden unit with probability 0.5, for each training sample.
This is randomly sampling from $2^H$ architectures.

We are sampling from $2^H$ models, and each model only gets one training example (extreme bagging).
Sharing of the weights means that every model is very strongly regularized.

What to do at test time?

Use all hidden units, but halve their outgoing weights. This exactly computes the geometric mean of the predictions of all $2^H$ models.
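A small numerical check of this claim (my own sketch with random weights, assuming a softmax output layer): the renormalized geometric mean over all $2^H$ dropped-out predictions coincides with the prediction of the net with halved outgoing weights.

<code python>
import itertools
import numpy as np

rng = np.random.default_rng(0)
H, C = 4, 3                           # H hidden units (small, so all 2^H masks can be enumerated), C classes
h = rng.random(H)                     # hidden activations for one input
W = rng.standard_normal((C, H))       # outgoing (hidden -> softmax) weights
b = rng.standard_normal(C)            # output bias (not halved)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Predictions of all 2^H dropped-out models, then their renormalized geometric mean.
preds = [softmax(W @ (np.array(m) * h) + b)
         for m in itertools.product([0, 1], repeat=H)]
geo = np.exp(np.mean(np.log(preds), axis=0))
geo /= geo.sum()

# Mean net: all hidden units active, outgoing weights halved.
mean_net = softmax(0.5 * W @ h + b)

print(np.allclose(geo, mean_net))     # True
</code>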

What if we have more hidden layers?

  * Use dropout of 0.5 in every layer.
  * At test time, use the "mean net", which has all outgoing weights halved. This is not the same as averaging all separately dropped-out models, but it is a good approximation (see the sketch below).
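A sketch of that approximation (my own illustration with random weights and a Monte Carlo sample of masks instead of all dropped-out models), for a two-hidden-layer ReLU net:

<code python>
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(5)                                     # one input
W1, b1 = rng.standard_normal((8, 5)), np.zeros(8)     # hidden layer 1
W2, b2 = rng.standard_normal((8, 8)), np.zeros(8)     # hidden layer 2
W3, b3 = rng.standard_normal((3, 8)), np.zeros(3)     # softmax output

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dropped_net(m1, m2):
    # forward pass with 0/1 dropout masks m1, m2 on the two hidden layers
    h1 = relu(W1 @ x + b1) * m1
    h2 = relu(W2 @ h1 + b2) * m2
    return softmax(W3 @ h2 + b3)

# Monte Carlo average over sampled dropped-out models (instead of all of them).
samples = [dropped_net(rng.integers(0, 2, 8), rng.integers(0, 2, 8))
           for _ in range(20000)]
mc_average = np.mean(samples, axis=0)

# Mean net: all units kept, outgoing weights of both hidden layers halved.
h1 = relu(W1 @ x + b1)
h2 = relu(0.5 * W2 @ h1 + b2)
mean_net = softmax(0.5 * W3 @ h2 + b3)

print(mc_average)
print(mean_net)      # an approximation of mc_average, not exactly equal
</code>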

Dropout prevents overfitting.

For each training example: for each node, toss a coin (e.g. with probability 0.5) and eliminate the node accordingly.

===== Inverted dropout =====

Dropout with inverted scaling, shown for layer $l=3$ with activations a3:

<code python>
keep_prob = 0.8                                             # probability of keeping a unit
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # dropout mask for layer 3
a3 = np.multiply(a3, d3)                                    # e.g. 50 units => ~10 units shut off
a3 /= keep_prob                                             # inverted scaling
Z = np.dot(W, a3) + b                                       # a3 was reduced by ~20%; dividing by 0.8 keeps the expected value the same
</code>
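A short check of that last comment (my addition, with made-up activations): averaged over many dropout draws, the inverted scaling keeps the expected value of a3 equal to the original activations, so nothing needs rescaling at test time.

<code python>
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
a3 = rng.random((50, 1))                       # 50 hidden units, one example (made-up activations)

draws = []
for _ in range(10000):
    d3 = rng.random(a3.shape) < keep_prob      # keep each unit with probability 0.8
    draws.append(a3 * d3 / keep_prob)          # inverted dropout scaling

print(np.mean(draws, axis=0)[:5].ravel())      # ~ a3[:5]: expected value is preserved
print(a3[:5].ravel())
</code>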