There are more complicated and effective methods than the MCMC method: they don't need to wander the space for long.
If we compute the gradient of the cost function on a **random mini-batch**, we get an unbiased estimate of the true gradient plus sampling noise; this sampling noise can supply the noise that an MCMC method needs.
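
A minimal sketch of this idea in the spirit of stochastic gradient Langevin dynamics: mini-batch gradient steps with a bit of injected Gaussian noise wander around the posterior instead of converging to a point estimate. The toy linear-regression cost, step size and noise scale are illustrative assumptions, not part of the original notes.

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data (illustrative assumption)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eps = 0.01          # step size
batch_size = 32
samples = []        # weight samples collected after a burn-in period

for t in range(5000):
    idx = rng.integers(0, len(X), size=batch_size)
    # Mini-batch gradient of the squared-error cost (already noisy due to sampling)
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
    # Gradient step plus injected Gaussian noise: the weights keep wandering
    # around the posterior mode instead of settling on a single point
    w = w - eps * grad + np.sqrt(eps) * 0.1 * rng.normal(size=3)
    if t > 1000:
        samples.append(w.copy())

print(np.mean(samples, axis=0))   # roughly recovers true_w
</code>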

====== Dropout ======
See [[data_mining:neural_network:regularization|Regularization]].

Ways to combine the outputs of multiple models (a small numeric sketch follows the list):

* MIXTURE: Combine models by averaging their output probabilities.
* PRODUCT: Combine models by taking the geometric mean of their output probabilities (typically less than one), e.g. $\sqrt{x \cdot y}$ for two models, renormalized so the combined probabilities sum to 1.
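
A small numpy sketch of both combination rules; the two probability vectors are invented for the example:

<code python>
import numpy as np

# Predicted class probabilities of two models for the same input (illustrative values)
p1 = np.array([0.9, 0.05, 0.05])
p2 = np.array([0.1, 0.8, 0.1])

# MIXTURE: arithmetic mean of the output probabilities
mixture = (p1 + p2) / 2

# PRODUCT: geometric mean, renormalized so it sums to 1
geo = np.sqrt(p1 * p2)
product = geo / geo.sum()

print(mixture)   # [0.5   0.425 0.075]
print(product)   # ~[0.53 0.35 0.12]
</code>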

NN with one hidden layer: randomly omit each hidden unit with probability 0.5, for each training sample. This is randomly sampling from $2^H$ architectures (where $H$ is the number of hidden units).

We are sampling from $2^H$ models, and each model only gets one training example (extreme bagging). The sharing of the weights means that every model is very strongly regularized.

What to do at test time?

Use all hidden units, but halve their outgoing weights. This exactly computes the geometric mean of the predictions of all $2^H$ models.
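
A small numeric check of this claim, assuming a single hidden layer feeding a softmax output; the hidden activations, weights and sizes below are random illustrative values. The (renormalized) geometric mean over all $2^H$ dropped-out nets matches the net with halved outgoing weights exactly:

<code python>
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
H, K = 4, 3                    # hidden units and output classes, small so 2^H is enumerable
h = rng.random(H)              # hidden activations for one input (assumed already computed)
W = rng.normal(size=(H, K))    # hidden-to-output weights
b = rng.normal(size=K)         # output biases

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Geometric mean of the predictions of all 2^H dropped-out nets, renormalized
log_p = np.zeros(K)
for mask in product([0.0, 1.0], repeat=H):
    log_p += np.log(softmax((np.array(mask) * h) @ W + b))
geo = np.exp(log_p / 2 ** H)
geo /= geo.sum()

# "Halve the outgoing weights" net: all hidden units on, weights effectively scaled by 0.5
halved = softmax((0.5 * h) @ W + b)

print(np.allclose(geo, halved))   # True
</code>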

What if we have more hidden layers?

* Use dropout of 0.5 in every layer.
* At test time, use the "mean net" that has all its outgoing weights halved. This is not the same as averaging all of the separate dropped-out models, but it is a good approximation (see the sketch below).
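
A rough illustration of this approximation, using an invented two-hidden-layer net with logistic hidden units and a softmax output: a Monte Carlo average over many dropped-out nets is close, but not identical, to the halved-weight mean net.

<code python>
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random weights for a 5 -> 8 -> 8 -> 3 network (illustrative values)
x = rng.random(5)
W1, W2, W3 = rng.normal(size=(5, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 3))

def forward(scale1=1.0, scale2=1.0, mask1=1.0, mask2=1.0):
    h1 = sigmoid(x @ W1) * mask1 * scale1
    h2 = sigmoid(h1 @ W2) * mask2 * scale2
    return softmax(h2 @ W3)

# Monte Carlo average over dropped-out nets (dropout of 0.5 in both hidden layers)
avg = np.mean([forward(mask1=rng.integers(0, 2, 8), mask2=rng.integers(0, 2, 8))
               for _ in range(20000)], axis=0)

# Mean net: all units kept, outgoing weights effectively halved
mean_net = forward(scale1=0.5, scale2=0.5)

print(avg)       # roughly similar to mean_net, but not identical: it is only an approximation
print(mean_net)
</code>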

Dropout prevents overfitting.

For each training example: for each node, toss a coin (e.g. with probability 0.5) and eliminate the node accordingly.

===== Inverted dropout =====

Example for layer $l=3$ with a keep probability of 0.8:

<code python>
keep_prob = 0.8

# Dropout mask: entries are 1 with probability keep_prob, 0 otherwise
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

# Zero out the dropped units, e.g. with 50 units about 10 are shut off
a3 = np.multiply(a3, d3)

# Scale up by keep_prob
a3 /= keep_prob
</code>

$Z = W a + b$: since a3 was reduced by roughly 20%, dividing by 0.8 standardizes it so that the expected value of $Z$ stays the same.

Making predictions at test time: no dropout is applied, and no extra scaling is needed because of the division by keep_prob during training.