data_mining:neural_network:model_combination

====== Mixtures of Experts ======

Suitable for large data sets.

Look at the input data to decide which model to rely on.
=> Specialization

Spectrum of models:

  * Local models, like nearest neighbors
  * Global models, like polynomial regression (each parameter depends on all the data)

Use models of intermediate complexity. How do we fit data that comes from different regimes?

We need to cluster the training cases into subsets, one for each local model. Each cluster should have an input-output relationship that can be well modeled by one local model.

Error function that encourages cooperation:

  * Compare the **average** of all the predictors with the target and train to **reduce the discrepancy**.

$E = (t - \langle y_i \rangle_i)^2$

Error function that encourages specialization:

  * Compare each predictor **separately** with the target. A manager (gating network) determines the probability $p_i$ of picking each expert.

$E = \sum_i p_i (t - y_i)^2$
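
A minimal numpy sketch of the two error functions (the expert outputs, gating probabilities and target below are made-up illustrative values, and the softmax manager is an assumption):

<code python>
import numpy as np

def cooperation_error(y, t):
    """Squared error of the averaged prediction: E = (t - <y_i>_i)^2."""
    return (t - y.mean()) ** 2

def specialization_error(y, p, t):
    """Manager-weighted squared errors: E = sum_i p_i (t - y_i)^2."""
    return np.sum(p * (t - y) ** 2)

# Made-up example: three experts, one training case.
y = np.array([1.0, 2.5, 4.0])               # expert predictions
logits = np.array([0.2, 1.0, -0.5])         # manager scores for this input
p = np.exp(logits) / np.exp(logits).sum()   # softmax gating probabilities
t = 2.0                                     # target

print(cooperation_error(y, t))        # pushes the experts to agree on the average
print(specialization_error(y, p, t))  # trains each expert mainly where it is picked
</code>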

====== Full Bayesian Learning ======

Instead of finding the best single setting of the parameters (as in maximum likelihood / MAP), compute the **full posterior distribution** over **all possible parameter settings**.

Advantage: Complicated models can be used, even when only little data is available.

"If you don't have much data, you should use a simple model."
  * True, but only if you assume that fitting a model means choosing a **single best setting** of the **parameters** (ML learning of the $w$ that maximizes $P(\text{Data}|w)$).

If you use the full posterior distribution, overfitting disappears: you get rather vague predictions, because many different parameter settings have significant posterior probability (learn $P(w|\text{Data})$).

Example: learn a whole distribution over polynomials and average their predictions.

====== Approximating full Bayesian learning in a NN ======

  * For a NN with few parameters, put a grid over the parameter space and evaluate $p(W|D)$ at each grid point (expensive, but no local optimum issues).
  * After evaluating each grid point, use all of them to make predictions on test data.
    * Expensive, but works much better than ML learning when the posterior is vague or multimodal (i.e. when data is scarce).
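
Toy sketch of the grid idea for an assumed one-parameter model $y = w x$ with Gaussian likelihood and a standard normal prior (model, noise level and grid range are purely illustrative):

<code python>
import numpy as np

# Tiny assumed data set for a one-parameter model y = w * x.
x = np.array([0.5, 1.0, 1.5])
y = np.array([0.4, 1.1, 1.4])
sigma = 0.3                        # assumed observation noise

w_grid = np.linspace(-3, 3, 601)   # grid over the single parameter

# Unnormalized log posterior at each grid point: log prior + log likelihood.
log_prior = -0.5 * w_grid ** 2     # standard normal prior
log_lik = np.array([-0.5 * np.sum((y - w * x) ** 2) / sigma ** 2 for w in w_grid])
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                 # p(W|D) evaluated on the grid

# Prediction for a test input: average the grid points' predictions,
# weighted by their posterior probability.
x_test = 2.0
print(np.sum(post * (w_grid * x_test)))
</code>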

Monte Carlo method:

Idea: It might be good enough to sample weight vectors according to their posterior probabilities.

$p(y_{\text{test}} | \text{input}_\text{test}, D) = \sum_i p(W_i|D) \, p(y_{\text{test}} | \text{input}_\text{test}, W_i)$

So instead of summing over all grid points, sample weight vectors $W_i$ according to $p(W_i|D)$ and average their predictions.
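
With samples $W_i$ drawn from $p(W|D)$, the sum becomes a plain average over the samples; a small sketch (the ''predict_fn'' and the sample values are hypothetical placeholders):

<code python>
import numpy as np

def mc_predict(weight_samples, predict_fn, x_test):
    """Approximate p(y_test | x_test, D) by averaging predictions over weight
    vectors W_i sampled from p(W|D); each sample gets equal weight because the
    sampling already reflects the posterior probabilities."""
    preds = [predict_fn(w, x_test) for w in weight_samples]
    return np.mean(preds, axis=0)

# Hypothetical usage with a linear "network" y = w . x:
samples = [np.array([0.9, 0.1]), np.array([1.1, -0.2]), np.array([1.0, 0.0])]
print(mc_predict(samples, lambda w, x: w @ x, np.array([2.0, 1.0])))
</code>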

In backpropagation, we keep moving the weights in the direction that decreases the cost.

With sampling: add some Gaussian noise to the weight vector after each update.

Markov Chain Monte Carlo:

If we use just the right amount of noise, and if we let the weight vector wander around for long enough before we take a sample, we get an unbiased sample from the true posterior over weight vectors.

There are more complicated and more effective methods than this basic MCMC approach that don't need to wander the space for as long.

If we compute the gradient of the cost function on a **random mini-batch**, we get an unbiased estimate of the full gradient, plus sampling noise.
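
A rough sketch of the noisy-gradient idea in the spirit of Langevin-style MCMC, on an assumed toy linear model; the step size, noise scale and burn-in below are illustrative guesses, not a tuned sampler:

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy problem: linear model y = w . x, Gaussian noise, standard normal prior on w.
X = rng.normal(size=(200, 2))
true_w = np.array([1.0, -2.0])
sigma = 0.1
Y = X @ true_w + sigma * rng.normal(size=200)

def grad_neg_log_post(w, xb, yb, n_total):
    """Mini-batch estimate of the gradient of -log p(w|D): rescale the
    mini-batch likelihood gradient to the full data set, then add the prior term."""
    g_lik = (n_total / len(xb)) * (xb.T @ (xb @ w - yb)) / sigma ** 2
    g_prior = w                                  # from the standard normal prior
    return g_lik + g_prior

w = np.zeros(2)
eps = 1e-5                                       # step size ("the right amount of noise")
samples = []

for step in range(5000):
    idx = rng.choice(len(X), size=20, replace=False)   # random mini-batch
    g = grad_neg_log_post(w, X[idx], Y[idx], len(X))
    # Gradient step plus Gaussian noise after each update.
    w = w - 0.5 * eps * g + np.sqrt(eps) * rng.normal(size=2)
    if step >= 2000 and step % 10 == 0:          # discard burn-in, thin the chain
        samples.append(w.copy())

# Average predictions over the collected weight samples for a test input.
x_test = np.array([0.5, 0.5])
print(np.mean([s @ x_test for s in samples]))
</code>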

====== Dropout ======
See [[data_mining:neural_network:regularization|Regularization]]
  