====== Mixtures of Experts ======

Suitable for large data sets.

Look at the input data to decide which model to rely on.
=> Specialization

Spectrum of models:

* Local models, like nearest neighbors
* Global models, like polynomial regression (each parameter depends on all the data)

Use models of intermediate complexity. How do we fit data from different regimes?

We need to cluster the training cases into subsets, one for each local model. Each cluster should have a relationship between input and output that can be well modeled by one local model.

Error function that encourages cooperation:

* Compare the **average** of all the predictors with the target and train to **reduce the discrepancy**:

$E = (t - \langle y_i \rangle_i)^2$

Problem: each predictor then tries to compensate for the combined errors of all the others, which can pull a predictor away from the target.

Error function that encourages specialization:

Compare each predictor separately with the target. A manager (gating network) determines the probability $p_i$ of picking each expert:

$E = \sum_i p_i (t - y_i)^2$
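
A minimal sketch comparing the two error functions on a toy case (the expert outputs, gating probabilities, and variable names are illustrative, not from these notes):

<code python>
import numpy as np

t = 1.0                                   # target for one training case
preds = np.array([0.6, 1.4, 3.0])         # outputs y_i of three experts
gate_probs = np.array([0.7, 0.2, 0.1])    # manager's probabilities p_i (sum to 1)

# Cooperation: compare the *average* prediction with the target;
# every expert is trained to reduce the same ensemble discrepancy.
e_coop = (t - preds.mean()) ** 2

# Specialization: compare each expert separately with the target,
# weighted by the manager's probability of picking that expert.
e_spec = np.sum(gate_probs * (t - preds) ** 2)

print(e_coop, e_spec)
</code>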

====== Full Bayesian Learning ======

Instead of finding the best single setting of the parameters (as in maximum likelihood / MAP), compute the **full posterior distribution** over **all possible parameter settings**.

Advantage: complicated models can be used, even when only little data is available.

"If you don't have much data, you should use a simple model."
* True, but only if you assume that fitting a model means choosing a **single best setting** of the **parameters** (ML learning: the w that maximizes $P(\text{Data}|w)$).

If you use the full posterior distribution over parameter settings, this kind of overfitting disappears: a vague posterior leads to appropriately vague predictions.

Example: learn a full distribution over lots of polynomials instead of a single best-fitting one; predictions are averaged over this distribution.

====== Approximating full Bayesian learning in a NN ======

* For a NN with few parameters: put a grid over the parameter space and evaluate $p(W|D)$ at each grid point (expensive, but no local optimum issues).
* After evaluating each grid point, use all of them to make predictions on test data.
* Expensive, but works much better than ML learning when the posterior is vague or multimodal (data is scarce).
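
A minimal sketch of the grid approach for a model with a single parameter (the model $y = wx$, the noise level, and the Gaussian prior are assumptions for illustration):

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Tiny data set from y = 2x + Gaussian noise.
x = np.array([0.5, 1.0, 1.5])
y = 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)

# Grid over the single parameter w of the model y = w*x.
w_grid = np.linspace(-5.0, 5.0, 1001)

# Unnormalized log posterior at each grid point:
# Gaussian prior on w plus Gaussian log likelihood of the data.
log_prior = -0.5 * w_grid ** 2
log_lik = np.array([-0.5 * np.sum((y - w * x) ** 2) / 0.3 ** 2 for w in w_grid])
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                        # p(W|D) on the grid

# Prediction on test data: average the predictions of all grid points,
# weighted by their posterior probability.
x_test = 2.0
y_pred = np.sum(post * (w_grid * x_test))
print(y_pred)
</code>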

Monte Carlo method

Idea: it might be good enough to sample weight vectors according to their posterior probabilities.

$p(y_{\text{test}} | \text{input}_{\text{test}}, D) = \sum_i p(W_i|D) \, p(y_{\text{test}} | \text{input}_{\text{test}}, W_i)$

Sample weight vectors according to $p(W_i|D)$; the weighted sum then becomes a plain average over the sampled weight vectors.
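
A minimal sketch of this Monte Carlo estimate (the one-parameter model and the sampled values are made up; a sampler that could produce them is sketched at the end of this section):

<code python>
import numpy as np

def predict(w, x):
    # Stand-in for a network's forward pass; here a one-parameter model.
    return w * x

# Pretend these weight values were drawn from p(W|D) by some sampler.
w_samples = np.array([1.8, 2.1, 1.95, 2.05])

# Samples drawn *according to* the posterior get equal weight,
# so the weighted sum becomes a plain average.
x_test = 2.0
y_test = np.mean([predict(w, x_test) for w in w_samples])
print(y_test)
</code>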

In backpropagation, we keep moving the weights in the direction that decreases the cost, i.e., toward a single (locally) optimal setting.

With sampling: add some Gaussian noise to the weight vector after each update, so that it never settles down.

Markov Chain Monte Carlo:

If we use just the right amount of noise, and if we let the weight vector wander around for long enough before we take a sample, we get an unbiased sample from the true posterior over weight vectors.

There are more complicated and more effective methods than this basic MCMC method: they don't need to wander the space for as long.

If we compute the gradient of the cost function on a **random mini-batch**, we get an unbiased estimate of the true gradient with sampling noise; that noise can supply part of the noise the sampler needs.
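
A minimal sketch of this idea in the spirit of stochastic gradient Langevin dynamics (the one-parameter model, step size, noise scale, burn-in, and thinning are illustrative assumptions, not from these notes):

<code python>
import numpy as np

rng = np.random.default_rng(1)

# Same toy data/model as above: y = w*x with Gaussian noise.
x = np.array([0.5, 1.0, 1.5])
y = 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)
noise_var = 0.3 ** 2

def grad_neg_log_lik(w, xb, yb):
    # Gradient of the squared-error likelihood term for a mini-batch.
    return -np.sum((yb - w * xb) * xb) / noise_var

w, eps, samples = 0.0, 1e-3, []
for step in range(20000):
    i = rng.integers(len(x))              # random mini-batch of size 1
    # Prior gradient plus rescaled mini-batch likelihood gradient:
    # an unbiased (noisy) estimate of the full-data gradient.
    g = w + len(x) * grad_neg_log_lik(w, x[i:i+1], y[i:i+1])
    # Gradient step plus "just the right amount" of Gaussian noise.
    w += -eps * g + rng.normal(0.0, np.sqrt(2 * eps))
    if step > 5000 and step % 10 == 0:    # discard burn-in, then thin
        samples.append(w)

# Predictions on test data are averaged over the sampled weight vectors.
print(np.mean(samples))
</code>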

====== Dropout ======

See [[data_mining: