# Combine models

Bias-Variance trade-off

Averaging models helps most when the models make very **different predictions**.

In regression, the error can be decomposed into a bias term and a variance term.

- Bias is large if the model has too little capacity to fit the data.
- Variance is large if the model has so much capacity that it also fits the sampling error in each particular training set.

Averaging ⇒ averages away the variance, while the bias stays low.
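A minimal sketch (toy numbers, not from the notes) of why this works: for unbiased predictors with independent errors, averaging $N$ of them divides the variance by roughly $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
t = 1.0  # true target value
# 10 unbiased predictors whose errors are independent unit-variance noise
preds = t + rng.normal(0.0, 1.0, size=(10_000, 10))

single_mse = np.mean((preds[:, 0] - t) ** 2)      # error of one model, ~1.0
avg_mse = np.mean((preds.mean(axis=1) - t) ** 2)  # error of the average, ~0.1
```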

⇒ Try to create individual predictors that disagree

Options:

- Different kinds of models (decision trees, SVMs, Gaussian process models)
- NN models: different numbers of hidden layers, different numbers of units per layer, types of units, types or strengths of weight penalty, learning algorithms
- Train models on different subsets of the data:
  - Bagging: sampling with replacement, as in random forests. Very expensive for NNs.
  - Boosting: train a sequence of low-capacity models, weighting the training cases differently for each model in the sequence (boost the weights of cases that previous models got wrong). This focuses on modeling the tricky cases.
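A sketch of bagging under toy assumptions (the data, the polynomial base learner, and the function names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# toy regression data (assumed for illustration)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + rng.normal(0, 0.3, size=x.shape)

def bagged_predict(x_train, y_train, x_test, n_models=25, degree=5):
    """Bagging: fit each model on a bootstrap sample (drawn with
    replacement) and average the predictions."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x_train), size=len(x_train))  # sample with replacement
        coeffs = np.polyfit(x_train[idx], y_train[idx], degree)
        preds.append(np.polyval(coeffs, x_test))
    return np.mean(preds, axis=0)

y_hat = bagged_predict(x, y, x)
```

Each bootstrap model sees a different resample of the data, so the models disagree, and averaging reduces their variance.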

# Mixtures of Experts

Suitable for large data sets.

Look at the input data to decide which model to rely on. ⇒ Specialization

Spectrum of models:

- Local models, like nearest neighbors
- Global models, like polynomial regression (each parameter depends on all the data)

Use models of intermediate complexity. How do we fit the data in different regimes?

Cluster the training cases into subsets, one for each local model. Each cluster should have a relationship between input and output that can be well modeled by one local model.

Error function that encourages cooperation

* Compare the **average** of all the predictors with the target and train to **reduce the discrepancy**.

$E = (t - \langle y_i \rangle_i)^2$

Error function that encourages specialization

Compare each predictor separately with the target. A manager (gating network) determines the probability of picking each expert.

$E = \left\langle p_i \, (t - y_i)^2 \right\rangle_i$
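A sketch of both error functions with made-up numbers ($p_i$ from the manager, $y_i$ from the experts):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # manager's probability of picking each expert
y = np.array([1.1, 0.3, 2.0])   # expert predictions
t = 1.0                          # target

# Cooperation: compare the *average* prediction with the target
E_coop = (t - np.mean(y)) ** 2

# Specialization: compare each expert separately, weighted by p_i
E_spec = np.sum(p * (t - y) ** 2)

# dE_spec/dy_i = -2 * p_i * (t - y_i): low-probability experts get
# small gradients, so they are free to specialize on other cases
grad = -2 * p * (t - y)
```

The gradient shows why this error encourages specialization: an expert the manager rarely picks barely gets trained on that case.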

# Full Bayesian Learning

Instead of finding the best single setting of the parameters (as in maximum likelihood / MAP), compute the **full posterior distribution** over **all possible parameter settings**.

Advantage: complicated models can be used even when only little data is available.

If you don't have much data, you should use a simple model

- True, but only if you assume that fitting a model means choosing a **single best setting** of the **parameters** (ML learning: the $w$ that maximizes $P(\text{Data}|w)$).

If you use the full posterior distribution, this kind of overfitting disappears: you get appropriately vague predictions, because many different parameter settings have significant posterior probability (learn $P(w|\text{Data})$).

Example: fit many different polynomials (a distribution over models) and average their predictions.

# Approximating full Bayesian learning in a NN

- NN with few parameters: put a grid over parameter space and evaluate $p(W|D)$ at each grid point. (Expensive, but no local-optimum issues.)
- After evaluating each grid point, use all of them to make predictions on test data.
- Expensive, but works much better than ML learning when the posterior is vague or multimodal (data is scarce).
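A toy sketch of the grid approach for a one-parameter model $y = w x$ (data, noise level, and prior are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# toy data from y = 0.8*x + noise (assumed for illustration)
x = rng.uniform(-1, 1, 5)
y = 0.8 * x + rng.normal(0, 0.2, size=x.shape)

w_grid = np.linspace(-2, 2, 401)    # grid over the single parameter
log_prior = -0.5 * w_grid ** 2      # standard Gaussian prior, up to a constant
# Gaussian likelihood with sigma = 0.2, up to a constant
log_lik = np.array([-0.5 * np.sum((y - w * x) ** 2) / 0.2**2 for w in w_grid])
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                  # normalized p(w | D) on the grid

x_test = 0.5
# predictive mean: every grid point contributes its prediction,
# weighted by its posterior probability
pred = np.sum(post * (w_grid * x_test))
```

This is exactly the sum in the Monte Carlo formula below, except that the $W_i$ are grid points rather than samples.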

Monte Carlo method

Idea: it might be good enough to sample weight vectors according to their posterior probabilities.

$p(y_{\text{test}} | \text{input}_\text{test}, D) = \sum_i p(W_i|D) p(y_{\text{test}} | \text{input}_\text{test}, W_i)$

Sample weight vectors with probability $p(W_i|D)$.

In backpropagation, we keep moving the weights in the direction that decreases the cost.

With sampling: add some Gaussian noise to the weight vector after each update.
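A sketch of such a noisy update on a toy quadratic cost (learning rate, noise level, and burn-in length are illustrative assumptions, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_update(w, grad, lr=0.01, noise_std=0.1):
    """One 'sampling' step: a normal gradient step plus Gaussian noise.
    With the right noise level, the weights wander over the posterior
    instead of settling into a single optimum."""
    return w - lr * grad + rng.normal(0.0, noise_std, size=w.shape)

w = np.zeros(3)
samples = []
for step in range(1000):
    grad = 2 * (w - 1.0)      # gradient of a toy quadratic cost (assumed)
    w = noisy_update(w, grad)
    if step >= 500:           # let the chain wander before collecting samples
        samples.append(w.copy())
samples = np.array(samples)   # collected weight vectors, centered near 1.0
```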

Markov Chain Monte Carlo:

If we use just the right amount of noise, and if we let the weight vector wander around for long enough before we take a sample, we get an unbiased sample from the true posterior over weight vectors.

There are more complicated and more effective methods than this MCMC method: they don't need to wander the space as long.

If we compute the gradient of the cost function on a **random mini-batch**, we get an unbiased estimate of the full gradient, with sampling noise.
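A quick check of that unbiasedness claim on toy linear-regression data (data and batch size are assumed for illustration): averaging many random mini-batch gradients recovers the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))
t = X @ np.array([1.0, -2.0]) + rng.normal(0, 0.1, size=1000)
w = np.zeros(2)

def grad(Xb, tb, w):
    """Gradient of the mean squared error on a batch."""
    return 2 * Xb.T @ (Xb @ w - tb) / len(tb)

full = grad(X, t, w)              # full-batch gradient

mb_grads = []
for _ in range(2000):
    idx = rng.integers(0, 1000, size=32)   # random mini-batch indices
    mb_grads.append(grad(X[idx], t[idx], w))
mb = np.mean(mb_grads, axis=0)    # average of many noisy estimates ≈ full
```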

# Dropout

See Regularization