Belief nets

Problems with backprop:

Overcoming the limitations of backprop by using unsupervised learning.

Keep the simplicity of the gradient descent method for adjusting the weights, but use it for modeling the structure of the sensory input: adjust the weights to maximize the probability that a generative model would have generated the sensory input.

Learning objective: Maximise $p(x)$ not $p(y|x)$

Belief net: Sparsely connected, directed acyclic graphs composed of stochastic variables. Clever inference algorithms to compute probs of unobserved nodes (won't work for dense nets).

Inference problem: Infer state of unobserved variables. Learning problem: Adjust interactions between variables to make network more likely to generate training data.

Graphical models vs. NNs

Generative NNs composed of stochastic binary neurons:

Wake-sleep algorithm

Not to be confused with Boltzmann machine learning.

Idea: Compute a cheap approximation of the posterior distribution (wrong inference). Then do maximum likelihood learning.

Learning tries to manipulate the real posterior distribution so that it fits the cheap approximation.

At each hidden layer, we assume that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit.

Individual probabilities of 3 hidden units in a layer: 0.3, 0.6, 0.8.

Probability that the hidden units have state 1,0,1 if the distribution is factorial: $p(1,0,1) = 0.3 \times (1 - 0.6) \times 0.8 = 0.096$
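The factorial calculation above can be sketched in a few lines of Python. The function name `factorial_prob` is a made-up helper for illustration; the three probabilities are the ones from the notes.

```python
from math import prod

def factorial_prob(state, probs):
    """Probability of a binary state under a factorial distribution.

    probs[i] is the probability that unit i is on (state 1);
    the units are assumed independent, so the terms simply multiply.
    """
    return prod(p if s == 1 else 1.0 - p for s, p in zip(state, probs))

probs = [0.3, 0.6, 0.8]                   # per-unit "on" probabilities from the notes
p_101 = factorial_prob((1, 0, 1), probs)  # 0.3 * (1 - 0.6) * 0.8
print(p_101)                              # approximately 0.096
```

Because the distribution is factorial, 3 numbers are enough to define the probabilities of all 8 hidden configurations.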

Algorithm:

2 sets of weights: W, R

Wake phase:

Sleep phase:
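A minimal sketch of the two phases, assuming a net with a single hidden layer, no visible or recognition biases, and a simple delta-rule update. The sizes, learning rate, toy data and all function names are hypothetical; W holds the generative weights and R the recognition weights as in the notes.

```python
import random
from math import exp

random.seed(0)
N_VIS, N_HID, LR = 4, 2, 0.05   # hypothetical sizes and learning rate

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def sample(p):
    """Draw a stochastic binary state that is 1 with probability p."""
    return 1 if random.random() < p else 0

def init_weight():
    return random.uniform(-0.1, 0.1)

W = [[init_weight() for _ in range(N_HID)] for _ in range(N_VIS)]  # generative: hidden -> visible
R = [[init_weight() for _ in range(N_VIS)] for _ in range(N_HID)]  # recognition: visible -> hidden
g_bias = [0.0] * N_HID                                             # generative biases of hidden units

def recognize(v):
    """Factorial approximate posterior q(h_j = 1 | v) computed with R."""
    return [sigmoid(sum(R[j][i] * v[i] for i in range(N_VIS))) for j in range(N_HID)]

def generate(h):
    """Generative probabilities p(v_i = 1 | h) computed with W."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(N_HID))) for i in range(N_VIS)]

def wake_step(v):
    # Bottom-up pass with R, then train the generative weights W
    # to reconstruct the data from the sampled hidden states.
    h = [sample(q) for q in recognize(v)]
    p_v = generate(h)
    for i in range(N_VIS):
        for j in range(N_HID):
            W[i][j] += LR * h[j] * (v[i] - p_v[i])
    for j in range(N_HID):
        g_bias[j] += LR * (h[j] - sigmoid(g_bias[j]))

def sleep_step():
    # Top-down pass with W produces a "fantasy"; then train the recognition
    # weights R to recover the hidden states that generated it.
    h = [sample(sigmoid(b)) for b in g_bias]
    v = [sample(p) for p in generate(h)]
    q_h = recognize(v)
    for j in range(N_HID):
        for i in range(N_VIS):
            R[j][i] += LR * v[i] * (h[j] - q_h[j])

data = [[1, 1, 0, 0], [0, 0, 1, 1]]   # toy training vectors (made up)
for _ in range(2000):
    wake_step(random.choice(data))
    sleep_step()
```

Each phase trains one set of weights while the other set is used only to produce the hidden states, which is why both updates are purely local.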

Problems: …

Mode averaging:

Learning layers of features by stacking RBMs

An RBM can be learned fairly efficiently. Stacking RBMs can learn many layers of features. By stacking you get an indefinitely deep belief net.

Combine 2 RBMs to make a DBN.

Train the first RBM: v ⇐W_1⇒ h_1. Copy the binary state of h_1 for each v and use it as data to train the second RBM: h_1 ⇐W_2⇒ h_2.

Compose two RBM models

v ⇐W_1= h_1 ⇐W_2⇒ h_2

The connections in the bottom layer are unidirectional (top-down only). ⇒ Not a Boltzmann machine. It's a deep belief net.

3 layers

data ⇐W_1= h_1 ⇐W_2= h_2 ⇐W_3⇒ h_3

Generate data:

Averaging factorial distributions

Averaging 2 factorial distributions does not give a factorial distribution. ⇒ It gives a mixture distribution.

In RBM the posterior over 4 hidden units is factorial for each visible vector

Consider binary vec (1,1,0,0)
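A small numerical sketch of why the average is not factorial. The two posteriors below are assumed values chosen for illustration (they are not given in these notes); `factorial_prob` is a hypothetical helper.

```python
from math import prod

def factorial_prob(state, probs):
    """Probability of a binary state when units are independent with 'on' probabilities probs."""
    return prod(p if s == 1 else 1.0 - p for s, p in zip(state, probs))

# Assumed factorial posteriors over 4 hidden units for two different visible vectors
post_a = [0.9, 0.9, 0.1, 0.1]
post_b = [0.1, 0.1, 0.9, 0.9]
state = (1, 1, 0, 0)

# Aggregated posterior: equal-weight mixture of the two factorial distributions
p_mixture = 0.5 * factorial_prob(state, post_a) + 0.5 * factorial_prob(state, post_b)

# Factorial distribution that has the same per-unit means as the mixture (0.5 each)
mean_probs = [(a + b) / 2 for a, b in zip(post_a, post_b)]
p_factorial = factorial_prob(state, mean_probs)

print(p_mixture)    # approximately 0.3281
print(p_factorial)  # 0.0625
```

Under these assumed numbers the mixture gives (1,1,0,0) a much higher probability than the factorial distribution with the same means, so the aggregated posterior cannot be factorial.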

Why does learning work?

The weights W of the bottom-level RBM define $p(v|h)$, $p(h|v)$, $p(v,h)$, $p(v)$ and $p(h)$.

Can express RBM model as $p(v) = \sum_h p(h) p(v|h)$
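The decomposition can be checked by brute force on a tiny RBM. This is only a sketch: the 2x2 weights and biases are arbitrary made-up values, and the conditional $p(v|h)$ is computed with the usual logistic formula rather than read off the joint.

```python
from itertools import product
from math import exp

# Hypothetical tiny RBM: 2 visible and 2 hidden units, arbitrary parameters
W = [[0.5, -1.0],
     [1.5, 0.3]]     # W[i][j] connects visible unit i and hidden unit j
a = [0.1, -0.2]      # visible biases
b = [0.0, 0.4]       # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def energy(v, h):
    e = -sum(a[i] * v[i] for i in range(2)) - sum(b[j] * h[j] for j in range(2))
    e -= sum(v[i] * W[i][j] * h[j] for i in range(2) for j in range(2))
    return e

def p_v_given_h(v, h):
    """Factorial conditional: p(v_i = 1 | h) = sigmoid(a_i + sum_j W_ij h_j)."""
    p = 1.0
    for i in range(2):
        on = sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(2)))
        p *= on if v[i] == 1 else 1.0 - on
    return p

states = list(product([0, 1], repeat=2))
Z = sum(exp(-energy(v, h)) for v in states for h in states)
joint = {(v, h): exp(-energy(v, h)) / Z for v in states for h in states}

p_v = {v: sum(joint[(v, h)] for h in states) for v in states}  # marginal over hidden states
p_h = {h: sum(joint[(v, h)] for v in states) for h in states}  # prior over hidden vectors

# p(v) recomposed as sum_h p(h) p(v|h)
recomposed = {v: sum(p_h[h] * p_v_given_h(v, h) for h in states) for v in states}
for v in states:
    print(v, p_v[v], recomposed[v])
```

The two columns agree, which is what makes it possible to keep $p(v|h)$ fixed and swap in a better model of $p(h)$.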

If we leave $p(v|h)$ alone but improve $p(h)$, we improve $p(v)$. To improve $p(h)$, we need a model that is better than $p(h;W)$ at modeling the aggregated posterior distribution over hidden vectors produced by applying W transposed to the data.

Contrastive version of wake-sleep algorithm

Discriminative fine-tuning for DBNs

Backprop works better with greedy pre-training:

* Works well and scales to big networks, especially when we have locality in each layer.
* We do not start backpropagation until we have sensible feature detectors.

Fine-tuning only modifies features slightly to get category boundaries right (does not need to discover new features).

Objection: Many features are learned that are useless for a particular discrimination.

Example model (MNIST): Add 10-way softmax at the top and do backprop.

More layers ⇒ lower error with pretraining.

Solutions are qualitatively different.

Modeling real-valued data with RBMs

Mean-field logistic units cannot represent precise intermediate values (e.g. pixel intensity in an image).

Model pixels as Gaussian variables. Alternating Gibbs sampling, with lower learning rate.

Parabolic containment function. (keep visible unit close to b_i). Energy-gradient.
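For reference, the energy function usually written for an RBM with Gaussian visible units and binary hidden units makes both terms explicit. The symbols $\sigma_i$ (standard deviation of visible unit $i$), $b_j$ (hidden bias) and $w_{ij}$ are assumed notation that does not appear in these notes:

$$E(v,h) = \sum_{i \in \text{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hid}} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}$$

The first sum is the parabolic containment function that keeps $v_i$ close to $b_i$; the last sum is the term through which the hidden units contribute a linear energy gradient that shifts the mean of each visible unit.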

Stepped sigmoid units. Many copies of a stochastic binary unit. All copies have the same weights and bias, b, but they have different fixed offsets to the bias (b-0.5, b-1.5, …).
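A short sketch of what the offsets buy: the expected number of active copies grows smoothly with the total input. Here `x` stands for the total input including the shared bias b, the number of copies (100) is an arbitrary choice, and the comparison with the softplus function $\log(1 + e^x)$ is a known approximation that is not stated in these notes.

```python
from math import exp, log

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def stepped_sigmoid_mean(x, n_copies=100):
    """Expected number of active copies; copy n sees its input shifted by -(n - 0.5)."""
    return sum(sigmoid(x - (n - 0.5)) for n in range(1, n_copies + 1))

def softplus(x):
    return log(1.0 + exp(x))

for x in (-2.0, 0.0, 3.0, 5.0):
    # The two values stay close to each other over this range
    print(x, stepped_sigmoid_mean(x), softplus(x))
```

Unlike a single logistic unit, the summed activity is not capped at 1, so it can represent a wide range of intensities.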

Structure

Autoencoder, then feed forward NN