Belief nets
Problems with backprop:
- Initialization important
- Can get stuck in poor local optima (problematic for deep nets).
Overcoming the limitations of backprop by using unsupervised learning.
Keep the simplicity of the gradient descent method for adjusting the weights, but use it for modeling the structure of the sensory input: adjust the weights to maximise the probability that a generative model would have generated the sensory input.
Learning objective: Maximise $p(x)$ not $p(y|x)$
Belief net: Sparsely connected, directed acyclic graph composed of stochastic variables. Clever inference algorithms compute the probabilities of unobserved nodes (this does not work for densely connected nets).
Inference problem: Infer the states of the unobserved variables. Learning problem: Adjust the interactions between variables to make the network more likely to generate the training data.
Graphical models vs. NNs
- GM: Experts define the graph structure and the conditional probabilities. Focus is on inference.
- NNs: Learning from training data is central.
Generative NNs composed of stochastic binary neurons:
- Energy-based: Symmetric connections ⇒ Boltzmann machine. If the connectivity gets restricted (restricted Boltzmann machine), it is easier to learn (but then only one hidden layer is learned).
- Causal model: Directed acyclic graph ⇒ Sigmoid Belief net (see the sampling sketch after this list).
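A minimal sketch of what a generative net of stochastic binary neurons does in the causal case: ancestral, top-down sampling in a sigmoid belief net. Assumptions for illustration: numpy, made-up layer sizes and random weights, and a fixed 0.5 prior for the top layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary(p_on):
    """Stochastic binary neurons: each unit turns on with probability p_on."""
    return (rng.random(p_on.shape) < p_on).astype(float)

# Hypothetical 2-hidden-layer sigmoid belief net: h2 -> h1 -> visible v
n_h2, n_h1, n_v = 4, 6, 8
W21 = rng.normal(0, 0.1, (n_h2, n_h1))   # generative weights h2 -> h1
W10 = rng.normal(0, 0.1, (n_h1, n_v))    # generative weights h1 -> v
b1, b0 = np.zeros(n_h1), np.zeros(n_v)

# Ancestral (top-down) sampling along the directed acyclic graph
h2 = sample_binary(np.full(n_h2, 0.5))       # top layer from an assumed 0.5 prior
h1 = sample_binary(sigmoid(h2 @ W21 + b1))   # each unit depends only on its parents
v  = sample_binary(sigmoid(h1 @ W10 + b0))   # a "fantasy" data vector
```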
Wake-sleep algorithm
Not to be confused with Boltzmann-machine learning.
Idea: Compute a cheap approximation of the posterior distribution (i.e. do the wrong inference), then do maximum likelihood learning.
The learning then tries to manipulate the real posterior distribution to fit this approximation.
At each hidden layer, we assume that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit.
Example: the individual on-probabilities of 3 hidden units in a layer are 0.3, 0.6, 0.8.
Probability that the hidden units have state (1,0,1) if the distribution is factorial: $p(1,0,1) = 0.3 \cdot (1-0.6) \cdot 0.8 = 0.096$ (see the check below).
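A tiny check of the factorial assumption in plain Python, using the numbers above:

```python
from itertools import product
from math import prod

# On-probabilities of the three hidden units from the example above
p_on = [0.3, 0.6, 0.8]

def factorial_prob(state, p_on):
    """Joint probability of a binary configuration under a factorial distribution."""
    return prod(p if s == 1 else 1.0 - p for s, p in zip(state, p_on))

print(factorial_prob((1, 0, 1), p_on))             # 0.3 * 0.4 * 0.8 = 0.096
print(sum(factorial_prob(s, p_on)                  # the 8 configurations
          for s in product([0, 1], repeat=3)))     # sum to 1.0
```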
Algorithm:
Two sets of weights: generative weights W and recognition weights R.
Wake phase:
- Use the recognition weights R to perform a bottom-up pass. Train the generative weights W to reconstruct the activities in each layer from the layer above.
Sleep phase:
- Use the generative weights W to generate samples from the model. Train the recognition weights R to reconstruct the activities in each layer from the layer below (see the sketch below).
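A minimal single-hidden-layer sketch of the two phases. Assumptions for illustration: numpy, made-up layer sizes, learning rate, and bias-update details; a real net stacks several hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_v, n_h, lr = 8, 4, 0.05
W = rng.normal(0, 0.1, (n_h, n_v))   # generative weights: hidden -> visible
R = rng.normal(0, 0.1, (n_v, n_h))   # recognition weights: visible -> hidden
b_gen_h, b_gen_v, b_rec_h = np.zeros(n_h), np.zeros(n_v), np.zeros(n_h)

def wake_sleep_step(v_data):
    global W, R, b_gen_h, b_gen_v, b_rec_h
    # Wake phase: bottom-up pass with R on real data,
    # then train the generative weights to reconstruct that data.
    h = sample(sigmoid(v_data @ R + b_rec_h))
    p_v = sigmoid(h @ W + b_gen_v)              # top-down prediction of the data
    W += lr * np.outer(h, v_data - p_v)         # delta rule on generative weights
    b_gen_v += lr * (v_data - p_v)
    b_gen_h += lr * (h - sigmoid(b_gen_h))      # train the top-layer generative prior

    # Sleep phase: generate a fantasy with W, then train the
    # recognition weights to recover the fantasy's hidden cause.
    h_fant = sample(sigmoid(b_gen_h))
    v_fant = sample(sigmoid(h_fant @ W + b_gen_v))
    q_h = sigmoid(v_fant @ R + b_rec_h)         # bottom-up prediction of the cause
    R += lr * np.outer(v_fant, h_fant - q_h)    # delta rule on recognition weights
    b_rec_h += lr * (h_fant - q_h)

# Toy usage: repeatedly fit a single binary vector
v = rng.integers(0, 2, n_v).astype(float)
for _ in range(100):
    wake_sleep_step(v)
```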
Problems: …
Mode averaging: