

Belief nets

Problems with backprop:

  • Initialization important
  • Can get stuck in poor local optima (problematic for deep nets).

Overcome the limitations of backprop by using unsupervised learning.

Keep the simplicity of the gradient descent method for adjusting the weights, but use it for modeling the structure of the sensory input: adjust the weights to maximize the probability that a generative model would have generated the sensory input.

Learning objective: Maximise $p(x)$ not $p(y|x)$
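
Restated in symbols (notation assumed here: $W$ are the weights, $x_n, y_n$ the training cases):

$$\text{discriminative: } \max_W \sum_n \log p(y_n \mid x_n; W) \qquad \text{generative: } \max_W \sum_n \log p(x_n; W)$$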

Belief net: A sparsely connected, directed acyclic graph composed of stochastic variables. Clever inference algorithms can compute the probabilities of unobserved nodes (this won't work for densely connected nets).

Inference problem: Infer the states of unobserved variables. Learning problem: Adjust the interactions between variables to make the network more likely to generate the training data.

Graphical models vs. NNs

  • GM: Experts define the graph structure and the conditional probabilities. Focus is on inference.
  • NNs: Learning from training data is central.

Generative NNs composed of stochastic binary neurons:

  • Energy-based: Symmetric connections ⇒ Boltzmann machine. If the connectivity is restricted, it is easier to learn (but then only one layer of features is learned).
  • Causal model: Directed acyclic graph ⇒ Sigmoid belief net (a sampling sketch follows below).
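
A minimal sketch of the causal case: ancestral (top-down) sampling from a sigmoid belief net built from stochastic binary neurons. All names and the weight layout are illustrative assumptions, not a reference implementation.

<code python>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_sigmoid_belief_net(weights, biases, rng=None):
    """Ancestral (top-down) sampling from a sigmoid belief net.

    weights[l] has shape (units in layer l+1, units in layer l) and connects
    layer l+1 (above) to layer l (below); biases[-1] belongs to the top layer,
    which has no parents.
    """
    rng = rng or np.random.default_rng()
    # Sample the top layer of stochastic binary units from its biases alone.
    h = (rng.random(biases[-1].shape) < sigmoid(biases[-1])).astype(float)
    # Each lower layer is sampled given the binary states of the layer above (causal model).
    for W, b in zip(reversed(weights), reversed(biases[:-1])):
        p = sigmoid(h @ W + b)
        h = (rng.random(p.shape) < p).astype(float)
    return h  # binary sample at the visible layer
</code>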

The wake-sleep algorithm: not to be confused with Boltzmann machine learning.

Idea: Compute a cheap approximation of the posterior distribution (i.e. do wrong inference), then do maximum likelihood learning.

It tries to manipulate the real posterior distribution to fit the approximation.

At each hidden layer, we assume that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit.

Individual probabilities of 3 hidden units in a layer: 0.3, 0.6, 0.8

Probability that the hidden units have states 1, 0, 1 if the distribution is factorial: $p(1,0,1) = 0.3 \cdot (1-0.6) \cdot 0.8 = 0.096$
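
A quick check of that product (values taken from the example above):

<code python>
# Factorial posterior over 3 hidden units: p(h_i = 1) = 0.3, 0.6, 0.8
p_on = [0.3, 0.6, 0.8]
state = [1, 0, 1]
prob = 1.0
for p, s in zip(p_on, state):
    prob *= p if s == 1 else (1 - p)   # factor p for an "on" unit, 1-p for an "off" unit
print(prob)  # 0.3 * (1 - 0.6) * 0.8 = 0.096
</code>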

Algorithm:

2 sets of weights: W (generative) and R (recognition)

Wake phase:

  • Use the recognition weights R to perform a bottom-up pass. Train the generative weights to reconstruct the activities in each layer from the layer above.

Sleep phase:

  • Use the generative weights to generate samples from the model. Train the recognition weights to reconstruct the activities in each layer from the layer below (one update is sketched below).
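
A minimal sketch of one wake-sleep update for a net with a single hidden layer and no biases. The flat prior over the hidden units in the sleep phase and all names here are simplifying assumptions.

<code python>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    """Stochastic binary neurons: turn on with probability p."""
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v, W, R, lr=0.01, rng=None):
    """One wake-sleep update; R: recognition weights (visible -> hidden),
    W: generative weights (hidden -> visible). Both are updated in place."""
    rng = rng or np.random.default_rng()

    # Wake phase: bottom-up pass with R, then train W to reconstruct the data it saw.
    h = sample(sigmoid(v @ R), rng)
    v_rec = sigmoid(h @ W)                       # generative reconstruction of v from h
    W += lr * np.outer(h, v - v_rec)             # delta rule on the generative weights

    # Sleep phase: generate a fantasy with W, then train R to recover its hidden cause.
    h_fantasy = sample(0.5 * np.ones(R.shape[1]), rng)   # simplification: flat prior over h
    v_fantasy = sample(sigmoid(h_fantasy @ W), rng)
    h_rec = sigmoid(v_fantasy @ R)               # recognition prediction of the hidden cause
    R += lr * np.outer(v_fantasy, h_fantasy - h_rec)     # delta rule on the recognition weights
    return W, R
</code>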

Problems: …

Mode averaging:

RBMs can be learned fairly efficiently, and stacking RBMs can learn lots of features. By stacking you get an indefinitely deep belief net.

Combine 2 RBMs to make a DBN.

Train v ⇐W_1⇒ h_1; then copy the binary states of h_1 for each v and train h_1 ⇐W_2⇒ h_2.

v ⇐W_1= h_1 ⇐W_2⇒ h_2

The bottom connections become unidirectional (generative only) ⇒ it is not a Boltzmann machine, it's a deep belief net.

data ⇐W_1= h_1 ⇐W_2= h_2 ⇐W_3⇒ h_3
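
A sketch of this greedy layer-by-layer stacking, assuming a simple CD-1 trainer for each RBM. Biases are omitted, hidden probabilities rather than binary states are passed up for brevity, and the function names are illustrative.

<code python>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, lr=0.05, epochs=10, rng=None):
    """Train one RBM with one-step contrastive divergence (CD-1); biases omitted."""
    rng = rng or np.random.default_rng()
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(epochs):
        for v in data:
            p_h = sigmoid(v @ W)
            h = (rng.random(n_hidden) < p_h).astype(float)
            v_rec = sigmoid(h @ W.T)                          # one reconstruction step
            p_h_rec = sigmoid(v_rec @ W)
            W += lr * (np.outer(v, p_h) - np.outer(v_rec, p_h_rec))
    return W

def stack_rbms(data, layer_sizes):
    """Greedy layer-wise training: each new RBM models the hidden activities of the RBM below."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:                              # e.g. sizes of h_1, h_2, h_3
        W = train_rbm_cd1(layer_input, n_hidden)
        weights.append(W)
        layer_input = sigmoid(layer_input @ W)                # pass the data up to train the next layer
    return weights                                            # [W_1, W_2, W_3, ...]
</code>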

Generate data:

  • Get an equilibrium sample from the top-level RBM (h_2, h_3) by running Gibbs sampling for a long time. This defines the prior distribution over h_2.
  • Do a top-down pass from h_2 to get states for the other layers (sketched below).
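
Generation under the same assumed weight layout as the stacking sketch above: Gibbs sampling in the top-level RBM, then a single top-down pass through the directed connections.

<code python>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_from_dbn(weights, n_gibbs=1000, rng=None):
    """Generate one visible vector from a DBN given as a list [W_1, ..., W_k].

    The top two layers (weights[-1]) form the undirected RBM; the remaining
    weights are only used generatively (top-down)."""
    rng = rng or np.random.default_rng()
    W_top = weights[-1]
    h_below = (rng.random(W_top.shape[0]) < 0.5).astype(float)   # arbitrary starting state
    # Run Gibbs sampling "for a long time" to get an equilibrium sample from the top RBM.
    for _ in range(n_gibbs):
        h_above = (rng.random(W_top.shape[1]) < sigmoid(h_below @ W_top)).astype(float)
        h_below = (rng.random(W_top.shape[0]) < sigmoid(h_above @ W_top.T)).astype(float)
    # Single top-down pass through the directed layers to get states for the lower layers.
    state = h_below
    for W in reversed(weights[:-1]):
        state = (rng.random(W.shape[0]) < sigmoid(state @ W.T)).astype(float)
    return state   # sample at the data layer
</code>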

Averaging factorial distributions

By averaging 2 factorial distributions you don't get a factorial distribution ⇒ you get a mixture distribution.

In an RBM the posterior over the 4 hidden units is factorial for each visible vector:

  • Posterior for v1: 0.9, 0.9, 0.1, 0.1
  • Posterior for v2: 0.1, 0.1, 0.9, 0.9
  • Aggregated: 0.5, 0.5, 0.5, 0.5

Consider the binary vector (1,1,0,0):

  • Posterior for v1: $p(1,1,0,0) = 0.9 \cdot 0.9 \cdot (1-0.1) \cdot (1-0.1) = 0.9^4 \approx 0.66$
  • Posterior for v2: $p(1,1,0,0) = 0.1^4 = 0.0001$
  • Aggregated posterior: $p(1,1,0,0) = (0.66 + 0.0001)/2 \approx 0.33$
    • A factorial distribution with the aggregated marginals would give $p = 0.5^4 = 0.0625$ (checked in the snippet below).
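
A small check of these numbers, using the marginals stated above:

<code python>
# Factorial probability of the hidden state (1,1,0,0) under each posterior,
# and under the (non-factorial) average of the two.
def factorial_prob(p_on, state):
    prob = 1.0
    for p, s in zip(p_on, state):
        prob *= p if s == 1 else (1 - p)
    return prob

state = (1, 1, 0, 0)
p1 = factorial_prob([0.9, 0.9, 0.1, 0.1], state)    # ~0.66
p2 = factorial_prob([0.1, 0.1, 0.9, 0.9], state)    # 0.0001
print((p1 + p2) / 2)                                 # aggregated: ~0.33
print(factorial_prob([0.5, 0.5, 0.5, 0.5], state))   # factorial with averaged marginals: 0.0625
</code>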

The weights of the bottom-level RBM define $p(v|h)$, $p(h|v)$, $p(v,h)$, $p(v)$ and $p(h)$.

Can express RBM model as $p(v) = \sum_h p(h) p(v|h)$

If we leave $p(v|h)$ alone but improve $p(h)$, we improve $p(v)$. To improve $p(h)$, we need a better model than $p(h;W)$ of the aggregated posterior distribution over hidden vectors, which is produced by applying $W^{T}$ to the data.
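
Written out, with the aggregated posterior made explicit (the symbol $p_{\text{agg}}$ and the average over training vectors $v_n$ are notation assumed here):

$$p(v) = \sum_h p(h)\, p(v \mid h), \qquad p_{\text{agg}}(h) = \frac{1}{N}\sum_{n=1}^{N} p(h \mid v_n; W)$$

Keeping $p(v|h)$ fixed, any prior $p(h)$ that models $p_{\text{agg}}(h)$ better than the RBM's own $p(h;W)$ improves $p(v)$, which is what the next RBM in the stack is trained to do.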
