====== Belief nets ======

Problems with backprop:
  * Initialization is important.
  * Can get stuck in poor local optima (problematic for deep nets).

Overcoming the limitations of backprop by using unsupervised learning: keep the simplicity of the gradient descent method for adjusting the weights, but use it for modeling the structure of the sensory input. Adjust the weights to maximise the probability that a generative model would have generated the sensory input.

Learning objective: Maximise $p(x)$, not $p(y|x)$.

Belief net: Sparsely connected, directed acyclic graph composed of stochastic variables. Clever inference algorithms compute the probabilities of unobserved nodes (these won't work for densely connected nets).

Inference problem: Infer the states of the unobserved variables.

Learning problem: Adjust the interactions between variables to make the network more likely to generate the training data.

Graphical models vs. NNs
  * GM: Experts define the graph structure and the conditional probabilities. Focus on inference.
  * NNs: Learning from training data is central.

Generative NNs composed of **stochastic binary neurons**:
  * Energy-based: Symmetric connections => Boltzmann machine. If the connectivity is restricted, it is easier to learn (but only one layer is learned).
  * Causal model: Directed acyclic graph => sigmoid belief net.

===== Wake-sleep algorithm =====

Not to be confused with Boltzmann-machine learning.

Idea: Compute a cheap **approximation** of the posterior distribution (i.e. do wrong inference), then do maximum likelihood learning. The learning tries to manipulate the true posterior distribution so that it fits the approximation.

At each hidden layer, we assume that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit.

Individual probabilities of 3 hidden units in a layer: 0.3, 0.6, 0.8.

Probability that the hidden units have state 1,0,1 if the distribution is factorial: $p(1,0,1) = 0.3 \cdot (1 - 0.6) \cdot 0.8 = 0.096$

Algorithm: 2 sets of weights: W (generative), R (recognition)

Wake phase:
  * Use the recognition weights R to perform a bottom-up pass.
  * Train the generative weights to reconstruct the activities in each layer from the layer above.

Sleep phase:
  * Use the generative weights to generate samples from the model.
  * Train the recognition weights to reconstruct the activities in each layer from the layer below.

Problems: ...

Mode averaging:

===== Learning layers of features by stacking RBMs =====

An RBM can be learned fairly efficiently. Stacking RBMs can learn lots of features. By stacking you get an indefinitely deep belief net.

Combine 2 RBMs to make a DBN.

Train v <=W_1=> h_1 first, copy the binary state of h_1 for each v, then train h_1 <=W_2=> h_2.

==== Compose two RBM models ====

v <=W_1= h_1 <=W_2=> h_2

The bottom layer is unidirectional => not a Boltzmann machine. It's a deep belief net.

==== 3 layers ====

data <=W_1= h_1 <=W_2= h_2 <=W_3=> h_3

Generate data:
  * Get an equilibrium sample from the **top-level** RBM (h_2, h_3) by performing Gibbs sampling for a long time. This defines the prior distribution of h_2.
  * Perform a top-down pass from h_2 to get the states of the other layers.

=== Averaging factorial distributions ===

Averaging 2 factorial distributions does not give a factorial distribution => mixture distribution.

In an RBM the posterior over 4 hidden units is factorial for each visible vector:
  * Posterior for v1: 0.9, 0.9, 0.1, 0.1
  * Posterior for v2: 0.1, 0.1, 0.9, 0.9
  * Aggregated: 0.5, 0.5, 0.5, 0.5

Consider the binary vector (1,1,0,0):
  * Posterior for v1: p(1,1,0,0) = 0.9^4 = 0.656
  * Posterior for v2: p(1,1,0,0) = 0.1^4 = 0.0001
  * Aggregated posterior: p(1,1,0,0) = 0.328
  * Factorial would be p = 0.5^4 = 0.0625

==== Why does learning work? ====

The weights W of the bottom-level RBM define p(v|h), p(h|v), p(v,h), p(v) and p(h).

The RBM model can be expressed as $p(v) = \sum_h p(h) p(v|h)$

If we leave $p(v|h)$ alone but improve $p(h)$, we improve $p(v)$.

To improve $p(h)$, we need it to be a **better model than $p(h;W)$** of the **aggregated** posterior distribution over hidden vectors produced by applying W transposed to the data.

==== Contrastive version of wake-sleep algorithm ====

==== Discriminative fine-tuning for DBNs ====

  * First learn one layer at a time by stacking RBMs.
  * Treat this as pre-training that finds good initial weights, which can then be fine-tuned by a local search procedure.
  * Previously: Contrastive wake-sleep for fine-tuning the model to be better at generation.
  * Now: Use backprop to fine-tune the model to be better at discrimination.

Backprop works better with greedy pre-training:
  * Works well and scales to big networks, esp. when we have locality in each layer.
  * We do not start backpropagation until we have sensible feature detectors.
  * Initial gradients are sensible; backprop only needs to perform a local search from a sensible starting point.

Fine-tuning only modifies the features slightly to get the category boundaries right (it does not need to discover new features).

Objection: Many features are learned that are useless for a particular discrimination.

Example model (MNIST): Add a 10-way softmax at the top and do backprop.

More layers => lower error with pre-training.

Solutions are qualitatively different.

==== Model real-valued data with RBMs ====

Mean-field logistic units cannot represent precise intermediate values (e.g. pixel intensity in an image).

Model pixels as Gaussian variables. Alternating Gibbs sampling, with a lower learning rate.

Parabolic containment function (keeps visible unit i close to b_i). Energy gradient.

Stepped sigmoid units: Many copies of a stochastic binary unit. All copies have the same weights and bias b, but they have different fixed offsets to the bias (b-0.5, b-1.5, ...).

==== Structure ====

Autoencoder, then feed-forward NN
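
Numerical check for the factorial-posterior arithmetic used in the wake-sleep and "Averaging factorial distributions" sections above. This is only a minimal sketch in plain Python: the helper name ''factorial_prob'' and the variable names are made up for illustration and are not part of the notes. It assumes the two visible vectors v1 and v2 are equally likely, so the aggregated posterior is an equal-weight mixture.

```python
from math import prod


def factorial_prob(state, on_probs):
    """Probability of a binary state under a factorial distribution,
    i.e. each unit is on independently with its own probability."""
    return prod(p if s == 1 else 1.0 - p for s, p in zip(state, on_probs))


# Wake-sleep example: three hidden units with on-probabilities 0.3, 0.6, 0.8.
p_101 = factorial_prob((1, 0, 1), (0.3, 0.6, 0.8))

# Averaging example: factorial posteriors over 4 hidden units for v1 and v2.
post_v1 = (0.9, 0.9, 0.1, 0.1)
post_v2 = (0.1, 0.1, 0.9, 0.9)
state = (1, 1, 0, 0)

p_v1 = factorial_prob(state, post_v1)
p_v2 = factorial_prob(state, post_v2)

# Aggregated posterior: equal-weight mixture of the two posteriors (assumption).
p_mixture = 0.5 * (p_v1 + p_v2)

# Factorial distribution built from the averaged per-unit probabilities.
avg_probs = tuple(0.5 * (a + b) for a, b in zip(post_v1, post_v2))
p_factorial = factorial_prob(state, avg_probs)

print(f"p(1,0,1)               = {p_101:.4f}")
print(f"p(1,1,0,0 | v1)        = {p_v1:.4f}")
print(f"p(1,1,0,0 | v2)        = {p_v2:.4f}")
print(f"aggregated (mixture)   = {p_mixture:.4f}")
print(f"factorial with 0.5^4   = {p_factorial:.4f}")
```

The mixture assigns a much higher probability to (1,1,0,0) than the factorial distribution with the same per-unit averages, which is the point of the averaging example: the aggregated posterior is not factorial.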