data_mining:neural_network:belief_nets

  
Wake phase:
  * Use the recognition weights R to perform a bottom-up pass. Train the generative weights to reconstruct the activities in each layer from the layer above.

Sleep phase:
  * Use the generative weights to generate samples from the model. Train the recognition weights to reconstruct the activities in each layer from the layer below.
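A minimal sketch of one wake-sleep update for a single hidden layer (toy sizes, biases omitted, top-level prior fixed at 0.5 — all hypothetical choices, not the course's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    """Sample binary states from Bernoulli probabilities p."""
    return (rng.random(p.shape) < p).astype(float)

n_v, n_h, lr = 6, 4, 0.1
R = rng.normal(0, 0.1, (n_v, n_h))  # recognition weights (bottom-up)
G = rng.normal(0, 0.1, (n_h, n_v))  # generative weights (top-down)

def wake_phase(v):
    """Bottom-up pass with R; train G to reconstruct v from h (delta rule)."""
    global G
    h = sample(sigmoid(v @ R))
    v_recon = sigmoid(h @ G)
    G += lr * np.outer(h, v - v_recon)

def sleep_phase():
    """Generate a fantasy with G; train R to reconstruct h from v."""
    global R
    h = sample(np.full(n_h, 0.5))     # hypothetical flat top-level prior
    v = sample(sigmoid(h @ G))
    h_recon = sigmoid(v @ R)
    R += lr * np.outer(v, h - h_recon)

v = sample(np.full(n_v, 0.5))         # toy binary data vector
wake_phase(v)
sleep_phase()
```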

Problems: ...

Mode averaging:
 + 
===== Learning layers of features by stacking RBMs =====

RBMs can be learned fairly efficiently, and stacking RBMs can learn many layers of features. By stacking you get an indefinitely deep belief net.

Combine 2 RBMs to make a DBN:

v <=W_1=> h_1; copy the binary state of h_1 for each v and train: h_1 <=W_2=> h_2
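The greedy stacking procedure can be sketched as: train the first RBM on the data with CD-1, then use its binary hidden states as data for the second RBM (toy sizes, biases omitted — an assumed minimal setup, not a full implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def cd1_step(W, v0, lr=0.05):
    """One contrastive-divergence (CD-1) update of RBM weights W."""
    h0 = sample(sigmoid(v0 @ W))       # positive phase
    v1 = sigmoid(h0 @ W.T)             # mean-field reconstruction
    h1 = sigmoid(v1 @ W)               # negative phase
    return W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)

data = sample(np.full((20, 8), 0.5))   # toy binary data, 20 cases
W1 = rng.normal(0, 0.1, (8, 5))
for _ in range(10):
    W1 = cd1_step(W1, data)

# Copy the binary hidden states of RBM 1 and use them as data for RBM 2.
h1_data = sample(sigmoid(data @ W1))
W2 = rng.normal(0, 0.1, (5, 3))
for _ in range(10):
    W2 = cd1_step(W2, h1_data)
```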

==== Compose two RBM models ====

v <=W_1= h_1 <=W_2=> h_2

The bottom layer is now unidirectional (directed, top-down) => no longer a Boltzmann machine. It's a deep belief net.

==== 3 layers ====

data <=W_1= h_1 <=W_2= h_2 <=W_3=> h_3

Generate data:
  * Get an equilibrium sample from the **top-level** RBM (h_2, h_3) by running Gibbs sampling for a long time. This RBM defines the prior distribution over h_2.
  * Do a top-down pass from h_2 to get states for the other layers.
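These two generation steps can be sketched as follows (random untrained weights and toy sizes — assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_v, n1, n2, n3 = 8, 6, 4, 4
W1 = rng.normal(0, 0.1, (n_v, n1))  # directed: h_1 -> data
W2 = rng.normal(0, 0.1, (n1, n2))   # directed: h_2 -> h_1
W3 = rng.normal(0, 0.1, (n2, n3))   # top-level undirected RBM

# 1) Alternating Gibbs sampling in the top RBM (h_2 <-> h_3)
#    to approach an equilibrium sample of h_2.
h2 = sample(np.full(n2, 0.5))
for _ in range(200):
    h3 = sample(sigmoid(h2 @ W3))
    h2 = sample(sigmoid(h3 @ W3.T))

# 2) One top-down (generative) pass through the directed layers.
h1 = sample(sigmoid(h2 @ W2.T))
v = sigmoid(h1 @ W1.T)              # pixel probabilities
```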

=== Averaging factorial distributions ===

Averaging 2 factorial distributions does not give a factorial distribution => you get a mixture distribution.

In an RBM the posterior over 4 hidden units is factorial for each visible vector:
  * Posterior for v1: (0.9, 0.9, 0.1, 0.1)
  * Posterior for v2: (0.1, 0.1, 0.9, 0.9)
  * Averaged: (0.5, 0.5, 0.5, 0.5)

Consider the binary vector (1,1,0,0):
  * Under the posterior for v1: p(1,1,0,0) = 0.9^4 ≈ 0.656
  * Under the posterior for v2: p(1,1,0,0) = 0.1^4 = 0.0001
  * Under the aggregated (mixture) posterior: p(1,1,0,0) ≈ 0.328
      * The factorial distribution with means (0.5, 0.5, 0.5, 0.5) would instead give p = 0.5^4 = 0.0625
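The arithmetic can be checked directly; the mixture assigns (1,1,0,0) far more probability than the factorial distribution with the averaged means:

```python
import numpy as np

p1 = np.array([0.9, 0.9, 0.1, 0.1])  # factorial posterior for v1
p2 = np.array([0.1, 0.1, 0.9, 0.9])  # factorial posterior for v2

def prob(p, x):
    """Probability of binary vector x under a factorial distribution with means p."""
    x = np.asarray(x)
    return float(np.prod(np.where(x == 1, p, 1 - p)))

x = (1, 1, 0, 0)
mix = 0.5 * prob(p1, x) + 0.5 * prob(p2, x)   # mixture: ~0.328
fact = prob(0.5 * (p1 + p2), x)               # factorial of averaged means: 0.0625
```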

==== Why does learning work? ====

The weights of the bottom-level RBM define all of:
p(v|h); p(h|v); p(v,h); p(v); p(h)

We can express the RBM model as $p(v) = \sum_h p(h) p(v|h)$.

If we leave $p(v|h)$ alone but improve $p(h)$, we improve $p(v)$. To improve $p(h)$, we need it to be a **better model than $p(h;W)$** of the **aggregated** posterior distribution over hidden vectors, produced by applying W transposed to the data.

==== Contrastive version of wake-sleep algorithm ====

==== Discriminative fine-tuning for DBNs ====

  * First learn one layer at a time by stacking RBMs.
  * Use this pre-training (found initial weights), which can then be fine-tuned by a local search procedure.

  * Previously: contrastive wake-sleep fine-tuned the model to be better at generation.
  * Now: use backprop to fine-tune the model to be better at discrimination.

Backprop works better with greedy pre-training:
  * Works well and scales to big networks, especially when we have locality in each layer.
  * We do not start backpropagation until we have sensible feature detectors.
    * Initial gradients are sensible, so backprop only needs to perform a local search from a sensible starting point.

Fine-tuning only modifies the features slightly to get the category boundaries right (it does not need to discover new features).

Objection: many features are learned that are useless for a particular discrimination.

Example model (MNIST): add a 10-way softmax at the top and do backprop.
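One discriminative fine-tuning step might look like this (random stand-in for the pre-trained weights, toy batch — sizes and names are assumptions, not the course's code):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a pre-trained feature layer (would come from stacked RBMs).
W_pre = rng.normal(0, 0.1, (784, 100))
W_soft = np.zeros((100, 10))          # new 10-way softmax layer on top

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def finetune_step(x, y_onehot, lr=0.1):
    """One backprop step: trains the softmax and slightly nudges the features."""
    global W_pre, W_soft
    h = 1.0 / (1.0 + np.exp(-(x @ W_pre)))     # pre-trained features
    p = softmax(h @ W_soft)
    d_out = p - y_onehot                        # cross-entropy gradient
    d_h = (d_out @ W_soft.T) * h * (1 - h)      # backprop through the sigmoid
    W_soft -= lr * h.T @ d_out / len(x)
    W_pre -= lr * x.T @ d_h / len(x)            # fine-tuning, not re-learning

x = rng.random((32, 784))                       # toy "MNIST" batch
y = np.eye(10)[rng.integers(0, 10, 32)]
finetune_step(x, y)
```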

More layers => lower error with pre-training.

The solutions found are qualitatively different.

==== Model real-valued data with RBMs ====

Mean-field logistic units cannot represent precise intermediate values (e.g. pixel intensities in an image).

Model the pixels as Gaussian variables. Use alternating Gibbs sampling, with a lower learning rate.

Parabolic containment function (keeps a visible unit close to its bias b_i); the hidden units contribute an energy gradient that shifts the mean.
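The energy function for a Gaussian-binary RBM, as given in Hinton's lectures, combines the parabolic containment term with the hidden units' linear contribution:

$E(v,h) = \sum_{i \in vis} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j \in hid} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}$

The first term pulls each visible unit toward its bias $b_i$; the last term is the energy gradient that shifts the effective mean depending on the active hidden units.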

Stepped sigmoid units: many copies of a stochastic binary unit. All copies have the same weights and bias b, but each has a different fixed offset to the bias (b-0.5, b-1.5, ...).
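The expected total activity of such a stack of offset copies is well approximated by a softplus, which is itself close to a rectified linear unit:

$\sum_{n=1}^{\infty} \sigma(x - n + 0.5) \approx \log(1 + e^x) \approx \max(0, x)$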

==== Structure ====

Autoencoder, then a feed-forward NN
  • data_mining/neural_network/belief_nets.1492880816.txt.gz
  • Last modified: 2017/04/22 17:06
  • by phreazer