===== Learning layers of features by stacking RBMs =====

RBMs can be trained fairly efficiently, and stacking them lets us learn many layers of features. By stacking you get an arbitrarily deep belief net.

Combine 2 RBMs to make a DBN:

v <=W_1=> h_1; then copy the binary state of h_1 for each v and train a second RBM on it: h_1 <=W_2=> h_2.
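A minimal numpy sketch of this greedy stacking (CD-1 training; the layer sizes, learning rate and placeholder data are illustrative assumptions, not from the notes):

<code python>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Sample binary states from Bernoulli probabilities
    return (np.random.rand(*p.shape) < p).astype(float)

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train one RBM with 1-step contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * np.random.randn(n_visible, n_hidden)
    b_v = np.zeros(n_visible)   # visible biases
    b_h = np.zeros(n_hidden)    # hidden biases
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_h)      # positive phase: p(h|v) on the data
        h0 = sample(p_h0)
        p_v1 = sigmoid(h0 @ W.T + b_v)    # one step of alternating Gibbs sampling
        p_h1 = sigmoid(p_v1 @ W + b_h)
        # CD-1 update: <v h>_data - <v h>_reconstruction
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        b_v += lr * (v0 - p_v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h

# Greedy stacking: train RBM 1 on the data, then train RBM 2 on the
# hidden activities of RBM 1 ("copy binary state for each v").
data = (np.random.rand(100, 784) > 0.5).astype(float)   # placeholder binary data
W1, bv1, bh1 = train_rbm(data, n_hidden=500)
h1 = sample(sigmoid(data @ W1 + bh1))
W2, bv2, bh2 = train_rbm(h1, n_hidden=250)
</code>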

==== Compose two RBM models ====

v <=W_1= h_1 <=W_2=> h_2

The bottom connection becomes unidirectional (top-down generative weights) => this is no longer a Boltzmann machine; it is a deep belief net.

==== 3 layers ====

data <=W_1= h_1 <=W_2= h_2 <=W_3=> h_3

Generate data (see the sketch below):
  * Draw an equilibrium sample from the **top-level** RBM (h_2, h_3) by running Gibbs sampling for a long time. This defines the prior distribution over h_2.
  * Do a top-down pass from h_2 to get states for the other layers.
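A sketch of that generative procedure, reusing the hypothetical sigmoid/sample helpers from above; W3 and its biases (bv3, bh3) stand for a separately trained top-level RBM:

<code python>
def generate_from_dbn(W1, bv1, W2, bv2, W3, bv3, bh3, n_gibbs=1000):
    """Generate one visible vector from a DBN with hidden layers h_1, h_2, h_3."""
    # 1) Equilibrium sample from the top-level RBM (h_2 <=W_3=> h_3):
    #    run alternating Gibbs sampling for a long time.
    h2 = (np.random.rand(W3.shape[0]) > 0.5).astype(float)
    for _ in range(n_gibbs):
        h3 = sample(sigmoid(h2 @ W3 + bh3))
        h2 = sample(sigmoid(h3 @ W3.T + bv3))
    # 2) Single top-down pass through the directed connections.
    h1 = sample(sigmoid(h2 @ W2.T + bv2))
    v = sigmoid(h1 @ W1.T + bv1)   # mean values of the visible units
    return v
</code>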

=== Averaging factorial distributions ===

Averaging two factorial distributions does not give a factorial distribution; it gives a mixture distribution.

In an RBM the posterior over 4 hidden units is factorial for each visible vector:
  * Posterior for v1: 0.9, 0.9, 0.1, 0.1
  * Posterior for v2: 0.1, 0.1, 0.9, 0.9
  * Aggregated: 0.5, 0.5, 0.5, 0.5

Consider the binary vector (1,1,0,0) (checked numerically below):
  * Posterior for v1: p(1,1,0,0) = 0.9^4 ≈ 0.66
  * Posterior for v2: p(1,1,0,0) = 0.1^4 = 0.0001
  * Aggregated posterior: p(1,1,0,0) ≈ 0.33
    * A factorial distribution with marginals 0.5 would give p = 0.5^4 = 0.0625, so the aggregated posterior is not factorial.
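Checking the arithmetic in a short snippet:

<code python>
import numpy as np

post_v1 = np.array([0.9, 0.9, 0.1, 0.1])   # p(h_j = 1 | v1)
post_v2 = np.array([0.1, 0.1, 0.9, 0.9])   # p(h_j = 1 | v2)
h = np.array([1, 1, 0, 0])

def prob(p, h):
    # Probability of binary vector h under a factorial distribution with marginals p
    return np.prod(np.where(h == 1, p, 1 - p))

p1 = prob(post_v1, h)              # 0.9**4 ≈ 0.656
p2 = prob(post_v2, h)              # 0.1**4 = 0.0001
p_mix = 0.5 * (p1 + p2)            # ≈ 0.328
p_fact = prob(np.full(4, 0.5), h)  # 0.5**4 = 0.0625, so the mixture is not factorial
</code>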

==== Why does learning work? ====

The weights W of the bottom-level RBM define p(v|h), p(h|v), p(v,h), p(v) and p(h).

The RBM model can be expressed as $p(v) = \sum_h p(h) p(v|h)$.

If we leave $p(v|h)$ alone but improve $p(h)$, we improve $p(v)$. To improve $p(h)$, the higher-level RBM must be a **better model than $p(h;W)$** of the **aggregated** posterior distribution over hidden vectors produced by applying $W^T$ to the data.
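The standard variational-bound justification (assumed here, it is not spelled out in the notes): for any distribution $q(h|v)$ over hidden vectors,

$\log p(v) \geq \sum_h q(h|v) \big( \log p(h) + \log p(v|h) \big) - \sum_h q(h|v) \log q(h|v)$

With $q(h|v)$ fixed to the factorial posterior given by $W^T$ and $p(v|h)$ frozen, replacing $p(h;W)$ by a higher-level RBM that models the aggregated posterior better increases this bound on $\log p(v)$.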

==== Contrastive version of wake-sleep algorithm ====

==== Discriminative fine-tuning for DBNs ====

  * First learn one layer at a time by stacking RBMs.
  * Use this pre-training (the found initial weights), which can then be fine-tuned by a local search procedure.

  * Previously: contrastive wake-sleep fine-tunes the model to be better at generation.
  * Now: use backprop to fine-tune the model to be better at discrimination.

Backprop works better with greedy pre-training:
  * It works well and scales to big networks, especially when we have locality in each layer.
  * We do not start backpropagation until we have sensible feature detectors.
    * The initial gradients are sensible, so backprop only needs to perform a local search from a sensible starting point.

Fine-tuning only modifies the features slightly to get the category boundaries right (it does not need to discover new features).

Objection: many of the learned features are useless for a particular discrimination task.

Example model (MNIST): add a 10-way softmax at the top and do backprop (see the sketch below).
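A minimal sketch of such fine-tuning, reusing the W1/W2 stack from the earlier snippet and adding a hypothetical 10-way softmax output layer:

<code python>
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# New, randomly initialized output layer on top of the pre-trained stack.
W_out = 0.01 * np.random.randn(W2.shape[1], 10)
b_out = np.zeros(10)

def fine_tune_step(x, y_onehot, lr=0.01):
    """One backprop step with cross-entropy loss; all layers are updated."""
    global W1, bh1, W2, bh2, W_out, b_out
    # Forward pass: the RBM recognition weights act as ordinary logistic layers.
    a1 = sigmoid(x @ W1 + bh1)
    a2 = sigmoid(a1 @ W2 + bh2)
    p = softmax(a2 @ W_out + b_out)
    # Backward pass.
    d_out = (p - y_onehot) / len(x)            # gradient at the softmax input
    d_a2 = (d_out @ W_out.T) * a2 * (1 - a2)   # through logistic layer 2
    d_a1 = (d_a2 @ W2.T) * a1 * (1 - a1)       # through logistic layer 1
    W_out -= lr * a2.T @ d_out;  b_out -= lr * d_out.sum(axis=0)
    W2 -= lr * a1.T @ d_a2;      bh2 -= lr * d_a2.sum(axis=0)
    W1 -= lr * x.T @ d_a1;       bh1 -= lr * d_a1.sum(axis=0)
</code>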

More layers => lower error with pre-training.

The solutions found with pre-training are qualitatively different from those found without it.

==== Model real-valued data with RBMs ====

Mean-field logistic units cannot represent precise intermediate values (e.g. pixel intensities in an image).

Instead, model the pixels as Gaussian visible units and train with alternating Gibbs sampling, using a lower learning rate.

The quadratic term in the energy is a parabolic containment function (it keeps each visible unit close to its bias $b_i$); the hidden units contribute a linear term whose energy gradient shifts the visible unit's mean.
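For reference, the usual Gaussian-binary RBM energy (this formula is assumed, not in the notes; $\sigma_i$ is the standard deviation of visible unit $i$):

$E(v,h) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_j b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}$

The first term is the parabolic containment; the last term is the energy gradient that the active hidden units exert on each visible unit.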

Stepped sigmoid units: make many copies of a stochastic binary unit. All copies share the same weights and bias b, but they have different fixed offsets to the bias (b-0.5, b-1.5, ...).
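In expectation, such a set of copies behaves like a smooth rectified unit; a small check of that (an assumption added here, reusing sigmoid from the snippet above):

<code python>
x = np.linspace(-5.0, 5.0, 11)
# Expected number of active copies with offsets b-0.5, b-1.5, ... (taking b = 0):
stepped = sum(sigmoid(x - k + 0.5) for k in range(1, 100))
softplus = np.log1p(np.exp(x))              # log(1 + e^x)
print(np.max(np.abs(stepped - softplus)))   # small (~0.01): close to a smooth rectified unit
</code>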

==== Structure ====

Autoencoder, then feed-forward NN.