===== Learning layers of features by stacking RBMs =====
+ | |||
+ | RBM can be learned fairly efficient. Stacking RBMs can learn lots of features. By stacking you get an indefinitely deep belief net. | ||
+ | |||
+ | Combine 2 RBMs to make a DBN. | ||
+ | |||
+ | v < | ||
+ | |||
+ | ==== Compose two RBM models ==== | ||
+ | |||
+ | v <=W_1= h_1 < | ||
+ | |||
+ | Bottom layer is unidirectional. => Not a Boltzman machine. It's a deep belief net. | ||
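
A minimal numpy sketch of this greedy stacking, assuming binary units and CD-1 training; the toy data, layer sizes and hyperparameters are illustrative placeholders, not taken from the note above.

<code python>
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def train_rbm(data, n_hidden, epochs=50, lr=0.1):
    """Train a binary RBM with one step of contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_h)        # positive phase
        h0 = sample(p_h0)
        v1 = sigmoid(h0 @ W.T + b_v)        # reconstruction (mean-field)
        p_h1 = sigmoid(v1 @ W + b_h)        # negative phase
        W   += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(data)
        b_v += lr * (v0 - v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h

# Toy binary data standing in for e.g. binarised MNIST.
data = (rng.random((200, 20)) < 0.3).astype(float)

# Greedy stacking: train RBM 1 on the data, then RBM 2 on RBM 1's hidden activities.
W1, bv1, bh1 = train_rbm(data, n_hidden=16)
h1 = sigmoid(data @ W1 + bh1)
W2, bv2, bh2 = train_rbm(h1, n_hidden=8)
</code>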
+ | |||
+ | ==== 3 layers ==== | ||
+ | |||
+ | data <=W_1= h_1 <=W_2= h_2 < | ||
+ | |||
+ | Generate data: | ||
+ | * Equilibrium sample from **top-level** RBM (h_2,h_3), by Gibbs sampling for a long time. Defines prior distr. of h_2. | ||
+ | * Top-Down pass from h_2 to get states for other layers. | ||
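
A minimal sketch of this generative procedure, assuming weights W_1, W_2, W_3 and biases have already been learned by stacking RBMs (random placeholders below).

<code python>
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

# Placeholder layer sizes and weights; in practice these come from greedy RBM training.
n_v, n_h1, n_h2, n_h3 = 784, 500, 500, 2000
W1 = rng.normal(0, 0.01, (n_v, n_h1))
W2 = rng.normal(0, 0.01, (n_h1, n_h2))
W3 = rng.normal(0, 0.01, (n_h2, n_h3))
b_v, b_h1, b_h2, b_h3 = (np.zeros(n) for n in (n_v, n_h1, n_h2, n_h3))

# 1) Equilibrium sample from the top-level RBM (h_2, h_3): prolonged Gibbs sampling.
h2 = sample(np.full(n_h2, 0.5))
for _ in range(500):
    h3 = sample(sigmoid(h2 @ W3 + b_h3))
    h2 = sample(sigmoid(h3 @ W3.T + b_h2))

# 2) Top-down pass through the directed connections to get the lower layers.
h1 = sample(sigmoid(h2 @ W2.T + b_h1))
v = sigmoid(h1 @ W1.T + b_v)   # visible probabilities (could also be sampled)
print(v.shape)
</code>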
+ | |||
+ | === Averaging factorial distributions === | ||
+ | |||
+ | By averaging 2 fact. distr. you don't get a fact. distr. => Mixture distr. | ||
+ | |||
+ | In RBM the posterior over 4 hidden units is factorial for each visible vector | ||
+ | * Posterior for v1: 0.9, 0.9, 0.1, 0.1 | ||
+ | * Posterior for v2: 0.1, 0.1, 0.9, 0.9 | ||
+ | * Aggregated: 0.5, 0.5, 0.5, 0.5 | ||
+ | |||

Consider the binary vector (1,1,0,0) (checked numerically after this list):
  * Posterior for v1: p(1,1,0,0) = 0.9^4 ≈ 0.66
  * Posterior for v2: p(1,1,0,0) = 0.1^4 = 0.0001
  * Aggregated posterior: p(1,1,0,0) = (0.66 + 0.0001)/2 ≈ 0.33
  * A factorial distribution with marginals 0.5 would give p = 0.5^4 = 0.0625 => the aggregated posterior is not factorial.
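
A quick numerical check of this example; the mixture assigns (1,1,0,0) a far higher probability than any factorial distribution with marginals 0.5 could.

<code python>
import numpy as np

post_v1 = np.array([0.9, 0.9, 0.1, 0.1])   # P(h_j = 1 | v_1)
post_v2 = np.array([0.1, 0.1, 0.9, 0.9])   # P(h_j = 1 | v_2)
x = np.array([1, 1, 0, 0])                  # binary hidden vector under consideration

def factorial_prob(means, x):
    """Probability of binary vector x under a factorial (independent Bernoulli) distribution."""
    return np.prod(np.where(x == 1, means, 1 - means))

p1 = factorial_prob(post_v1, x)              # 0.9**4 ~ 0.66
p2 = factorial_prob(post_v2, x)              # 0.1**4 = 0.0001
print("mixture:", (p1 + p2) / 2)             # ~ 0.33
print("factorial with means 0.5:", 0.5**4)   # 0.0625 -> clearly not the same distribution
</code>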
+ | |||
+ | |||
+ | ==== Why does learning work? ==== | ||
+ | |||
+ | Weights of bottom level RBM: | ||
+ | p(v|h); p(h|v); p(v,h); p(v); p(h); | ||
+ | |||
+ | Can express RBM model as $p(v) = \sum_h p(h) p(v|h)$ | ||
+ | |||
+ | If leave $p(v|h)$ alon, but improve $p(h)$, we improve $p(v)$. To improve $p(h)$ we need it be a **better model than $p(h;W)$** of the **aggregated** posterior distr. over hidden vectors produced by applying W transposed to the data. | ||
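
This argument can be made precise with the standard variational lower bound (not spelled out in the note above, but it is the usual justification): for any approximating distribution $q(h|v)$,

$$ \log p(v) \;\geq\; \sum_h q(h|v) \big[ \log p(h) + \log p(v|h) \big] \;-\; \sum_h q(h|v) \log q(h|v) $$

Greedy learning freezes $p(v|h)$ and $q(h|v)$ (both taken from the first RBM) and trains the next RBM to be a better model $p(h)$ of the aggregated posterior, which improves this bound on $\log p(v)$.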
+ | |||
+ | |||
+ | ==== Constrastive version of wake-sleep algorithm ==== | ||
+ | |||
+ | |||
==== Discriminative fine-tuning for DBNs ====

  * First learn one layer at a time by stacking RBMs.
  * Use this pre-training to find initial weights, which can then be fine-tuned by a local search procedure.

  * Previously: contrastive wake-sleep fine-tunes the model to be better at generation.
  * Now: use backprop to fine-tune the model to be better at discrimination.

Backprop works better with greedy pre-training:
  * It works well and scales to big networks, especially when we have locality in each layer.
  * We do not start backpropagation until we have sensible feature detectors.
  * The initial gradients are sensible, so backprop only needs to perform a local search from a sensible starting point.

Fine-tuning only modifies the features slightly to get the category boundaries right (it does not need to discover new features).

Objection: many features are learned that are useless for a particular discrimination.

Example model (MNIST): add a 10-way softmax at the top and do backprop.
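
A minimal sketch of such discriminative fine-tuning, assuming pretrained weights for two hidden layers (random placeholders below) and plain numpy backprop with a freshly initialised 10-way softmax on top; the data is a toy batch standing in for MNIST.

<code python>
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Pretrained weights would come from the stacked RBMs; placeholders here.
n_in, n_h1, n_h2, n_out = 784, 500, 500, 10
W1 = rng.normal(0, 0.01, (n_in, n_h1)); b1 = np.zeros(n_h1)
W2 = rng.normal(0, 0.01, (n_h1, n_h2)); b2 = np.zeros(n_h2)
W3 = rng.normal(0, 0.01, (n_h2, n_out)); b3 = np.zeros(n_out)   # new softmax layer

# Toy batch standing in for MNIST images and labels.
X = rng.random((64, n_in))
y = rng.integers(0, n_out, 64)
T = np.eye(n_out)[y]                      # one-hot targets

lr = 0.1
for _ in range(20):
    # Forward pass through the pretrained feature layers and the softmax.
    h1 = sigmoid(X @ W1 + b1)
    h2 = sigmoid(h1 @ W2 + b2)
    p = softmax(h2 @ W3 + b3)
    # Backward pass: softmax + cross-entropy gives the simple (p - T) error signal.
    d3 = (p - T) / len(X)
    d2 = (d3 @ W3.T) * h2 * (1 - h2)
    d1 = (d2 @ W2.T) * h1 * (1 - h1)
    # Gradient step: fine-tuning only nudges the pretrained features slightly.
    W3 -= lr * h2.T @ d3; b3 -= lr * d3.sum(axis=0)
    W2 -= lr * h1.T @ d2; b2 -= lr * d2.sum(axis=0)
    W1 -= lr * X.T @ d1;  b1 -= lr * d1.sum(axis=0)
</code>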
+ | |||
+ | |||
+ | More layers => lower error with pretraining. | ||
+ | |||
+ | Solutions are qualitative different. | ||
+ | |||

==== Model real-valued data with RBMs ====

Mean-field logistic units cannot represent precise intermediate values (e.g. pixel intensities in an image).

Instead, model the pixels as Gaussian visible units. Learning still uses alternating Gibbs sampling, but with a lower learning rate.

The energy function then has a parabolic containment term (keeping each visible unit $v_i$ close to its bias $b_i$) plus an energy-gradient term from the hidden units that shifts the mean.
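
For reference, the standard energy function of an RBM with Gaussian visible units and binary hidden units (only implied, not written out, in the note above): the quadratic term is the parabolic containment keeping $v_i$ close to $b_i$, and the last term is the hidden units' contribution that shifts the mean.

$$ E(v,h) = \sum_{i \in \mathrm{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j \in \mathrm{hid}} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij} $$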
+ | |||
+ | Stepped sigmoid units. Many copies of a stochastic binary unist. All copies have same weiths and bias, b, but they have different fixed offsets to the bias (b-0.5, b-1.5, ...). | ||
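
A quick numerical check of the stepped-sigmoid idea, under the assumption that the sum of many sigmoid copies with bias offsets $-0.5, -1.5, \dots$ is closely approximated by the softplus $\log(1 + e^x)$, i.e. a smooth rectified linear unit.

<code python>
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softplus = lambda x: np.log1p(np.exp(x))

x = np.linspace(-5.0, 10.0, 7)
# Sum of many sigmoid copies, each with a fixed bias offset of -0.5, -1.5, -2.5, ...
stepped = sum(sigmoid(x - k + 0.5) for k in range(1, 100))

print(np.round(stepped, 3))
print(np.round(softplus(x), 3))   # close to the stepped sum; ~ max(0, x) away from 0
</code>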
+ | |||
+ | ==== Structure ==== | ||
+ | |||
+ | Autoencoder, | ||