If we leave $p(v|h)$ alone but improve $p(h)$, we improve $p(v)$. To improve $p(h)$, we need a **better model than $p(h;W)$** of the **aggregated** posterior distribution over hidden vectors produced by applying $W^T$ to the data.
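This is the usual variational-bound argument for greedy learning (a sketch, not spelled out in the note above): with the recognition distribution $q(h|v)$ and $p(v|h)$ held fixed,

$$\log p(v) \;\geq\; \sum_h q(h|v)\,\big(\log p(h) + \log p(v|h)\big) \;-\; \sum_h q(h|v)\log q(h|v),$$

so replacing the prior $p(h)$ with a better model of the aggregated posterior raises the bound on $\log p(v)$.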
==== Contrastive version of wake-sleep algorithm ====

==== Discriminative fine-tuning for DBNs ====

  * First learn one layer at a time by stacking RBMs (see the sketch below).
  * Use this pre-training to find initial weights, which can then be fine-tuned by a local search procedure.

  * Previously: contrastive wake-sleep fine-tuning makes the model better at generation.
  * Now: use backprop to fine-tune the model to be better at discrimination.
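A minimal sketch of the greedy layer-wise procedure, assuming binary units, CD-1 updates, NumPy, and made-up layer sizes and hyperparameters (the names ''train_rbm'' and ''greedy_pretrain'' are illustrative, not from these notes):

<code python>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1, seed=0):
    """Train one binary RBM with CD-1; returns (W, b_vis, b_hid)."""
    rng = np.random.default_rng(seed)
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities and a binary sample given the data.
        p_h = sigmoid(data @ W + b_hid)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Negative phase: one step of alternating Gibbs sampling (CD-1 "reconstruction").
        p_v = sigmoid(h @ W.T + b_vis)
        p_h_recon = sigmoid(p_v @ W + b_hid)
        # Contrastive divergence update.
        W += lr * (data.T @ p_h - p_v.T @ p_h_recon) / len(data)
        b_vis += lr * (data - p_v).mean(axis=0)
        b_hid += lr * (p_h - p_h_recon).mean(axis=0)
    return W, b_vis, b_hid

def greedy_pretrain(data, hidden_sizes):
    """Stack RBMs: the hidden activities of one RBM become the data for the next."""
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        W, b_vis, b_hid = train_rbm(layer_input, n_hidden)
        layers.append((W, b_hid))
        layer_input = sigmoid(layer_input @ W + b_hid)  # probabilities as the next layer's "data"
    return layers

# Toy usage: random binary "images" with 784 pixels (MNIST-sized).
X = (np.random.default_rng(1).random((200, 784)) < 0.1).astype(float)
pretrained = greedy_pretrain(X, hidden_sizes=[500, 500, 2000])
</code>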
Backprop works better with greedy pre-training:
  * Works well and scales to big networks, especially when we have locality in each layer.
  * We do not start backpropagation until we already have sensible feature detectors.
  * The initial gradients are sensible, so backprop only needs to perform a local search from a sensible starting point.

Fine-tuning only modifies the features slightly to get the category boundaries right (it does not need to discover new features).

Objection: many of the learned features are useless for a particular discrimination task.

Example model (MNIST): add a 10-way softmax at the top and do backprop (see the sketch below).
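A minimal sketch of the discriminative fine-tuning step, assuming the pre-trained weights are already available as NumPy arrays (random placeholders below stand in for them), a toy batch standing in for MNIST, and plain cross-entropy backprop; all sizes and names are illustrative:

<code python>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Pre-trained weights would come from the stacked RBMs; random placeholders here.
layer_sizes = [784, 500, 500, 2000]                          # illustrative sizes
Ws = [0.01 * rng.standard_normal((m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]
bs = [np.zeros(n) for n in layer_sizes[1:]]
W_out = 0.01 * rng.standard_normal((layer_sizes[-1], 10))    # 10-way softmax on top
b_out = np.zeros(10)

# Toy batch standing in for MNIST images and one-hot digit labels.
X = rng.random((64, 784))
y = np.eye(10)[rng.integers(0, 10, size=64)]

lr = 0.05
for step in range(100):
    # Forward pass through the pre-trained layers and the softmax head.
    acts = [X]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(acts[-1] @ W + b))
    probs = softmax(acts[-1] @ W_out + b_out)

    # Backprop: cross-entropy gradient at the softmax output ...
    delta = (probs - y) / len(X)
    grad_W_out, grad_b_out = acts[-1].T @ delta, delta.sum(axis=0)
    delta = (delta @ W_out.T) * acts[-1] * (1.0 - acts[-1])
    W_out -= lr * grad_W_out
    b_out -= lr * grad_b_out
    # ... then through each sigmoid layer, only nudging the pre-trained weights.
    for i in reversed(range(len(Ws))):
        grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
        if i > 0:
            delta = (delta @ Ws[i].T) * acts[i] * (1.0 - acts[i])
        Ws[i] -= lr * grad_W
        bs[i] -= lr * grad_b
</code>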
More layers => lower error with pre-training.

The solutions found are qualitatively different.

==== Model real-valued data with RBMs ====

Mean-field logistic units cannot represent precise intermediate values (e.g. pixel intensities in an image).

Model the pixels as Gaussian variables and use alternating Gibbs sampling as before, but with a lower learning rate.
The energy function uses a parabolic containment term (which keeps each visible unit close to its bias $b_i$) and an energy-gradient term from the hidden units (which shifts the mean away from $b_i$); see the formula below.
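For reference, a sketch of the standard Gaussian-binary RBM energy these two terms come from (with visible standard deviations $\sigma_i$; not written out in the notes above):

$$E(\mathbf{v},\mathbf{h}) \;=\; \sum_{i \in \text{vis}} \frac{(v_i-b_i)^2}{2\sigma_i^2} \;-\; \sum_{j \in \text{hid}} b_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\,h_j\,w_{ij}$$

The first (parabolic) term keeps $v_i$ close to $b_i$; the last term has a constant gradient in $v_i$, so the active hidden units shift the mean of $v_i$ to $b_i + \sigma_i \sum_j h_j w_{ij}$.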
Stepped sigmoid units: use many copies of a stochastic binary unit. All copies have the same weights and the same bias $b$, but each has a different fixed offset to the bias ($b-0.5$, $b-1.5$, ...).
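The sum over the offset copies has a convenient closed-form approximation (a standard result, stated here as a sketch): with total input $x$, the expected number of active copies is

$$\sum_{n=1}^{\infty} \sigma\!\left(x - n + 0.5\right) \;\approx\; \log\!\left(1 + e^{x}\right),$$

which behaves like a rectified linear unit, $\max(0, x)$, away from $x = 0$.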
==== Structure ====

Autoencoder,