====== Autoencoder ======

  * Unsupervised learning: feature extraction, generative models, compression
  * Loss as evaluation metric
  * Difference to RBM: deterministic approach (not stochastic)
  * Encoder compresses the input to a few dimensions, decoder maps it back to the full dimensionality (see the sketch after this list)
  * Building block for deep belief networks
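Not from the notes: a minimal NumPy sketch of the encoder/decoder idea, a one-hidden-layer autoencoder trained by gradient descent on the squared reconstruction error. Data, layer sizes and the tanh/linear choices are illustrative assumptions.

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 8 dimensions (illustrative assumption).
X = rng.normal(size=(200, 8))

n_in, n_code = X.shape[1], 2          # bottleneck: 2-dimensional code
W_enc = rng.normal(scale=0.1, size=(n_in, n_code))
W_dec = rng.normal(scale=0.1, size=(n_code, n_in))
lr = 0.01

for epoch in range(500):
    code = np.tanh(X @ W_enc)         # encoder: compress to few dimensions
    X_hat = code @ W_dec              # decoder: map back to full dimensionality
    err = X_hat - X                   # reconstruction error drives the loss

    # Backpropagate the squared-error loss through decoder and encoder.
    grad_W_dec = code.T @ err
    grad_code = err @ W_dec.T
    grad_pre = grad_code * (1.0 - code ** 2)   # tanh derivative
    grad_W_enc = X.T @ grad_pre

    W_dec -= lr * grad_W_dec / len(X)
    W_enc -= lr * grad_W_enc / len(X)

print("final reconstruction MSE:", np.mean((np.tanh(X @ W_enc) @ W_dec - X) ** 2))
</code>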

===== Comparison with PCA =====

PCA:

===== Deep autoencoders =====

Looked like a nice way to do non-linear dimensionality reduction:
  * Encoding model compact and fast;
W_1 -> W_2 -> W_3 -> W_4 -> 30 linear units -> W^T_4 -> W^T_3 -> W^T_2 -> W^T_1
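A sketch of the weight-tying pattern above: the decoder reuses the transposed encoder weights and the code layer is 30 linear units. The layer sizes are illustrative assumptions; only the 30 linear code units come from the notes.

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes (an assumption); the code layer of 30 units is from the notes.
sizes = [2000, 1000, 500, 250, 30]

# Encoder weights W_1 .. W_4; the decoder reuses their transposes W^T_4 .. W^T_1.
W = [rng.normal(scale=0.01, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    h = x
    for Wi in W[:-1]:
        h = sigmoid(h @ Wi)       # logistic hidden layers
    return h @ W[-1]              # 30 linear code units (no squashing)

def decode(code):
    h = sigmoid(code @ W[-1].T)   # W^T_4
    for Wi in reversed(W[:-1]):   # W^T_3, W^T_2, W^T_1
        h = sigmoid(h @ Wi.T)
    return h

x = rng.random(size=(1, sizes[0]))
print(decode(encode(x)).shape)    # (1, 2000): back to full dimensionality
</code>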

===== Deep autoencoders for doc retrieval =====

Convert each doc into a "bag of words" (ignore stop words).
Reduce each query vector using a deep autoencoder.
When training the first RBM in the stack: treat the word counts as probabilities, but make the visible-to-hidden weights N times bigger than the hidden-to-visible weights, because we have N observations from the probability distribution.
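A small sketch of the bag-of-words preprocessing, with a hypothetical vocabulary and stop-word list; the RBM stack itself is omitted. It shows the counts being treated as a probability distribution over the N words of each doc.

<code python>
from collections import Counter

import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in"}   # tiny illustrative list

def bag_of_words(doc, vocab):
    """Count vector over a fixed vocabulary, ignoring stop words."""
    counts = Counter(w for w in doc.lower().split() if w not in STOP_WORDS)
    return np.array([counts[w] for w in vocab], dtype=float)

docs = ["the cat sat on the mat", "a dog chased the cat", "stocks fell in early trading"]
vocab = sorted({w for d in docs for w in d.lower().split()} - STOP_WORDS)

counts = np.stack([bag_of_words(d, vocab) for d in docs])
N = counts.sum(axis=1, keepdims=True)   # words per doc
probs = counts / N                      # treat counts as a probability distribution

# Because each doc gives N observations from this distribution, the notes scale the
# visible-to-hidden weights of the first RBM up by a factor of N (RBM not shown here).
print(vocab)
print(probs.round(2), N.ravel())
</code>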

===== Semantic hashing =====

Convert each doc into a memory address. Find similar docs at nearby addresses.

Autoencoder with 30 logistic units in the code layer.
During fine-tuning, add noise to the inputs of the code units.
  * The noise forces the activities to become bimodal in order to resist its effects.
  * Simply threshold the activities of the 30 code units to get a binary code.

Learn binary features for representation.

Deep autoencoder as hash function.

Query ("supermarket search"): hash the query, get its address, then fetch nearby addresses (semantically similar documents).
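A sketch of the lookup machinery, assuming hypothetical random codes in place of the trained encoder's output: threshold the code activities, pack them into a 30-bit address, and collect documents stored at nearby addresses.

<code python>
import itertools

import numpy as np

N_BITS = 30   # 30 code units -> 30-bit memory address

def to_address(code_activities):
    """Threshold the code-unit activities and pack the bits into an int address."""
    bits = (np.asarray(code_activities) > 0.5).astype(int)
    return int("".join(map(str, bits)), 2)

def nearby_addresses(address, radius=1):
    """All addresses within the given Hamming distance (flip up to `radius` bits)."""
    near = {address}
    for r in range(1, radius + 1):
        for positions in itertools.combinations(range(N_BITS), r):
            flipped = address
            for p in positions:
                flipped ^= 1 << p
            near.add(flipped)
    return near

# Hypothetical memory: address -> list of doc ids. Random codes stand in for the
# encoder output; real codes would place similar docs at nearby addresses.
rng = np.random.default_rng(0)
memory = {}
for doc_id in range(1000):
    addr = to_address(rng.random(N_BITS))
    memory.setdefault(addr, []).append(doc_id)

query_addr = to_address(rng.random(N_BITS))
hits = [d for a in nearby_addresses(query_addr, radius=1) for d in memory.get(a, [])]
print(f"{len(hits)} candidate docs near address {query_addr:#x}")
</code>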

===== Learn binary codes for image retrieval =====

Matching real-valued vectors is slow => short binary codes are faster.

Use semantic hashing with a 28-bit binary code to get a long shortlist of promising images. Then use a 256-bit binary code to do a serial search for good matches.

Krizhevsky's deep autoencoder: reconstructing 32x32 color images from 256-bit codes.
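A sketch of the two-stage search with hypothetical random codes, using plain Hamming-distance ranking for stage 1 instead of the address-based hashing:

<code python>
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary codes for an image database (in practice produced by
# trained deep autoencoders, as in the notes).
n_images = 100_000
codes28 = rng.integers(0, 2, size=(n_images, 28), dtype=np.uint8)    # shortlist stage
codes256 = rng.integers(0, 2, size=(n_images, 256), dtype=np.uint8)  # serial stage

def hamming(a, B):
    """Hamming distance from one code to every row of B."""
    return (a != B).sum(axis=1)

def retrieve(q28, q256, shortlist_size=1000, top_k=10):
    # Stage 1: 28-bit codes give a long shortlist of promising images
    # (ranked by Hamming distance here; semantic hashing would use address lookups).
    shortlist = np.argsort(hamming(q28, codes28))[:shortlist_size]
    # Stage 2: serial search over the shortlist with the 256-bit codes.
    order = np.argsort(hamming(q256, codes256[shortlist]))[:top_k]
    return shortlist[order]

query = 42  # pretend image 42 is the query; it should rank first
print(retrieve(codes28[query], codes256[query]))
</code>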

===== Shallow autoencoders for pre-training =====

Just have 1 layer. RBMs can be seen as shallow autoencoders.

Train the RBM with one-step contrastive divergence: makes the reconstruction look like the data.
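A minimal sketch of one-step contrastive divergence (CD-1) for a tiny binary RBM; sizes, data and learning rate are illustrative assumptions.

<code python>
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny binary RBM with illustrative sizes (6 visible, 3 hidden units).
n_vis, n_hid = 6, 3
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)
lr = 0.1

# Toy binary training data, just to exercise the update rule.
data = rng.integers(0, 2, size=(100, n_vis)).astype(float)

for epoch in range(50):
    # Positive phase: hidden probabilities given the data.
    h_prob = sigmoid(data @ W + b_hid)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)

    # One reconstruction step (CD-1): make the reconstruction look like the data.
    v_recon = sigmoid(h_sample @ W.T + b_vis)
    h_recon = sigmoid(v_recon @ W + b_hid)

    # Contrastive divergence update: <v h>_data - <v h>_recon.
    W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    b_vis += lr * (data - v_recon).mean(axis=0)
    b_hid += lr * (h_prob - h_recon).mean(axis=0)

print("mean reconstruction error:", np.mean((data - v_recon) ** 2))
</code>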

===== Conclusion about pre-training =====

For datasets without a huge number of labeled cases: pre-training helps subsequent discriminative learning, especially if extra unlabeled data is available.

For very large, labeled datasets: not necessary, but if nets get much larger, pre-training becomes necessary again.