Autoencoder
- Unsupervised learning: Feature extraction, Generative models, Compression, Data reduction
- Loss as evaluation metric
- Difference to RBM: Deterministic approach (not stochastic).
- Encoder compresses to few dimensions, Decoder maps back to full dimensionality
- Building block for deep belief networks
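The encoder/decoder idea can be sketched as a minimal purely linear autoencoder trained with gradient descent on toy NumPy data (all sizes, seeds, and learning rates here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 5-D that lie near a 2-D subspace.
Z = rng.normal(size=(100, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(100, 5))

# Encoder compresses 5 -> 2 dimensions, decoder maps 2 -> 5 back.
W_enc = 0.1 * rng.normal(size=(5, 2))
W_dec = 0.1 * rng.normal(size=(2, 5))

lr = 0.02
for _ in range(3000):
    code = X @ W_enc              # encoder: few dimensions
    X_hat = code @ W_dec          # decoder: full dimensionality
    err = X_hat - X
    # Gradient descent on the squared reconstruction error
    # (the loss itself doubles as the evaluation metric).
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = np.mean((X - (X @ W_enc) @ W_dec) ** 2)
```

Because both layers are linear, this converges toward the same 2-D subspace PCA would find (see the comparison below).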
Comparison with PCA
PCA:
- From N-dimensional data, keep the M orthogonal directions with the most variance.
- Reconstruct by using the mean value over all the data on the N−M directions that are not represented.
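The reconstruction rule above can be checked numerically. This sketch (toy data, SVD-based PCA) keeps the M projected coordinates and uses the mean on the remaining N−M directions, so the squared reconstruction error equals exactly the variance in the discarded directions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 2
X = rng.normal(size=(200, N)) @ rng.normal(size=(N, N))
Xc = X - X.mean(axis=0)            # centre the data

# Top-M principal directions from the SVD of the centred data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V_M = Vt[:M]                       # M orthogonal directions of most variance

# Reconstruct: keep the M projected coordinates, use the mean
# (zero after centring) on the remaining N-M directions.
X_hat = X.mean(axis=0) + (Xc @ V_M.T) @ V_M

# Squared reconstruction error = variance in the discarded directions,
# i.e. the sum of the squared discarded singular values.
sq_err = np.sum((X - X_hat) ** 2)
```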
Use backprop to implement PCA inefficiently:
- M hidden units as bottleneck
INPUT vector ⇒ Code ⇒ OUTPUT vector
Activities in the hidden units form an efficient code.
If the hidden and output layers are linear, it will learn hidden units that are a linear function of the data and minimize the squared reconstruction error (like PCA). The M hidden units will span the same space as the first M components of PCA, but the weight vectors may not be orthogonal and will tend to have equal variances.
Allows generalization of PCA.
With non-linear layers before and after the code, it should be possible to efficiently represent data that lies on or near a non-linear manifold.
input vector ⇒ encoding weights ⇒ code ⇒ decoding weights ⇒ output vector
Deep autoencoders
Looked like a nice way to do non-linear dimensionality reduction:
- The encoding model is compact and fast;
- learning time is linear in the number of training cases.
But it is very difficult to optimize deep autoencoders using backprop: small initial weights ⇒ the backprop gradient dies.
Optimize with unsupervised layer-by-layer pre-training, or initialize the weights carefully as in Echo-State nets.
Stack of 4 RBMs, then unroll them. Fine-tune with gentle backprop.
784 → 1000 → 500 → 250 → 30 linear units → 250 → 500 → 1000 → 784. Encoder weights W_1 → W_2 → W_3 → W_4 into the 30 linear code units, then decoder weights W^T_4 → W^T_3 → W^T_2 → W^T_1.
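A shape-only sketch of the unrolled stack, assuming logistic hidden layers, a linear 30-unit code layer, and decoder weights tied to the transposed encoder weights (the random weights here merely stand in for the RBM pre-trained ones):

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [784, 1000, 500, 250, 30]   # encoder layer sizes from the notes

# Pretend these came from pre-trained RBMs; random here for illustration.
W = [0.01 * rng.normal(size=(a, b)) for a, b in zip(sizes, sizes[1:])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(x):
    # Logistic hidden layers, linear 30-unit code layer.
    for Wi in W[:-1]:
        x = sigmoid(x @ Wi)
    return x @ W[-1]

def decode(code):
    # Unrolled decoder: the same weights transposed, in reverse order.
    x = code
    for Wi in reversed(W):
        x = sigmoid(x @ Wi.T)
    return x

x = rng.random(784)
x_hat = decode(encode(x))
```

After unrolling, gentle backprop fine-tunes all eight weight matrices (the tied copies become independent parameters).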
Deep autoencoders for doc retrieval
Convert each doc into a “bag of words” vector (ignoring stop words). Reduce each query vector using a deep autoencoder.
Input vector: 2000 word counts ⇒ output vector: 2000 reconstructed counts.
Divide the counts in a bag-of-words vector by N, where N is the total number of non-stop words in the document. The output of the autoencoder is a 2000-way softmax.
When training the first RBM in the stack: treat word counts as probabilities, but make the visible-to-hidden weights N times bigger than the hidden-to-visible weights, because we have N observations from the probability distribution.
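A toy illustration of the count preprocessing: counts are divided by N to give a target distribution, the output layer is a softmax over the vocabulary, and the cross-entropy is scaled by N because the document supplies N observations (the 5-word vocabulary and the logits are made-up numbers):

```python
import numpy as np

# Toy "bag of words": counts over a small vocabulary.
counts = np.array([3.0, 0.0, 1.0, 4.0, 2.0])
N = counts.sum()                 # total non-stop words in the doc
target = counts / N              # counts divided by N -> probabilities

# The autoencoder's output layer is a softmax over the vocabulary.
logits = np.array([1.0, -2.0, 0.0, 1.5, 0.5])   # hypothetical pre-activations
out = np.exp(logits - logits.max())
out /= out.sum()

# Cross-entropy between the target distribution and the softmax output,
# scaled by N: equivalent to N draws from the probability distribution.
loss = -N * np.sum(target * np.log(out))
```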
Semantic hashing
Convert each doc into a memory address. Find similar docs at nearby addresses.
Autoencoder with 30 logistic units in the code layer. During fine-tuning, add noise to the inputs of the code units.
- Noise forces activities to become bimodal in order to resist the effects of the noise.
- Simply threshold activities of 30 code units to get binary code.
Learn binary features for representation.
Deep autoencoder as hash function.
Query (like supermarket search): hash it to get an address, then look at nearby addresses for semantically similar documents.
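The lookup step can be sketched as follows, assuming the code-unit activities have already been computed (4 bits here instead of 30, for readability): thresholding gives a binary code that is packed into a memory address, and flipping single bits enumerates the nearby addresses at Hamming distance 1:

```python
import numpy as np

def to_address(code_activities, thresh=0.5):
    # Threshold the code-unit activities to get a binary code,
    # then pack the bits into a single integer memory address.
    bits = (np.asarray(code_activities) > thresh).astype(int)
    addr = 0
    for b in bits:
        addr = (addr << 1) | int(b)
    return addr

def nearby_addresses(addr, n_bits):
    # Flip each bit once: all addresses at Hamming distance 1.
    return [addr ^ (1 << i) for i in range(n_bits)]

addr = to_address([0.9, 0.1, 0.8, 0.7])   # bits 1,0,1,1 -> 0b1011 = 11
neigh = nearby_addresses(addr, 4)
```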
Learn binary codes for image retrieval
Matching real-valued vectors is slow ⇒ a short binary code is faster.
Use semantic hashing with a 28-bit binary code to get a long shortlist of promising images. Then use a 256-bit binary code to do a serial search for good matches.
Krizhevsky's deep autoencoder: 8192 ⇒ 4096 ⇒ … ⇒ 256-bit binary code (the architecture is just a guess).
Reconstructing 32×32 color images from 256-bit codes.
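The two-stage search can be sketched with a hypothetical database of random codes standing in for real image codes (28-bit hash for the shortlist, 256-bit code for the serial search; the radius and shortlist mechanics are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical database: for each image, a short 28-bit hash code
# and a longer 256-bit code, stored as bit arrays.
n_images = 1000
short_codes = rng.integers(0, 2, size=(n_images, 28))
long_codes = rng.integers(0, 2, size=(n_images, 256))

def retrieve(q_short, q_long, max_hamming=4, top_k=5):
    # Stage 1: semantic hashing with the 28-bit code -> long shortlist
    # of images within a small Hamming ball of the query.
    d_short = np.sum(short_codes != q_short, axis=1)
    shortlist = np.nonzero(d_short <= max_hamming)[0]
    # Stage 2: serial search over the shortlist with the 256-bit code.
    d_long = np.sum(long_codes[shortlist] != q_long, axis=1)
    return shortlist[np.argsort(d_long)[:top_k]]

# Querying with an image already in the database should return it first.
result = retrieve(short_codes[0], long_codes[0])
```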
Shallow autoencoders for pre-training
Just one hidden layer. RBMs can be seen as shallow autoencoders.
Train the RBM with one-step contrastive divergence: this makes the reconstruction look like the data.
Conclusion about pre-training
For data sets without a huge number of labeled cases: pre-training helps subsequent discriminative learning, especially if extra unlabeled data is available.
For very large labeled datasets: not necessary; but if nets get much larger, pre-training will be necessary again.