Sequence learning
- Often want to turn input sequence into an output sequence of a different domain
- When no target sequence available, predict next term of input sequence
- Blurs the line between supervised and unsupervised learning (no separate target variable is required; the next term of the sequence serves as the target)
Simple models without memory:
- Autoregressive model: predict the next term as a weighted average of previous terms (a linear model); see the sketch after this list
- Feed-forward net with a hidden layer
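A minimal sketch of the autoregressive idea, assuming a numpy setup; the function names (fit_ar, predict_next) and the least-squares fit are illustrative choices, not from the notes:

```python
import numpy as np

# Sketch: predict the next term as a weighted combination of the previous
# k terms. Weights fitted by least squares (assumed choice).
def fit_ar(series, k):
    X = np.array([series[i:i + k] for i in range(len(series) - k)])  # lagged inputs
    y = np.array(series[k:])                                         # next terms
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_next(series, w):
    k = len(w)
    return float(np.dot(series[-k:], w))  # weighted average of the last k terms

# Usage: w = fit_ar(values, k=3); predict_next(values, w)
```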
Introduce hidden state:
- Best we can do is infer a probability distribution over the space of hidden state vectors.
Linear dynamical system
Hidden state:
- Real-valued hidden state that cannot be observed directly.
- Hidden state has linear dynamics with Gaussian noise and produces observations using a linear model with Gaussian noise.
Optional: driving inputs which directly influence the hidden state.
(Diagram: a chain of hidden states h evolving over time, each emitting an observation o and receiving a driving input di.)
To predict the next output, we need to infer the hidden state.
A linearly transformed Gaussian is still Gaussian, so the distribution over the hidden state given the data so far is Gaussian. It can be computed efficiently using "Kalman filtering" (see the sketch below).
Estimating this Gaussian distribution is therefore tractable.
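A minimal Kalman filtering sketch, assuming known linear-Gaussian parameters; A (transition), C (observation), Q, R (noise covariances) and the function name are hypothetical:

```python
import numpy as np

# One predict + update step of a Kalman filter: it propagates the Gaussian
# belief over the hidden state and then conditions it on the new observation y.
def kalman_step(mu, Sigma, y, A, C, Q, R):
    # Predict: push the Gaussian through the linear dynamics (stays Gaussian).
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update: condition on the new observation.
    S = C @ Sigma_pred @ C.T + R              # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new                  # Gaussian over the hidden state
```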
Hidden Markov Models
http://wiki.movedesign.de/doku.php?id=data_mining:hmm
- Discrete distribution over hidden states
- Transitions between states are stochastic
- Infer a probability distribution over hidden states (see the forward-filtering sketch below)
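A minimal sketch of inferring the distribution over hidden states (forward filtering), assuming a known HMM; T (transition matrix), E (emission matrix), pi (initial distribution) and the function name are hypothetical:

```python
import numpy as np

# Forward filtering: p(hidden state | observations so far), renormalized at
# each step. T[i, j] = p(state j | state i), E[i, o] = p(observation o | state i).
def forward_filter(obs, T, E, pi):
    alpha = pi * E[:, obs[0]]
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]  # propagate, then weight by emission likelihood
        alpha /= alpha.sum()           # normalize to a distribution over states
    return alpha
```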
Limitation: at each time step, it selects one of its hidden states. With N hidden states, it can only remember log(N) bits about what it has generated so far.
Consider the information that the first half of an utterance contains about the second half:
- Syntax needs to fit (number, tense)
- Semantics needs to fit
- Accent, rate, volume has to fit
E.g. if 100 bits are needed, 2^100 hidden states would be necessary.
Recurrent neural networks
- Distributed hidden state: stores information about the past efficiently
- Non-linear dynamics: allows them to update their hidden state in complicated ways
- RNNs combine the input vector with their state vector using a fixed (learned) function to create a new state vector (see the sketch below)
- RNNs are Turing-complete: in principle they can learn stateful programs that process a fixed data set
- The output vector depends not only on the current input, but on the complete input history
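A minimal sketch of one RNN step, assuming a plain tanh cell; the weight names (W_xh, W_hh, W_hy) are hypothetical:

```python
import numpy as np

# One step of a vanilla RNN: the same fixed (learned) function combines the
# current input with the previous state, so the output depends on the whole
# input history through the hidden state h.
def rnn_step(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)  # new state vector
    y = W_hy @ h + b_y                           # output vector
    return h, y
```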
Example applications:
- time series forecasting
- image captioning (assign words to elements)
- music composition (MIDI files)
- language modelling (e.g. predict next word)
Derived models:
- Recursive NNs
- Recursive Neural Tensor Networks
- Hopfield Networks
- Echo State Networks
Problems:
- Sensitive to parameter changes
- Vanishing / exploding gradient problem (see the illustration after this list)
- Remembering states at all times is expensive
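A tiny numerical illustration of the vanishing/exploding gradient problem, under the simplifying assumption that backpropagation through time repeatedly multiplies the gradient by the recurrent weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.5, 1.5):
    W = scale * np.eye(10)              # hypothetical recurrent weight matrix
    grad = rng.normal(size=10)
    for _ in range(50):                 # 50 steps of backpropagation through time
        grad = W.T @ grad
    # scale < 1: the gradient shrinks toward zero; scale > 1: it blows up.
    print(scale, np.linalg.norm(grad))
```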
Long Short-Term Memory (LSTM) model
One solution to the problems mentioned above.
4 Elements:
- information/memory cell: holds information
- keep gate: maintains or deletes data in the information cell; computes how much of the data should be remembered; input: input data + state
- write gate: computes how much of the output data should be written into the memory cell; input: input data + state + RNN output from the previous time step
- read gate: reads data from the information cell (signal in (-1; 1)) and sends it from the LSTM to the RNN network; computes how much of the signal should be sent to the RNN; input: input data + state
- The data input and state are directly connected to all gates.
- The network output (RNN output) is connected to the write gate.
- The read gate is connected to the network's processing cell (LSTM output).
- The write gate sends its output to the information cell; the information cell sends data to the read gate.
Gates are logistic functions (nice derivatives).
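A minimal LSTM step sketch in the keep/write/read terminology used above (they correspond to the usual forget/input/output gates); the weight dictionaries and function name are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step: logistic gates decide how much of the memory cell to keep,
# how much new information to write, and how much of the cell to read out.
def lstm_step(x, h_prev, c_prev, W, b):
    z = np.concatenate([x, h_prev])               # input data + state feed all gates
    keep  = sigmoid(W["keep"]  @ z + b["keep"])   # keep gate: retain vs. delete cell contents
    write = sigmoid(W["write"] @ z + b["write"])  # write gate: how much to store
    read  = sigmoid(W["read"]  @ z + b["read"])   # read gate: how much to expose
    cand  = np.tanh(W["cand"]  @ z + b["cand"])   # candidate values for the cell
    c = keep * c_prev + write * cand              # information/memory cell update
    h = read * np.tanh(c)                         # signal in (-1, 1) sent to the network
    return h, c
```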
Language modelling
Perplexity measure: the goal is low perplexity, i.e. high confidence in the next word (see the sketch below). Data sets: Penn Treebank.
Word embedding: an n-dimensional vector of real numbers (n > 100). Words used in similar contexts end up at similar positions in the vector space. Can be visualized with t-SNE (dimensionality reduction).
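A minimal perplexity sketch, assuming we already have the model's probabilities for each observed next word; perplexity is the exponential of the average negative log-likelihood:

```python
import numpy as np

# Low perplexity = the model assigns high probability (high confidence)
# to the words that actually occur.
def perplexity(probs):
    probs = np.asarray(probs, dtype=float)
    return float(np.exp(-np.mean(np.log(probs))))

print(perplexity([0.5, 0.4, 0.6]))    # confident model -> low perplexity (~2.0)
print(perplexity([0.05, 0.1, 0.02]))  # unsure model    -> high perplexity (~21.5)
```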