Sequence learning
We often want to turn an input sequence into an output sequence in a different domain
When no target sequence is available, predict the next term of the input sequence
This is a mix between supervised and unsupervised learning (no additional target variable is required)
Simple model without memory: e.g. an autoregressive model that predicts the next term from a fixed number of previous terms
Introduce hidden state
* The best we can do is infer a probability distribution over the space of hidden state vectors.
Linear dynamical system
Hidden state: real-valued, not directly observable; evolves with linear dynamics plus Gaussian noise
Optional: driving inputs which directly influence the hidden state
time ->
  o      o      o        o  = output
  |      |      |
--> h --> h --> h        h  = hidden state
  |      |      |
  di     di     di       di = driving input (optional)
To predict the next output, we need to infer the hidden state.
A linearly transformed Gaussian is still Gaussian, so the distribution over the hidden state given the data so far is Gaussian. It can be computed efficiently using "Kalman filtering", a recursive estimate of that Gaussian's mean and covariance.
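A minimal sketch of the idea (my own toy example, not from the lecture): a 1-D Kalman filter that carries a Gaussian belief (mean, variance) over the hidden state and updates it from noisy observations. The dynamics/noise parameters a, c, q, r are illustrative assumptions.

```python
import numpy as np

def kalman_filter_1d(observations, a=1.0, c=1.0, q=0.1, r=0.5):
    """Track a Gaussian belief over a scalar hidden state.

    Assumed toy model:
      h_t = a * h_{t-1} + process noise      (variance q)
      y_t = c * h_t     + observation noise  (variance r)
    """
    mean, var = 0.0, 1.0              # initial Gaussian belief over the hidden state
    beliefs = []
    for y in observations:
        # Predict: push the Gaussian through the linear dynamics
        mean_pred = a * mean
        var_pred = a * a * var + q
        # Update: combine the prediction with the new observation
        k = var_pred * c / (c * c * var_pred + r)   # Kalman gain
        mean = mean_pred + k * (y - c * mean_pred)
        var = (1 - k * c) * var_pred
        beliefs.append((mean, var))
    return beliefs

# Example: noisy observations of a slowly drifting state
obs = np.cumsum(np.random.randn(50) * 0.1) + np.random.randn(50) * 0.5
print(kalman_filter_1d(obs)[-1])      # final (mean, variance) of the belief
```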
Hidden Markov Models
http://wiki.movedesign.de/doku.php?id=data_mining:hmm
Limitation: at each time step, the HMM selects one of its hidden states. With N hidden states it can only remember log(N) bits about what it generated so far.
Consider the information that the first half of an utterance contains about the second half:
Syntax needs to fit (number, tense)
Semantics needs to fit
Accent, rate, and volume have to fit
E.g. if 100 bits are needed, 2^100 hidden states would be required.
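A toy sketch of HMM generation (all numbers and names are made-up assumptions) that shows the limitation: at every step the generator sits in exactly one of N discrete states, so it carries at most log2(N) bits forward about what it has generated.

```python
import numpy as np

def hmm_generate(pi, A, B, length, rng=np.random.default_rng(0)):
    """Sample an observation sequence from a toy HMM.

    At every step the model is in exactly ONE of the N discrete states, so
    everything it "remembers" about what it generated so far is that single
    state choice, i.e. at most log2(N) bits.
    """
    n_states = len(pi)
    state = rng.choice(n_states, p=pi)
    observations = []
    for _ in range(length):
        observations.append(rng.choice(B.shape[1], p=B[state]))  # emit a symbol
        state = rng.choice(n_states, p=A[state])                 # move to next state
    return observations

pi = np.array([0.6, 0.4])                   # initial state distribution
A = np.array([[0.7, 0.3], [0.2, 0.8]])      # transition probabilities
B = np.array([[0.9, 0.1], [0.3, 0.7]])      # emission probabilities
print(hmm_generate(pi, A, B, length=10))
print("bits of memory:", np.log2(len(pi)))  # 1 bit for N = 2 states
```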
Recurrent neural networks
Distributed hidden state: stores information about the past efficiently
Non-linear dynamics: allows them to update their hidden state in complicated ways
RNNs combine the input vector with their state vector using a fixed (learned) function to create a new state vector.
RNNs are Turing-complete: in principle they can learn stateful programs that process the data.
The output vector depends not only on the current input, but on the complete input history.
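A minimal sketch of this fixed state-update function for a vanilla RNN (toy dimensions and random weights are assumptions, not any particular library's API):

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: combine the input vector and the previous state
    with a fixed (learned) function to produce the new state vector."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Apply the same step over a whole sequence; the final state depends on
    the complete input history, not just the last input."""
    h = h0
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
    return h

# Toy dimensions (assumptions for illustration): 3-dim inputs, 5-dim state
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), np.zeros(5)
inputs = [rng.normal(size=3) for _ in range(10)]
print(rnn_forward(inputs, np.zeros(5), W_xh, W_hh, b_h))
```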
Example applications:
time series forecasting
image captioning (assign words to elements)
music composing (midi files)
language modelling (e.g. predict next word)
Derived models: e.g. the LSTM (see below)
Problems:
Sensitive to parameter changes
Vanishing / exploding gradient problem (see the sketch after this list)
Remembering states at all times is expensive
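A rough numerical illustration of the vanishing/exploding gradient problem (my own toy example): backpropagation through time multiplies the gradient by the recurrent weight matrix once per step, so its norm shrinks or blows up exponentially with sequence length.

```python
import numpy as np

def gradient_norm_through_time(scale, steps=50, dim=5, seed=0):
    """Norm of d h_T / d h_0 for a linear recurrence h_t = W h_{t-1}.

    'scale' controls the size of W's eigenvalues: < 1 makes gradients vanish,
    > 1 makes them explode. All parameters here are toy assumptions.
    """
    rng = np.random.default_rng(seed)
    W = scale * rng.normal(size=(dim, dim)) / np.sqrt(dim)
    jacobian = np.eye(dim)
    for _ in range(steps):
        jacobian = W @ jacobian        # chain rule through one more time step
    return np.linalg.norm(jacobian)

print(gradient_norm_through_time(0.5))   # tiny  -> vanishing gradient
print(gradient_norm_through_time(1.5))   # huge  -> exploding gradient
```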
Long Short-Term Memory (LSTM) model
One solution to the problems mentioned above.
4 Elements:
information/memory cell: holds information
keep gate: maintains or deletes data in the information cell; calculates how much of the data should be remembered; input: input data + state
write gate: determines how much of the output data should be written into the memory cell; input: input data + state + RNN output from the most recent time step
read gate: reads data from the information cell (signal in (-1, 1)) and sends it from the LSTM to the RNN network; determines how much of the signal should be sent; input: input data + state
Data input and state are directly connected to all gates.
The network output (RNN output) is connected to the write gate.
The read gate is connected to the network's processing cell (LSTM output).
The write gate sends its output to the information cell; the information cell sends data to the read gate.
Gates use logistic (sigmoid) functions (nice derivatives).
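A hedged sketch of one LSTM step using the gate names from these notes (keep/write/read, corresponding to the usual forget/input/output gates); shapes and random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, cell_prev, params):
    """One LSTM step with the notes' gate names.

    keep gate  (forget gate): how much of the memory cell to keep
    write gate (input gate):  how much new data to write into the cell
    read gate  (output gate): how much of the cell signal to send onwards
    All gates see the current input data and the previous state.
    """
    z = np.concatenate([x, h_prev])                    # input data + state
    keep  = sigmoid(params["W_keep"]  @ z + params["b_keep"])
    write = sigmoid(params["W_write"] @ z + params["b_write"])
    read  = sigmoid(params["W_read"]  @ z + params["b_read"])
    candidate = np.tanh(params["W_cell"] @ z + params["b_cell"])
    cell = keep * cell_prev + write * candidate        # information/memory cell
    h = read * np.tanh(cell)                           # signal in (-1, 1) sent onwards
    return h, cell

# Toy shapes (assumptions): 3-dim input, 4-dim hidden/cell state
rng = np.random.default_rng(0)
params = {name: rng.normal(size=(4, 7)) for name in ["W_keep", "W_write", "W_read", "W_cell"]}
params.update({name: np.zeros(4) for name in ["b_keep", "b_write", "b_read", "b_cell"]})
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), params)
print(h, c)
```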
Language modelling
Perplexity measure: the exponential of the average negative log-likelihood per word; target: low perplexity, i.e. high confidence in the actual next words (see the sketch below).
Data sets: Penn Treebank
Word embedding: n-dimensional vector of real numbers (n > 100). Words used in similar contexts end up with similar positions in the vector space. Can be visualized with t-SNE (dimensionality reduction).
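A small sketch of the perplexity computation (the probabilities below are made-up toy values): the exponential of the average negative log-likelihood the model assigned to the words that actually occurred.

```python
import numpy as np

def perplexity(predicted_probs):
    """Perplexity from the probabilities a language model assigned to the
    words that actually occurred. Lower perplexity = the model was more
    confident about the right words."""
    log_probs = np.log(np.asarray(predicted_probs))
    return float(np.exp(-log_probs.mean()))

# Toy values: probabilities the model gave to the true next words
print(perplexity([0.2, 0.1, 0.5, 0.05]))   # ~ 6.7, fairly uncertain model
print(perplexity([0.9, 0.8, 0.95, 0.85]))  # ~ 1.15, confident model
```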
Sources