====== Sequence learning ======

  * Often we want to turn an input sequence into an output sequence in a different domain.
  * When no target sequence is available, predict the next term of the input sequence.
  * This is a mix between supervised and unsupervised learning: no additional target variable is required, the next term of the input acts as the target.

Simple models without memory:
  * Autoregressive model: weighted average of previous terms (linear model).
  * Feed-forward network with a hidden layer.

Introducing hidden state:
  * The best we can do is infer a probability distribution over the space of hidden state vectors.

===== Linear dynamical system =====

Hidden state:
  * Real-valued hidden state that cannot be observed directly.
  * The hidden state has linear dynamics with Gaussian noise and produces observations through a linear model with Gaussian noise.
  * Optional: driving inputs that directly influence the hidden state.

  time ->   o     o     o      (observations)
            |     |     |
       -->  h --> h --> h      (hidden state)
            |     |     |
            di    di    di     (driving inputs)

To predict the next output we need to infer the hidden state. A linearly transformed Gaussian is again Gaussian, so the distribution over the hidden state given the data so far is Gaussian. It can be computed exactly with "Kalman filtering" (a minimal sketch follows the LSTM section below).

===== Hidden Markov Models =====

http://wiki.movedesign.de/doku.php?id=data_mining:hmm

  * Discrete hidden states.
  * Transitions between states are stochastic.
  * Infer a probability distribution over the hidden states.

Limitation: at each time step the model is in exactly one of its N hidden states, so it can only remember log(N) bits about what it has generated so far. Consider the information that the first half of an utterance contains about the second half:
  * Syntax needs to fit (number, tense).
  * Semantics needs to fit.
  * Accent, rate and volume have to fit.
If 100 bits have to be carried across, 2^100 hidden states would be necessary.

===== Recurrent neural networks =====

  * Distributed hidden state: stores information about the past efficiently.
  * Non-linear dynamics allow them to update their hidden state in complicated ways.
  * RNNs combine the input vector with their state vector using a fixed function to create a new state vector (sketch after the LSTM section below).
  * RNNs are Turing complete: in principle they can learn stateful programs that process their input.
  * The output vector depends not only on the current input but on the complete input history.

Example applications:
  * time series forecasting
  * image captioning (assign words to image elements)
  * music composition (MIDI files)
  * language modelling (e.g. predict the next word)

Derived models:
  * Recursive NNs
  * Recursive Neural Tensor Networks
  * Hopfield Networks
  * Echo State Networks

Problems:
  * Sensitive to parameter changes
  * Vanishing / exploding gradient problem
  * Remembering state over long time spans is expensive

===== Long Short-Term Memory (LSTM) model =====

One solution to the problems mentioned above. Four elements:
  * Information/memory cell: holds information.
  * Keep gate: maintains or deletes the content of the information cell, i.e. calculates how much of the stored data should be remembered. Input: input data + state.
  * Write gate: decides how much of the new data should be written into the memory cell. Input: input data + state + RNN output from the previous time step.
  * Read gate: reads data from the information cell (a signal in (-1, 1)) and decides how much of that signal is sent from the LSTM to the rest of the network. Input: input data + state.

Wiring:
  * Data input and state are directly connected to all gates.
  * The network output (RNN output) is connected to the write gate.
  * The read gate is connected to the network's processing cells (LSTM output).
  * The write gate sends its output to the information cell, and the information cell sends data to the read gate.

Gates are logistic functions (nice derivatives).
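The Kalman filtering step referenced in the linear dynamical system section above fits in a few lines. This is only a sketch: the transition matrix ''A'', observation matrix ''C'' and noise covariances ''Q'', ''R'' are invented for illustration, not taken from the notes.

<code python>
# Minimal Kalman filter predict/update step for a linear dynamical system.
# All model matrices below are illustrative assumptions.
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # hidden-state transition (linear dynamics)
C = np.array([[1.0, 0.0]])               # observation model (linear)
Q = 0.01 * np.eye(2)                     # transition (process) noise covariance
R = np.array([[0.1]])                    # observation noise covariance

def kalman_step(mu, Sigma, y):
    """One filtering step: the posterior over the hidden state stays Gaussian."""
    # Predict: push the Gaussian through the linear dynamics.
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update: condition on the new observation y.
    S = C @ Sigma_pred @ C.T + R                 # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new

mu, Sigma = np.zeros(2), np.eye(2)               # Gaussian prior over the hidden state
for y in np.array([[0.9], [2.1], [3.2]]):        # a few observations
    mu, Sigma = kalman_step(mu, Sigma, y)
print(mu)                                        # posterior mean of the hidden state
</code>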
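The fixed state-update function of a plain RNN, mentioned in the recurrent neural network section, is a single non-linearity applied to the input and the previous state. A minimal sketch with illustrative sizes and random weights:

<code python>
# Plain RNN: one fixed function combines the input vector with the previous
# state vector to produce the new state. Sizes and weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 8, 16, 8
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden (recurrence)
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

def rnn_step(x, h):
    h_new = np.tanh(W_xh @ x + W_hh @ h)   # new state depends on input AND old state
    y = W_hy @ h_new                       # so the output depends on the whole input history
    return y, h_new

h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):       # a sequence of 5 input vectors
    y, h = rnn_step(x, h)
</code>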
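And a sketch of one LSTM cell step with the gates described above. The keep, write and read gates correspond to the forget, input and output gates of the standard LSTM formulation; weight names and sizes are illustrative assumptions, and biases are omitted for brevity.

<code python>
# One LSTM cell step with the gates from the notes: keep gate (= forget gate),
# write gate (= input gate) and read gate (= output gate). Sketch only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # logistic gate activation (nice derivatives)

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 16
# Every gate sees the input data and the previous state, concatenated as [x, h].
Wk, Ww, Wr, Wc = (rng.normal(scale=0.1, size=(n_hidden, n_in + n_hidden)) for _ in range(4))

def lstm_step(x, h, c):
    z = np.concatenate([x, h])
    keep  = sigmoid(Wk @ z)                      # how much of the memory cell to keep
    write = sigmoid(Ww @ z)                      # how much new information to write
    read  = sigmoid(Wr @ z)                      # how much of the cell to expose as output
    c_new = keep * c + write * np.tanh(Wc @ z)   # information/memory cell update
    h_new = read * np.tanh(c_new)                # read gate sends a signal in (-1, 1) onward
    return h_new, c_new

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):             # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c)
</code>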
==== Language modelling ====

Perplexity measure: the exponential of the average negative log-probability the model assigns to the correct next word. Target: low perplexity, i.e. high confidence in the observed words (a small numerical sketch is at the end of this page).

Data sets: Penn Treebank.

Word embedding: an n-dimensional vector of real numbers (n > 100) representing a word. Words used in similar contexts end up at similar positions in the vector space. Embeddings can be visualized with t-SNE (dimensionality reduction).

===== Sources =====

  * http://karpathy.github.io/2015/05/21/rnn-effectiveness/
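A small numerical sketch of the perplexity measure from the language modelling subsection; the probabilities are invented for illustration. A perfect model scores 1, higher values mean less confidence in the correct words.

<code python>
# Perplexity of a language model on a held-out sequence: the exponential of the
# average negative log-probability assigned to the actually observed next word.
import numpy as np

# Probability the model assigned to the correct next word, per position (made up).
p_correct = np.array([0.2, 0.05, 0.4, 0.1, 0.3])

perplexity = np.exp(-np.mean(np.log(p_correct)))
print(perplexity)   # lower = the model is more confident about the right words
</code>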