
Sequence learning

  • Often we want to turn an input sequence into an output sequence in a different domain
  • When no target sequence is available, predict the next term of the input sequence
  • This blurs the line between supervised and unsupervised learning (no additional target variable is required)

Simple models without memory:

  • Autoregressive model: weighted average of previous terms (a linear model); see the sketch after this list
  • Feed-forward network with a hidden layer
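
A minimal autoregressive-model sketch (numpy only; the AR order p=3, the toy sine-wave data and the helper names fit_ar / predict_next are illustrative assumptions, not from the source): the next value is predicted as a learned weighted sum of the previous p terms, fitted by least squares.

  import numpy as np

  def fit_ar(series, p=3):
      """Fit an AR(p) model x[t] ~ w · [x[t-p], ..., x[t-1]] by least squares."""
      X = np.array([series[t - p:t] for t in range(p, len(series))])  # lagged inputs
      y = series[p:]                                                  # targets
      w, *_ = np.linalg.lstsq(X, y, rcond=None)                       # linear weights
      return w

  def predict_next(series, w):
      return series[-len(w):] @ w   # weighted average of the most recent terms

  # toy usage: noisy sine wave
  t = np.arange(200)
  series = np.sin(0.1 * t) + 0.05 * np.random.randn(200)
  w = fit_ar(series, p=3)
  print("predicted next value:", predict_next(series, w))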

Introducing a hidden state: the best we can do is infer a probability distribution over the space of hidden state vectors.

Hidden state:

  • Real-valued hidden state that cannot be observed directly.
  • The hidden state has linear dynamics with Gaussian noise and produces observations via a linear model with Gaussian noise.

Optional: driving inputs that directly influence the hidden state.

 time ->
   o   o   o
   |   |   |
 > h > h > h
   |   |   |
   di  di  di

 (o = observation, h = hidden state, di = driving input)

To predict the next output, we need to infer the hidden state.

A linearly transformed Gaussian is again a Gaussian, so the distribution over the hidden state given the data so far is Gaussian. It can be computed efficiently with “Kalman filtering”, i.e. estimating this Gaussian distribution is tractable (a minimal sketch follows).
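
A minimal 1-D Kalman filter sketch (scalar dynamics; the parameters a, c, q, r and the random-walk toy data are assumptions for illustration): the Gaussian belief over the hidden state is kept as a mean and a variance and updated with every new observation.

  import numpy as np

  def kalman_1d(observations, a=1.0, c=1.0, q=0.1, r=1.0, mu0=0.0, var0=1.0):
      """Scalar model: h[t] = a*h[t-1] + N(0, q), o[t] = c*h[t] + N(0, r).
      Returns the filtered means of the hidden state."""
      mu, var = mu0, var0
      means = []
      for o in observations:
          # predict: push the Gaussian belief through the linear dynamics
          mu_pred, var_pred = a * mu, a * a * var + q
          # update: combine the prediction with the new observation
          k = var_pred * c / (c * c * var_pred + r)   # Kalman gain
          mu = mu_pred + k * (o - c * mu_pred)
          var = (1 - k * c) * var_pred
          means.append(mu)
      return np.array(means)

  obs = np.cumsum(np.random.randn(50))   # toy random-walk observations
  print(kalman_1d(obs)[-5:])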

Hidden Markov Models (HMM): http://wiki.movedesign.de/doku.php?id=data_mining:hmm

  • Discrete distribution over hidden states
  • Transitions between states are stochastic
  • We infer a probability distribution over the hidden states (see the forward-filter sketch below)

Limitation: at each time step the model occupies exactly one of its hidden states. With N hidden states it can remember only log2(N) bits about what it has generated so far.
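
A minimal forward-filtering sketch for an HMM (the 2-state transition and emission probabilities are made up): at every step the model occupies exactly one of its N discrete states, and what we compute from the observations is a probability distribution over those states; with N = 2 the state carries at most log2(2) = 1 bit.

  import numpy as np

  # hypothetical 2-state HMM
  A = np.array([[0.7, 0.3],
                [0.4, 0.6]])      # A[i, j] = P(next state j | current state i)
  B = np.array([[0.9, 0.1],
                [0.2, 0.8]])      # B[i, k] = P(observation k | state i)
  pi = np.array([0.5, 0.5])       # initial state distribution

  def forward_filter(obs):
      """Return P(state at time t | observations up to t) for every t."""
      belief = pi * B[:, obs[0]]
      belief /= belief.sum()
      beliefs = [belief]
      for o in obs[1:]:
          belief = (belief @ A) * B[:, o]   # propagate, then weight by the evidence
          belief /= belief.sum()
          beliefs.append(belief)
      return np.array(beliefs)

  print(forward_filter([0, 0, 1, 1, 0]))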

Consider the information that the first half of an utterance contains about the second half:

  • Syntax needs to fit (number, tense)
  • Semantics needs to fit
  • Accent, rate, and volume have to fit

E.g. if 100 bits of information must be carried forward, 2^100 hidden states would be required.

Recurrent neural networks (RNNs):

  • Distributed hidden state: stores information about the past efficiently
  • Non-linear dynamics allow the network to update its hidden state in complicated ways
  • An RNN combines the input vector with its state vector using a fixed function to create a new state vector (see the sketch after this list)
  • RNNs are Turing-complete, so in principle they can learn stateful programs that process a fixed data set
  • The output vector depends not only on the current input, but on the complete input history
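
A minimal vanilla-RNN cell sketch (tanh nonlinearity, random weights; the sizes and names are illustrative): the fixed function combines the current input vector with the previous state vector into a new state vector, so after processing a sequence the state reflects the whole input history.

  import numpy as np

  class SimpleRNNCell:
      """Vanilla RNN step: h[t] = tanh(W_xh x[t] + W_hh h[t-1] + b)."""
      def __init__(self, input_size, hidden_size, seed=0):
          rng = np.random.default_rng(seed)
          self.W_xh = rng.normal(0, 0.1, (hidden_size, input_size))
          self.W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))
          self.b = np.zeros(hidden_size)

      def step(self, x, h_prev):
          return np.tanh(self.W_xh @ x + self.W_hh @ h_prev + self.b)

  cell = SimpleRNNCell(input_size=3, hidden_size=5)
  h = np.zeros(5)
  for x in np.random.randn(10, 3):   # a sequence of 10 input vectors
      h = cell.step(x, h)            # the state carries information about all past inputs
  print(h)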

Example applications:

  • time series forecasting
  • image captioning (assigning words to image elements)
  • music composition (MIDI files)
  • language modelling (e.g. predicting the next word)

Derived models:

  • Recursive NNs
  • Recursive Neural Tensor Networks
  • Hopfield Networks
  • Echo State Networks

Problems:

  • Sensitive to parameter changes
  • Vanishing / exploding gradient problem (see the numerical sketch after this list)
  • Remembering states at all times is expensive
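
A tiny numerical illustration of the vanishing/exploding gradient problem (scalar toy; the weight values 0.9 and 1.1 are arbitrary): backpropagating through T time steps multiplies the gradient by roughly the recurrent weight T times.

  # For a linear scalar RNN h[t] = w * h[t-1], the factor dh[T]/dh[0] equals w**T.
  for w in (0.9, 1.1):
      factors = [w ** T for T in (10, 50, 100)]
      print(f"w={w}: gradient factors after 10/50/100 steps = {factors}")
  # w=0.9 shrinks toward 0 (vanishing); w=1.1 blows up (exploding).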

One solution to the problems above is the LSTM (Long Short-Term Memory) unit.

4 Elements:

  • Information/memory cell: holds information
  • Keep gate: maintains or deletes data in the memory cell; computes how much of the data should be remembered; input: input data + state
  • Write gate: computes how much of the output data should be written into the memory cell; input: input data + state + RNN output from the previous time step
  • Read gate: reads data from the memory cell (signal in (-1, 1)) and sends data from the LSTM to the RNN network; computes how much of the signal should be sent on; input: input data + state
  • The data input and the state are directly connected to all gates.
  • The network output (RNN output) is connected to the write gate.
  • The read gate is connected to the network's processing cell (LSTM output).
  • The write gate sends its output to the memory cell; the memory cell sends data to the read gate.

The gates are logistic (sigmoid) functions (nice derivatives).
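
A minimal LSTM-cell sketch mapping the terminology above onto the standard gate names (keep ≈ forget gate, write ≈ input gate, read ≈ output gate); weights are random and biases are omitted, so this only illustrates the gating structure, not a trained model.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))   # logistic gate activation

  class LSTMCell:
      def __init__(self, input_size, hidden_size, seed=0):
          rng = np.random.default_rng(seed)
          n = input_size + hidden_size          # every gate sees input data + state
          self.W_keep = rng.normal(0, 0.1, (hidden_size, n))
          self.W_write = rng.normal(0, 0.1, (hidden_size, n))
          self.W_read = rng.normal(0, 0.1, (hidden_size, n))
          self.W_cell = rng.normal(0, 0.1, (hidden_size, n))

      def step(self, x, h_prev, c_prev):
          z = np.concatenate([x, h_prev])
          keep = sigmoid(self.W_keep @ z)       # how much of the memory cell to keep
          write = sigmoid(self.W_write @ z)     # how much new information to write
          read = sigmoid(self.W_read @ z)       # how much of the cell to expose
          c = keep * c_prev + write * np.tanh(self.W_cell @ z)   # memory cell update
          h = read * np.tanh(c)                 # output read from the memory cell
          return h, c

  cell = LSTMCell(input_size=3, hidden_size=4)
  h, c = np.zeros(4), np.zeros(4)
  for x in np.random.randn(5, 3):
      h, c = cell.step(x, h, c)
  print(h)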

Perplexity measure: the goal is low perplexity, i.e. high confidence in the predictions. Common benchmark data set: Penn Treebank.
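
Perplexity is the exponentiated average negative log-probability the model assigns to the test tokens; a small sketch with made-up per-token probabilities:

  import numpy as np

  def perplexity(token_probs):
      """exp(mean negative log-likelihood) of the predicted tokens."""
      return float(np.exp(-np.mean(np.log(token_probs))))

  confident = [0.5, 0.6, 0.4, 0.7]      # made-up probabilities from a good model
  uncertain = [0.05, 0.1, 0.02, 0.08]   # made-up probabilities from a poor model
  print(perplexity(confident))          # low perplexity -> high confidence
  print(perplexity(uncertain))          # high perplexity -> low confidence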

Word embedding: an n-dimensional vector of real numbers (often n > 100). Words used in similar contexts end up at similar positions in the vector space. Can be visualized with t-SNE (dimensionality reduction).
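
A small sketch of working with word embeddings (the vectors below are random stand-ins; real ones would come from a trained model, and scikit-learn's TSNE is assumed to be available for the 2-D projection):

  import numpy as np
  from sklearn.manifold import TSNE

  def cosine(u, v):
      return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

  rng = np.random.default_rng(0)
  vocab = ["king", "queen", "apple", "banana", "car"]
  emb = {w: rng.normal(size=128) for w in vocab}   # stand-in 128-dim embeddings

  # with trained embeddings, words used in similar contexts score high here
  print(cosine(emb["king"], emb["queen"]))

  # t-SNE projects the high-dimensional vectors to 2-D for visualization
  X = np.stack([emb[w] for w in vocab])
  X2 = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)
  print(dict(zip(vocab, X2.round(2).tolist())))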
