Sequence to sequence models

  • Encoder network (RNN): outputs an encoding of the input sequence
  • Decoder network: takes the encoding as input and generates the output sequence

Image captioning:

  • Encoder network: Learn encoding of an image
  • Feed the encoding to an RNN that generates the caption sequence

Language Model: $P(y^1, ..., y^{T_y})$

Machine translation: The decoder is a language model that takes the encoder network's output as its input

“Conditional language model” $P(y^1, ..., y^{T_y} | x^1, ..., x^{T_x})$

P(english | french)

$\arg\max_{y^1, ..., y^{T_y}} P(y^1, ..., y^{T_y} | x^1, ..., x^{T_x})$
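
Each candidate sequence is scored by the chain rule, one decoder step at a time (the general form of the two-word factorization used in Step 2 below, writing $x$ for the whole input):

$P(y^1, ..., y^{T_y} | x) = \prod_{t=1}^{T_y} P(y^t | x, y^1, ..., y^{t-1})$

With a vocabulary of $|V|$ words there are $|V|^{T_y}$ sequences of length $T_y$, so the arg max cannot be found by enumerating all of them and an approximate search is used instead.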

Solution: Beam Search

Jane is visiting … Jane is going …

P(Jane is going|x) > P(Jane is visiting | x)

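Greedy search would pick "going" because it is the more probable next word, but the resulting sentence is not the better translation.

The search therefore has to target the most likely complete output, not the most likely next word at each step.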
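
To make the point concrete, here is a minimal Python sketch that compares greedy decoding with exhaustive scoring. The table `TOY_MODEL`, its probabilities and the continuation words are all invented for illustration; they do not come from a trained model or from the example sentences above.

```python
# Hypothetical conditional probabilities P(next word | prefix).
# All numbers and continuation words are invented for illustration.
TOY_MODEL = {
    ("Jane", "is"): {"going": 0.6, "visiting": 0.4},
    ("Jane", "is", "going"): {"home": 0.4, "away": 0.3, "out": 0.3},
    ("Jane", "is", "visiting"): {"friends": 0.9, "home": 0.1},
}


def greedy_decode(prefix, steps):
    """Pick the single most probable next word at every step."""
    seq, prob = tuple(prefix), 1.0
    for _ in range(steps):
        dist = TOY_MODEL[seq]
        word = max(dist, key=dist.get)
        prob *= dist[word]
        seq += (word,)
    return seq, prob


def exhaustive_decode(prefix, steps):
    """Score every possible continuation and return the most probable one."""
    frontier = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        frontier = [
            (seq + (word,), prob * p)
            for seq, prob in frontier
            for word, p in TOY_MODEL[seq].items()
        ]
    return max(frontier, key=lambda item: item[1])


greedy_seq, greedy_prob = greedy_decode(("Jane", "is"), steps=2)
best_seq, best_prob = exhaustive_decode(("Jane", "is"), steps=2)
print(greedy_seq, round(greedy_prob, 2))
print(best_seq, round(best_prob, 2))
```

In this toy table the greedy path commits to "going" (0.6 > 0.4) and ends with a joint probability of about 0.24, while the path through "visiting" reaches about 0.36. Exhaustive scoring only works here because the table is tiny.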

Step 1:

Vocabulary of 10000 words

$P(y^1 | x)$

Encoder ⇒ Output $\hat{y}^1$

Parameter Beam width B = 3

Keep track of the B most likely first words

Step 2:

For B most likely choices:

For candidate word c1: set $\hat{y}^1$ to that word, feed it into the next decoder step, and read $P(y^2 | x, y^1 = c1)$ off the output $\hat{y}^2$

$P(y^1, y^2 | x) = P(y^1 | x) \cdot P(y^2 | x, y^1)$

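Evaluate all 10000 options for the second word after each of the 3 kept first words: 3 × 10000 = 30000 candidates

Keep only the 3 most likely (first word, second word) pairs, then repeat the same step for the third word, and so on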

$B=1$ ⇒ Greedy
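
Steps 1 and 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: `next_log_probs` is a hypothetical callback standing in for the decoder RNN, the `<eos>` end marker and the toy table `TOY` are invented, and scores are summed log probabilities rather than multiplied probabilities.

```python
import math


def beam_search(next_log_probs, beam_width, max_len, eos="<eos>"):
    """Return (sequence, log-probability) pairs, best first.

    next_log_probs(prefix) is a hypothetical stand-in for the decoder:
    it maps a tuple of words to {word: log P(word | x, prefix)}.
    """
    beams = [((), 0.0)]  # partial hypotheses still being extended
    finished = []        # hypotheses that already ended with eos
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, logp in next_log_probs(seq).items():
                candidates.append((seq + (word,), score + logp))
        # Keep only the B most likely candidates across all beams.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    return sorted(finished + beams, key=lambda item: item[1], reverse=True)


# Invented toy distribution; unknown prefixes end the sentence.
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("b",): {"x": 0.9, "y": 0.1},
}


def toy_next_log_probs(prefix):
    dist = TOY.get(prefix, {"<eos>": 1.0})
    return {word: math.log(p) for word, p in dist.items()}


print(beam_search(toy_next_log_probs, beam_width=1, max_len=3)[0][0])
# ('a', 'x', '<eos>')
print(beam_search(toy_next_log_probs, beam_width=2, max_len=3)[0][0])
# ('b', 'x', '<eos>')
```

With a beam width of 1 the sketch behaves like greedy search and commits to "a"; with a beam width of 2 it keeps "b" alive long enough to find the more probable sequence.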

Length normalization:

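  • Products of many probabilities are close to zero and can underflow
  • Maximize the sum of log probabilities instead, which is numerically more stable

Longer sentences get a lower probability because more factors smaller than 1 are multiplied, so the unnormalized objective favours short outputs

  • Normalize the summed log probabilities with $1/T_y^\alpha$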
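
A small Python sketch of the normalized score $\frac{1}{T_y^\alpha} \sum_t \log P(y^t | x, y^1, ..., y^{t-1})$. The function name `normalized_score`, the default `alpha=0.7` and the per-word probabilities are assumptions chosen for illustration; the notes do not fix a value for $\alpha$.

```python
import math


def normalized_score(token_probs, alpha=0.7):
    """Sum of log probabilities divided by T_y ** alpha.

    token_probs holds the per-step probabilities of one hypothesis.
    alpha=0 disables normalization; alpha=1 averages over the length.
    The default of 0.7 is an assumed value, not taken from the notes.
    """
    log_prob = sum(math.log(p) for p in token_probs)
    return log_prob / (len(token_probs) ** alpha)


# Invented per-word probabilities for a short and a long hypothesis.
short = [0.5, 0.5]
long_ = [0.7, 0.7, 0.7, 0.7]

# Without normalization the short hypothesis scores higher ...
print(normalized_score(short, alpha=0.0) > normalized_score(long_, alpha=0.0))
# ... with normalization the long one does.
print(normalized_score(long_, alpha=0.7) > normalized_score(short, alpha=0.7))
```

Both comparisons print `True`: the long hypothesis has the better per-word probabilities but loses on the raw sum of logs simply because it has more terms.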

How to choose B?

  • Large: Better result, but slower
  • Small: Worse result, but faster
  • Production: 10 - 100
  • Research: 1000 - 3000

Beam search is a heuristic: unlike exact search algorithms such as BFS or DFS, it is not guaranteed to find the exact maximum

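Error analysis: attribute an error to the RNN or to beam search?

The RNN computes P(y|x), so it can score any given translation

Compute P(good translation|x) for a known good translation

Compute P(algo translation|x) for the translation the algorithm produced

  • Case 1: the good translation has the higher value: the RNN preferred it, but beam search still chose the algo translation ⇒ beam search is at fault
  • Case 2: the algo translation has the higher value: the RNN rates the worse translation higher ⇒ the RNN is at fault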
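
The two cases can be tallied over a set of mistranslated examples with a few lines of Python. This is a sketch under assumptions: the function names and the log-probability values are invented, and a tie is attributed to the RNN, which the notes do not specify.

```python
from collections import Counter


def attribute_error(logp_good, logp_algo):
    """Blame beam search or the RNN for one mistranslated example.

    logp_good: log P(good translation | x) as scored by the RNN
    logp_algo: log P(algo translation | x) as scored by the RNN
    """
    if logp_good > logp_algo:
        # The RNN preferred the good translation; the search missed it.
        return "beam search"
    # The RNN scored the worse translation at least as high (ties assumed here).
    return "RNN"


def error_breakdown(pairs):
    """Count how often each component is at fault."""
    return Counter(attribute_error(good, algo) for good, algo in pairs)


# Invented (logp_good, logp_algo) pairs for three mistranslated examples.
examples = [(-4.0, -6.5), (-9.0, -7.2), (-5.5, -5.1)]
print(error_breakdown(examples))
```

The component that accounts for most errors is the one worth working on first: a larger beam width if beam search dominates, the model itself if the RNN dominates.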

Multiple similar good translations ⇒ Compute a score that measures how good a translation is.

Work through the sentence part by part, instead of memorizing the complete long sentence.

  • Bidirectional RNN for input
  • Attention weights combine the outputs of the first RNN into a context that is the input to the second RNN
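
How the context is formed can be sketched in plain Python, assuming the weights are a softmax over one relevance score per input position and the context is the weighted sum of the encoder states. How the scores themselves are computed is not covered in these notes, so they are simply passed in; all names and numbers are illustrative.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(encoder_states, scores):
    """Weighted sum of encoder states, one weight per input position."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Invented encoder states for a three-word input and invented scores.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = context_vector(states, scores=[2.0, 0.0, 0.0])
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The decoder receives a fresh context at every output step, so it can focus on a different part of the input each time.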
