Image captioning: same encoder-decoder idea (a CNN encodes the image, an RNN decodes the caption word by word)
Language model: $P(y^{1}, \dots, y^{T_y})$
Machine translation: adds an encoder network that first reads the input sentence; the decoder is then a language model conditioned on the encoder output
“Conditional language model”: $P(y^{1}, \dots, y^{T_y} \mid x^{1}, \dots, x^{T_x})$
e.g. French → English translation: $P(\text{English sentence} \mid \text{French sentence})$
Goal: $\arg\max_{y^{1}, \dots, y^{T_y}} P(y^{1}, \dots, y^{T_y} \mid x^{1}, \dots, x^{T_x})$
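A minimal sketch of what this objective multiplies together, assuming the decoder's per-step softmax values are already available; `step_probs` is a hypothetical list of those values, not anything from the course:

```python
import math

# Chain rule the decoder implements:
# P(y^1..y^Ty | x) = prod_t P(y^t | x, y^1..y^{t-1})
def sequence_log_prob(step_probs):
    """Log-probability of a full output sentence from its per-step
    conditional probabilities (log space avoids numerical underflow)."""
    return sum(math.log(p) for p in step_probs)

# e.g. a 4-word translation whose per-step conditionals were:
print(math.exp(sequence_log_prob([0.5, 0.4, 0.6, 0.9])))  # 0.108
```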
Searching over all possible output sentences exactly is intractable ⇒ solution: Beam Search (approximate)
Why not greedy decoding? Compare two candidate translations: “Jane is visiting …” vs. “Jane is going …”
$P(\text{Jane is going} \mid x) > P(\text{Jane is visiting} \mid x)$
“going” is the more probable word to follow “Jane is”, but the full sentence it leads to is not better
⇒ Search for the most likely complete output sentence instead of picking word by word
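Toy illustration of this greedy failure, with made-up probabilities chosen to mimic the “going” vs. “visiting” example:

```python
# P(next word | x, "Jane is") -- made-up numbers
P_next = {"visiting": 0.30, "going": 0.45}
greedy_pick = max(P_next, key=P_next.get)   # "going": locally best

# ...but over complete sentences the ranking flips (also made-up):
P_sentence = {
    "Jane is visiting ...": 0.012,
    "Jane is going ...": 0.007,
}
best_sentence = max(P_sentence, key=P_sentence.get)  # the "visiting" one
print(greedy_pick, "|", best_sentence)
```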
Step 1:
Vocabulary of 10,000 words
Run the input sentence through the encoder; the first decoder step outputs a softmax over the whole vocabulary: $\hat{y}^{1} = P(y^{1} \mid x)$
Parameter: beam width $B = 3$
Keep track of the $B$ most likely first words, not just the single best (sketch below)
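A sketch of step 1, assuming the first decoder softmax is available as a probability vector; NumPy and the Dirichlet stand-in are only there to make it runnable:

```python
import numpy as np

B = 3
y1_probs = np.random.dirichlet(np.ones(10_000))  # stand-in for P(y^1 | x)
top_B = np.argsort(y1_probs)[-B:][::-1]          # indices of the B most likely first words
print(top_B, y1_probs[top_B])
```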
Step 2:
For each of the $B$ most likely first words:
Hard-wire that word in as $\hat{y}^{1}$ at the next decoder step to get $P(y^{2} \mid x, y^{1})$, e.g. $P(y^{2} \mid x, \text{“in”})$
$P(y^{1}, y^{2} \mid x) = P(y^{1} \mid x) \cdot P(y^{2} \mid x, y^{1})$
Evaluate all 10,000 vocabulary options for each of the $B$ candidates ⇒ $3 \times 10{,}000 = 30{,}000$ pairs
Keep only the 3 most likely (first word, second word) pairs; prune the same way at every later step
$B=1$ ⇒ Greedy
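Putting steps 1 and 2 together: a self-contained beam search sketch. `cond_prob` is a hypothetical callback standing in for one decoder step (the real model would be the RNN softmax), and the toy model below exists only so the sketch runs:

```python
import math

def beam_search(cond_prob, vocab, B=3, max_len=10, eos="<EOS>"):
    """Keep the B highest log-probability prefixes at every step.
    cond_prob(prefix) -> {word: P(word | x, prefix)} is one decoder step."""
    beams = [((), 0.0)]              # (prefix, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            probs = cond_prob(prefix)
            for w in vocab:          # evaluate all |V| options per beam: B x |V| pairs
                candidates.append((prefix + (w,), lp + math.log(probs[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, lp in candidates[:B]:   # prune back down to B
            (completed if prefix[-1] == eos else beams).append((prefix, lp))
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])

# Tiny made-up 3-word "model" so the sketch runs end to end:
V = ["in", "jane", "<EOS>"]
def toy_cond_prob(prefix):
    if not prefix:
        return {"in": 0.5, "jane": 0.4, "<EOS>": 0.1}
    return {"in": 0.2, "jane": 0.2, "<EOS>": 0.6}

print(beam_search(toy_cond_prob, V, B=2, max_len=5))
```

With B = 1 the pruning keeps a single prefix per step, which reduces exactly to greedy search.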
Length normalization:
Long sentences get tiny probabilities because many factors $< 1$ are multiplied (also risks numerical underflow) ⇒ maximize a length-normalized log objective instead: $\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{t} \mid x, y^{1}, \dots, y^{t-1})$, with $\alpha \approx 0.7$ as a heuristic
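A sketch of this normalized objective; the numbers are made up to show that normalization stops longer sentences from always losing:

```python
import math

def normalized_log_prob(step_probs, alpha=0.7):
    """Sum of log-probs divided by T_y^alpha (alpha ~ 0.7 heuristic)."""
    Ty = len(step_probs)
    return sum(math.log(p) for p in step_probs) / (Ty ** alpha)

short = [0.5, 0.5]                  # raw product 0.25
long_ = [0.7, 0.7, 0.7, 0.7, 0.7]   # raw product ~0.17, yet per-word more confident
print(normalized_log_prob(short))   # ~ -0.85
print(normalized_log_prob(long_))   # ~ -0.58  -> the longer sentence now wins
```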
How to choose $B$? Large $B$: better results but slower and more memory; small $B$: faster but worse results. Production systems often use $B \approx 10$; research papers push to 100–1000. Gains diminish as $B$ grows.
Unlike exact search algorithms such as BFS or DFS, beam search is not guaranteed to find the exact maximum; it is a much faster approximate search.
Attribute an error to the RNN or to beam search?
The RNN computes $P(y \mid x)$; for a mistranslated example, compute both:
Compute $P(y^{*} \mid x)$ for a good (human) translation $y^{*}$
Compute $P(\hat{y} \mid x)$ for the algorithm's translation $\hat{y}$
If $P(y^{*} \mid x) > P(\hat{y} \mid x)$: beam search missed the higher-probability sentence ⇒ beam search at fault
If $P(y^{*} \mid x) \le P(\hat{y} \mid x)$: the RNN rates the worse sentence higher ⇒ RNN at fault
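A sketch of this bookkeeping, assuming both probabilities come from the same RNN as log-probs (the names and numbers are hypothetical):

```python
def attribute_error(log_p_human, log_p_algo):
    """Compare P(y*|x) and P(y_hat|x) for one mistranslated dev example."""
    if log_p_human > log_p_algo:
        return "beam search at fault (missed a higher-probability sentence)"
    return "RNN at fault (prefers the worse translation)"

print(attribute_error(log_p_human=-12.3, log_p_algo=-11.8))  # RNN at fault
```

Tallying these verdicts over the dev set shows whether to invest in a larger B or in improving the model.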
Multiple similar, equally good translations exist ⇒ need an automatic score for how good a translation is: BLEU (modified n-gram precision against reference translations)
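A sketch of the clipping idea behind BLEU's modified unigram precision; real BLEU also combines higher-order n-grams and a brevity penalty:

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Count each candidate word at most as often as it appears
    in the reference (clipping), then divide by candidate length."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(count, ref[w]) for w, count in cand.items())
    return clipped / sum(cand.values())

# A degenerate candidate is punished by the clipping:
print(modified_unigram_precision("the the the the",
                                 "the cat is on the mat"))  # 0.5
```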
Attention intuition: a human translator works through a long sentence part by part instead of memorizing the complete sentence first; forcing the encoder to compress everything into one vector hurts on long sentences.