
Sequence-to-sequence models

  • Encoder network (RNN): reads the input sequence and outputs an encoding
  • Decoder network (RNN): takes the encoding as input and generates the output sequence
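A minimal PyTorch sketch of this wiring; the GRU choice and all layer sizes are illustrative assumptions, not from the notes:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_in=10000, vocab_out=10000, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_in, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.tgt_emb = nn.Embedding(vocab_out, emb)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_out)

    def forward(self, src, tgt):
        _, encoding = self.encoder(self.src_emb(src))           # final hidden state = encoding
        dec_out, _ = self.decoder(self.tgt_emb(tgt), encoding)  # encoding seeds the decoder
        return self.out(dec_out)                                # per-step scores over the output vocabulary
```

At inference time the decoder is run one step at a time, feeding each predicted word back in as the next input.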

Image captioning:

  • Encoder network (e.g., a CNN): learns an encoding of the image
  • Feed the encoding to an RNN, which generates the caption sequence
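A sketch of the captioning setup, assuming a torchvision ResNet as the image encoder (the specific CNN, the projection layer, and all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

cnn = models.resnet50(weights=None)   # image encoder; any pretrained CNN works
cnn.fc = nn.Identity()                # drop the classifier, keep the 2048-d feature vector
project = nn.Linear(2048, 512)        # map image features to the RNN hidden size
decoder = nn.GRU(256, 512, batch_first=True)

img = torch.randn(1, 3, 224, 224)
h0 = project(cnn(img)).unsqueeze(0)   # image encoding becomes the RNN's initial state
caption_emb = torch.randn(1, 7, 256)  # embedded caption tokens (illustrative)
logits, _ = decoder(caption_emb, h0)  # the RNN generates the caption sequence
```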

Language model: $P(y^1, \dots, y^{T_y})$

Machine translation: adds an encoder network; its output conditions the language model (the decoder).

“Conditional language model”: $P(y^1, \dots, y^{T_y} \mid x^1, \dots, x^{T_x})$

E.g., French-to-English translation: $P(\text{English} \mid \text{French})$

Find $\arg\max_{y^1, \dots, y^{T_y}} P(y^1, \dots, y^{T_y} \mid x^1, \dots, x^{T_x})$
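The decoder realizes this conditional probability one softmax at a time, via the chain rule:

$P(y^1, \dots, y^{T_y} \mid x) = \prod_{t=1}^{T_y} P(y^t \mid x, y^1, \dots, y^{t-1})$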

Solution: Beam search (exhaustive search over all possible output sentences is intractable)

"Jane is visiting …" vs. "Jane is going …"

$P(\text{Jane is going} \mid x) > P(\text{Jane is visiting} \mid x)$

"going" is the more probable follow-up word, but the resulting sentence is not the better translation

⇒ search for the most likely complete output sequence instead of picking word by word

Step 1:

Vocabulary of 10,000 words

Run the input through the encoder; the decoder's first softmax output $\hat{y}^1$ gives $P(y^1 \mid x)$

Parameter: beam width $B = 3$

Keep track of the $B$ most likely first words

Step 2:

For each of the $B$ most likely first words:

Hard-wire $\hat{y}^1$ to that word, feed it into the next decoder step, and read off $\hat{y}^2$ to get $P(y^2 \mid x, y^1)$ (e.g., $P(y^2 \mid x, \text{in})$)

$P(y^1, y^2 \mid x) = P(y^1 \mid x) \cdot P(y^2 \mid x, y^1)$

Evaluate all 10,000 vocabulary options for each of the $B$ first words

Keep only the $B = 3$ most likely choices of first and second word; repeat this per step

$B=1$ ⇒ Greedy
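A pure-Python sketch of these steps; `step_log_probs` is a hypothetical stand-in for the decoder softmax hard-wired to a given prefix, and end-of-sentence handling is omitted for brevity:

```python
import math

def beam_search(step_log_probs, vocab_size, beam_width=3, max_len=10):
    """step_log_probs(prefix) -> [log P(y_t = w | x, prefix) for each word w]."""
    beams = [([], 0.0)]                       # (prefix, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix)    # decoder softmax given this prefix
            for w in range(vocab_size):       # evaluate all vocab options per beam
                candidates.append((prefix + [w], score + log_p[w]))
        # keep only the B most likely partial translations
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```

With `beam_width=1`, only the single best word survives each step, which is exactly greedy search.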

Length normalization:

  • Products of per-word probabilities quickly become near zero
  • Take logs and sum them to make the objective numerically stable

Long sentences end up less likely simply because more factors $< 1$ are multiplied

  • Normalize the summed log-probability with $1/T_y^\alpha$
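A small sketch of the normalized objective; the example shows how normalization stops the score from automatically punishing longer sentences ($\alpha = 0.7$ is a common heuristic, not from the notes):

```python
import math

def normalized_score(token_probs, alpha=0.7):
    """(1 / T_y**alpha) * sum_t log P(y^t | x, y^1..y^{t-1})."""
    log_sum = sum(math.log(p) for p in token_probs)  # logs avoid underflow from long products
    return log_sum / (len(token_probs) ** alpha)

short = [0.5, 0.5]               # raw product 0.25
long_ = [0.7] * 8                # raw product ~0.058, lower only because of more factors
print(normalized_score(short))   # ~ -0.85
print(normalized_score(long_))   # ~ -0.67: longer sentence now wins
```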

How to choose B?

  • Large $B$: better result, but slower
  • Small $B$: worse result, but faster
  • Production: $B \approx 10$–$100$
  • Research: $B \approx 1000$–$3000$

Unlike exact search algorithms such as BFS or DFS, beam search is not guaranteed to find the exact maximum

Attribute an error to the RNN or to beam search?

The RNN computes $P(y \mid x)$; score both the good (human) translation and the algorithm's translation:

Compute $P(\text{good translation} \mid x)$

Compute $P(\text{algo translation} \mid x)$

  • Case 1: the good translation has the higher value: beam search chose the algo translation although the RNN prefers the good one ⇒ beam search at fault
  • Case 2: the algo translation has the higher value: the RNN rates the worse translation higher ⇒ RNN at fault
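A tiny helper capturing this decision rule (names are illustrative; the log-probabilities come from scoring both sentences with the RNN):

```python
def attribute_error(log_p_good, log_p_algo):
    """Compare log P(good translation | x) with log P(algo translation | x)."""
    if log_p_good > log_p_algo:
        return "beam search at fault"  # RNN prefers the good translation, but the search missed it
    return "RNN at fault"              # the model itself scores the worse translation higher
```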

When multiple similarly good translations exist ⇒ compute a score for how good a translation is (e.g., BLEU).

Attention: work through the sentence part by part, instead of remembering the complete long sentence in a single encoding.

  • Bidirectional RNN over the input sequence
  • Attention weights combine the input activations into a context vector, which is fed as input to a second (decoder) RNN
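A NumPy sketch of one attention step; the dot-product scoring is a simplification (the lecture version computes the scores with a small learned network), and all names and sizes are illustrative:

```python
import numpy as np

def attention_context(a, s_prev):
    """a: (T_x, d) bidirectional encoder activations; s_prev: (d,) previous decoder state."""
    e = a @ s_prev               # alignment score per input position (simplified dot product)
    alpha = np.exp(e - e.max())  # softmax over input positions ...
    alpha /= alpha.sum()         # ... gives attention weights that sum to 1
    return alpha @ a             # context = attention-weighted sum of encoder activations

a = np.random.randn(6, 8)                             # 6 input positions, 8-dim activations
context = attention_context(a, np.random.randn(8))    # fed into the decoder RNN at this step
```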