data_mining:neural_network:sequences:sequence_to_sequence_model

Sequence-to-sequence models

• Encoder network (RNN): reads the input sequence, output: an encoding
• Decoder network (RNN): takes the encoding as input and generates the output sequence

Image captioning:

• Encoder network: learns an encoding of the image
• Feed the encoding to an RNN to generate the caption sequence
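The encoder-decoder pair can be sketched with a plain RNN in NumPy. All sizes, the weight sharing between encoder and decoder, and the greedy decoding loop below are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

H, V = 8, 5                        # hidden size, vocabulary size (toy values)
Wxh = rng.normal(0, 0.1, (H, V))   # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))   # hidden-to-hidden weights
Why = rng.normal(0, 0.1, (V, H))   # hidden-to-output weights

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def encode(tokens):
    """Run a plain RNN over the input; the final hidden state is the encoding."""
    h = np.zeros(H)
    for t in tokens:
        h = np.tanh(Wxh @ one_hot(t) + Whh @ h)
    return h

def decode(encoding, steps=4):
    """Greedy decoder: feed the previous output token back in at each step."""
    h, y, out = encoding, 0, []
    for _ in range(steps):
        h = np.tanh(Wxh @ one_hot(y) + Whh @ h)
        logits = Why @ h
        y = int(np.argmax(logits))
        out.append(y)
    return out

encoding = encode([1, 2, 3])
print(decode(encoding))            # a list of 4 token ids from the toy vocabulary
```

A trained model would replace the random weights; the structure (encode once, then generate step by step) is the same.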

Language Model: $P(y^1, ..., y^{Ty})$

Machine translation: like a language model, but with an encoder network providing the conditioning input

“Conditional language model” $P(y^1, ..., y^{Ty}| x^1, ..., x^{Tx})$

P(english | french)

$\arg\max_{y^1, ..., y^{Ty}} P(y^1, ..., y^{Ty} \mid x^1, ..., x^{Tx})$

Solution: Beam Search

“Jane is visiting …” vs. “Jane is going …”

P(Jane is going|x) > P(Jane is visiting | x)

“Going” is the more probable follow-up word, but the resulting sentence isn't better

⇒ Search for the most likely complete output instead of picking greedily word by word.
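A toy illustration of why greedy decoding fails (the probabilities are made-up numbers, only the “Jane is visiting / going” example comes from the notes): the locally most probable next word can commit the decoder to a less probable full sentence.

```python
# P(third word | x, "Jane is"), illustrative values only
p_next = {"going": 0.5, "visiting": 0.3, "flying": 0.2}

# P(full sentence | x), illustrative values only
p_sentence = {
    "Jane is going to be visiting Africa": 0.02,
    "Jane is visiting Africa in September": 0.06,
}

greedy_word = max(p_next, key=p_next.get)
best_sentence = max(p_sentence, key=p_sentence.get)

print(greedy_word)      # "going" -- locally most probable next word
print(best_sentence)    # the "visiting" sentence wins overall
```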

Step 1:

Vocabulary of 10000 words

$P(y^1 | x)$

Run the encoder; the first decoder step outputs $\hat{y}^1$ (a distribution over the vocabulary)

Parameter Beam width B = 3

Keep track of B most likely words

Step 2:

For each of the B most likely first words:

Hard-wire $\hat{y}^1$ to that candidate word and feed it into the next decoder step to get $P(y^2 \mid x, y^1)$

$P(y^1, y^2 | x) = P(y^1 | x) * P ( y^2 | x, y^1)$

Evaluate all 10000 vocabulary options for each of the B candidates (3 × 10000 = 30000 in total)

Keep only the 3 most likely (first word, second word) combinations; repeat this per step

$B=1$ ⇒ Greedy
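The steps above can be sketched as a minimal beam search. The bigram table `P_next` is a made-up toy model; a real system would use the decoder RNN's softmax output at each step:

```python
from math import log

# Toy next-word model P(w | previous word); values are illustrative only.
P_next = {
    "<s>":    {"jane": 0.9, "in": 0.1},
    "jane":   {"is": 0.9, "visits": 0.1},
    "in":     {"september": 1.0},
    "is":     {"visiting": 0.55, "going": 0.45},
    "visits": {}, "september": {}, "visiting": {}, "going": {},
}

def beam_search(B=3, steps=3):
    # Each hypothesis: (sum of log-probs, list of words).
    beams = [(0.0, ["<s>"])]
    for _ in range(steps):
        candidates = []
        for score, words in beams:
            options = P_next[words[-1]]
            if not options:                   # no continuation: keep as-is
                candidates.append((score, words))
                continue
            for w, p in options.items():      # expand every vocabulary option
                candidates.append((score + log(p), words + [w]))
        # Keep only the B most likely partial sentences.
        beams = sorted(candidates, reverse=True)[:B]
    return beams

best_score, best_words = beam_search(B=3)[0]
print(" ".join(best_words[1:]))               # jane is visiting
```

With `B=1` the loop degenerates to greedy decoding, as noted above.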

Length normalization:

• Probabilities are near zero
• Take logs to make the computation more numerically stable

Long sentences become more unlikely, because their probability is a product of more factors < 1 ⇒ the search is biased toward short outputs

• Normalize with $1/T_y^\alpha$ (a common choice is $\alpha \approx 0.7$)
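The normalized objective is $\frac{1}{T_y^\alpha} \sum_t \log P(y^t \mid x, y^1, ..., y^{t-1})$. A small sketch with made-up probabilities shows the effect:

```python
from math import log

def normalized_score(word_probs, alpha=0.7):
    """Length-normalized sum of log-probabilities; alpha softens the penalty."""
    T = len(word_probs)
    return sum(log(p) for p in word_probs) / (T ** alpha)

def raw_score(word_probs):
    return sum(log(p) for p in word_probs)

short = [0.5, 0.5]                  # 2-word hypothesis, illustrative values
long_ = [0.6, 0.6, 0.6, 0.6, 0.6]   # 5-word hypothesis, illustrative values

# The raw log-probability favors the shorter sentence...
print(raw_score(short) > raw_score(long_))              # True
# ...while normalization removes part of that length bias.
print(normalized_score(short), normalized_score(long_))
```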

How to choose B?

• Large: Better result, but slower
• Small: Worse result, but faster
• Production: 10 - 100
• Research: 1000 - 3000

Beam search is not guaranteed to find the exact maximum, unlike exhaustive search algorithms such as BFS or DFS

Attribute an error to the RNN or to beam search?

RNN computes P(y|x)

Compute $P(y^* \mid x)$ for the good (human) translation $y^*$

Compute $P(\hat{y} \mid x)$ for the algorithm's translation $\hat{y}$

• Case 1: $P(y^* \mid x) > P(\hat{y} \mid x)$: beam search chose $\hat{y}$ anyway ⇒ beam search at fault
• Case 2: $P(y^* \mid x) \le P(\hat{y} \mid x)$: the RNN's probability estimates are at fault
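The attribution rule reduces to one comparison per example; the probability values below are placeholders for the RNN's computed $P(y \mid x)$:

```python
def attribute_error(p_human, p_algo):
    """Compare P(y* | x) against P(y_hat | x) to attribute a translation error."""
    if p_human > p_algo:
        # The model prefers y*, yet beam search returned y_hat:
        # the search failed to find the maximum.
        return "beam search at fault"
    # The model itself assigns y_hat a higher probability than y*:
    # the RNN's probability estimates are wrong.
    return "RNN at fault"

print(attribute_error(p_human=2e-5, p_algo=1e-5))  # beam search at fault
print(attribute_error(p_human=1e-5, p_algo=2e-5))  # RNN at fault
```

Tallying the two cases over a dev set shows which component deserves more work.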

Multiple similar, equally good translations ⇒ compute a score (e.g. BLEU) for how good a translation is.

Attention: work through the sentence part by part, instead of memorizing the complete long sentence in a single encoding.

• Bidirectional RNN over the input
• Attention weights combine the encoder states into a context vector that is fed into the second (decoder) RNN
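One decoder step of this attention mechanism can be sketched in NumPy. The shapes and the scoring function (a plain dot product here) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
Tx, H = 4, 6                                 # input length, hidden size (toy values)
encoder_states = rng.normal(size=(Tx, H))    # bidirectional-RNN states, one per input position
s_prev = rng.normal(size=H)                  # decoder state from the previous step

scores = encoder_states @ s_prev                  # alignment scores e_<t'>
alpha = np.exp(scores) / np.exp(scores).sum()     # attention weights (softmax)
context = alpha @ encoder_states                  # context vector fed to the decoder RNN

print(alpha.sum())     # weights form a probability distribution over input positions
print(context.shape)   # one H-dimensional context vector per decoder step
```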