Sequence to sequence models
- Encoder network (RNN): reads the input sequence and outputs an encoding
- Decoder network (RNN): takes the encoding as input and generates the output sequence
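A minimal sketch of this encoder-decoder setup in PyTorch, assuming a GRU as the RNN; all layer sizes, the vocabulary size, and the `<sos>` token id are made-up for illustration:

```python
# Minimal encoder-decoder sketch (PyTorch); sizes are illustrative only.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=10000, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)

    def forward(self, x):                 # x: (batch, T_x) token ids
        _, h = self.rnn(self.embed(x))    # h: (1, batch, hidden) = the encoding
        return h

class Decoder(nn.Module):
    def __init__(self, vocab_size=10000, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, y_prev, h):         # y_prev: (batch, 1) previous output word
        o, h = self.rnn(self.embed(y_prev), h)
        return self.out(o), h             # logits over the next word, new state

# The encoder's final hidden state initialises the decoder:
enc, dec = Encoder(), Decoder()
x = torch.randint(0, 10000, (1, 7))       # a source sentence of 7 tokens
h = enc(x)
logits, h = dec(torch.tensor([[0]]), h)   # 0 = assumed <sos> token id
```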
Image captioning:
- Encoder network (CNN): learns an encoding of the image
- Feed the encoding to an RNN decoder, which generates the caption word by word
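A rough captioning sketch along the same lines; the tiny CNN below is only a stand-in for a pretrained image encoder (e.g. a ResNet), and all sizes and the assumed `<sos>` id are illustrative:

```python
# Image-captioning sketch: CNN encoding -> initial hidden state of an RNN decoder.
import torch
import torch.nn as nn

cnn = nn.Sequential(                      # stand-in for a pretrained CNN encoder
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 512),                   # project to the decoder's hidden size
)
decoder_rnn = nn.GRU(256, 512, batch_first=True)
embed = nn.Embedding(10000, 256)
to_vocab = nn.Linear(512, 10000)

img = torch.randn(1, 3, 224, 224)
h = cnn(img).unsqueeze(0)                 # (1, batch, 512): the image encoding
token = torch.tensor([[0]])               # assumed <sos> id
for _ in range(5):                        # generate a few caption words greedily
    o, h = decoder_rnn(embed(token), h)
    token = to_vocab(o).argmax(dim=-1)    # pick the most likely next word
```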
Picking the most likely sentence
Language model: $P(y^1, ..., y^{T_y})$
Machine translation: adds an encoder network whose output conditions the decoder
“Conditional language model”: $P(y^1, ..., y^{T_y} \mid x^1, ..., x^{T_x})$
e.g. P(English sentence | French sentence)
Objective: $\arg\max_{y^1, ..., y^{T_y}} P(y^1, ..., y^{T_y} \mid x^1, ..., x^{T_x})$
Solution: Beam Search
Why not greedy?
Example: “Jane is visiting …” vs. “Jane is going …”
$P(\text{Jane is going} \mid x) > P(\text{Jane is visiting} \mid x)$
“going” is the more probable follow-up word after “Jane is”, so greedy decoding picks it, but the full sentence it leads to is a worse translation (toy example below)
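A toy illustration with invented probabilities: greedy decoding commits to the locally most likely next word (“going”) and ends up with the lower-probability full sentence:

```python
# Toy numbers (invented) showing greedy choice vs. joint sentence probability.
# P(y3 | x, "Jane is") for two candidate third words:
p_next = {"going": 0.30, "visiting": 0.20}

# Probability of the rest of the sentence given that choice (also invented):
p_rest = {"going": 0.05, "visiting": 0.40}

for w in p_next:
    joint = p_next[w] * p_rest[w]
    print(f"{w:9s}  P(next)={p_next[w]:.2f}  P(full sentence)={joint:.3f}")
# going      P(next)=0.30  P(full sentence)=0.015
# visiting   P(next)=0.20  P(full sentence)=0.080
# Greedy commits to "going" (higher P(next)) and ends up with the worse sentence.
```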
Beam search
Approximate search for the most likely output sentence (instead of exhaustively scoring every possible sentence).
Step 1:
Vocabulary of 10,000 words
$P(y^1 \mid x)$: run the input through the encoder; the decoder's first softmax outputs $\hat{y}^1$, a distribution over the whole vocabulary
Parameter: beam width $B = 3$
Keep track of the $B$ most likely first words (sketch below)
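A small sketch of step 1, assuming the decoder's first softmax is already available as a length-10,000 probability vector (random numbers stand in for it here):

```python
import numpy as np

B = 3                                            # beam width
p_y1 = np.random.dirichlet(np.ones(10000))       # stand-in for P(y^1 | x) from the first softmax
top_b = np.argsort(p_y1)[-B:][::-1]              # indices of the B most likely first words
beams = [([int(w)], float(np.log(p_y1[w]))) for w in top_b]  # (partial sentence, log probability)
```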
Step 2:
For each of the $B$ most likely first-word candidates (e.g. “in”):
Hard-wire $\hat{y}^1$ to that word and run the decoder one more step to get $\hat{y}^2$, i.e. $P(y^2 \mid x, y^1)$
$P(y^1, y^2 \mid x) = P(y^1 \mid x) \cdot P(y^2 \mid x, y^1)$
Evaluate all 10,000 second words for each of the $B$ candidates ($B \times 10{,}000$ combinations)
Keep only the $B$ most likely combinations of first and second word; repeat this at every step
$B=1$ ⇒ Greedy
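A compact beam-search sketch; `next_word_probs(x, prefix)` is a hypothetical function that would return the decoder's softmax distribution over the next word given the source sentence and the partial translation:

```python
import numpy as np

def beam_search(x, next_word_probs, vocab_size=10000, B=3, max_len=30, eos=1):
    """Approximate arg max over sentences; next_word_probs(x, prefix) -> P(y^t | x, prefix)."""
    beams = [([], 0.0)]                          # (partial sentence, sum of log probabilities)
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            if prefix and prefix[-1] == eos:     # finished sentences are carried over unchanged
                candidates.append((prefix, logp))
                continue
            p = next_word_probs(x, prefix)       # length-vocab_size distribution
            for w in range(vocab_size):          # evaluate all vocab_size continuations
                candidates.append((prefix + [w], logp + np.log(p[w] + 1e-12)))
        # keep only the B most likely partial sentences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
        if all(b[0] and b[0][-1] == eos for b in beams):
            break
    return beams[0]                              # best (sentence, log prob); B = 1 reduces to greedy

# Usage with a dummy model that ignores its inputs:
dummy = lambda x, prefix: np.random.dirichlet(np.ones(10000))
sentence, score = beam_search("<source>", dummy)
```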
Beam search refinement
Length normalization:
- The product of many word probabilities is numerically close to zero
- Take logs (sum of log probabilities) to make the computation numerically stable
- Long sentences become less likely simply because more factors < 1 are multiplied, so unnormalized beam search favors short outputs
- Normalize the log-likelihood by $1/T_y^\alpha$ (a common heuristic is $\alpha \approx 0.7$)
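As a sketch, the length-normalized objective that beam search can rank candidates by is $\frac{1}{T_y^\alpha}\sum_t \log P(y^t \mid x, y^1, ..., y^{t-1})$; the per-word probabilities below are assumed to come from the decoder:

```python
import numpy as np

def normalized_score(word_probs, alpha=0.7):
    """Length-normalized log-likelihood: (1 / T_y^alpha) * sum_t log P(y^t | x, y^1..y^{t-1})."""
    log_probs = np.log(np.asarray(word_probs))
    return log_probs.sum() / (len(word_probs) ** alpha)

# A longer candidate is no longer penalized just for having more factors:
print(normalized_score([0.2, 0.3, 0.25]))             # shorter candidate
print(normalized_score([0.2, 0.3, 0.25, 0.3, 0.3]))   # longer candidate
```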
How to choose B?
- Large $B$: better result, but slower and more memory
- Small $B$: worse result, but faster
- Production systems: $B \approx$ 10–100
- Research systems: $B \approx$ 1000–3000
Unlike exact search algorithms such as BFS or DFS, beam search is not guaranteed to find the exact maximum.
Error analysis in beam search
Attribute an error to the RNN or to beam search?
The RNN computes $P(y \mid x)$
Compute $P(y^* \mid x)$ for the human (good) translation $y^*$
Compute $P(\hat{y} \mid x)$ for the algorithm's translation $\hat{y}$
- Case 1: $P(y^* \mid x) > P(\hat{y} \mid x)$: beam search chose $\hat{y}$ even though the model prefers $y^*$ ⇒ beam search is at fault
- Case 2: $P(y^* \mid x) \le P(\hat{y} \mid x)$: the RNN assigns the worse translation a higher probability ⇒ the RNN is at fault
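A sketch of this attribution over a dev set, assuming the model exposes a hypothetical `log_prob(x, y)` returning $\log P(y \mid x)$:

```python
def attribute_errors(examples, log_prob):
    """examples: list of (x, y_star, y_hat) with human translation y_star and beam output y_hat."""
    counts = {"beam_search": 0, "rnn": 0}
    for x, y_star, y_hat in examples:
        if log_prob(x, y_star) > log_prob(x, y_hat):
            counts["beam_search"] += 1   # model prefers y*, but beam search did not find it
        else:
            counts["rnn"] += 1           # model itself ranks the worse translation higher
    return counts
```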
BLEU score
Multiple translations can be equally good ⇒ BLEU (bilingual evaluation understudy) computes a score for how close the machine translation is to one or more human reference translations, based on modified n-gram precision.
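A minimal BLEU-style sketch (modified n-gram precision plus a brevity penalty); this follows the general idea rather than any particular toolkit's exact implementation:

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, references, max_n=4):
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            return 0.0
        # clip each n-gram count by its maximum count in any single reference
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        precisions.append(clipped / sum(cand_ngrams.values()))
    if min(precisions) == 0:
        return 0.0
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]  # closest reference length
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat is on the mat",
           ["the cat is on the mat", "there is a cat on the mat"]))  # 1.0 for an exact match
```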
Attention model
Translate the sentence part by part instead of forcing the network to remember a complete long sentence in one fixed encoding.
- Bidirectional RNN over the input sentence produces activations for every input position
- Attention weights turn these activations into a context vector that is fed as input to the second (decoder) RNN at each output step
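A sketch of the attention computation for a single decoder time step, with made-up dimensions and a random stand-in for the small alignment network: scores are computed from the previous decoder state and each encoder activation, a softmax turns them into attention weights $\alpha$, and the context vector is their weighted sum:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

T_x, enc_dim, dec_dim = 7, 2 * 64, 128      # bidirectional encoder -> 2*64 features per position
a = np.random.randn(T_x, enc_dim)           # encoder activations a^<t'> (both directions concatenated)
s_prev = np.random.randn(dec_dim)           # previous decoder hidden state s^<t-1>

# Small alignment network e(s_prev, a^<t'>) -> scalar score (weights are random stand-ins)
W = np.random.randn(dec_dim + enc_dim, 1) * 0.01
scores = np.array([np.tanh(np.concatenate([s_prev, a[t]]) @ W).item() for t in range(T_x)])

alphas = softmax(scores)                    # attention weights, sum to 1 over the input positions
context = alphas @ a                        # context vector fed into the decoder RNN at this step
```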