Image captioning: same encoder-decoder idea (a CNN encodes the image, an RNN decodes the caption word by word)
Language model: $P(y^{1}, \dots, y^{T_y})$
Machine translation: adds an encoder network that first reads the input sentence; the decoder is then a language model conditioned on the encoder output
“Conditional language model”: $P(y^{1}, \dots, y^{T_y} \mid x^{1}, \dots, x^{T_x})$
e.g. French → English translation: $P(\text{English sentence} \mid \text{French sentence})$
Goal: $\arg\max_{y^{1}, \dots, y^{T_y}} P(y^{1}, \dots, y^{T_y} \mid x^{1}, \dots, x^{T_x})$
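A minimal sketch of what this objective multiplies together, assuming the decoder's per-step softmax values are already available; `step_probs` is a hypothetical list of those values, not anything from the course:

```python
import math

# Chain rule the decoder implements:
# P(y^1..y^Ty | x) = prod_t P(y^t | x, y^1..y^{t-1})
def sequence_log_prob(step_probs):
    """Log-probability of a full output sentence from its per-step
    conditional probabilities (log space avoids numerical underflow)."""
    return sum(math.log(p) for p in step_probs)

# e.g. a 4-word translation whose per-step conditionals were:
print(math.exp(sequence_log_prob([0.5, 0.4, 0.6, 0.9])))  # 0.108
```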
Searching over all possible output sentences exactly is intractable ⇒ solution: Beam Search (approximate)
Why not greedy decoding? Compare two candidate translations: “Jane is visiting …” vs. “Jane is going …”
$P(\text{Jane is going} \mid x) > P(\text{Jane is visiting} \mid x)$
“going” is the more probable word to follow “Jane is”, but the full sentence it leads to is not better
⇒ Search for the most likely complete output sentence instead of picking word by word
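Toy illustration of this greedy failure, with made-up probabilities chosen to mimic the “going” vs. “visiting” example:

```python
# P(next word | x, "Jane is") -- made-up numbers
P_next = {"visiting": 0.30, "going": 0.45}
greedy_pick = max(P_next, key=P_next.get)   # "going": locally best

# ...but over complete sentences the ranking flips (also made-up):
P_sentence = {
    "Jane is visiting ...": 0.012,
    "Jane is going ...": 0.007,
}
best_sentence = max(P_sentence, key=P_sentence.get)  # the "visiting" one
print(greedy_pick, "|", best_sentence)
```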
Step 1:
Vocabulary of 10,000 words
Run the input sentence through the encoder; the first decoder step outputs a softmax over the whole vocabulary: $\hat{y}^{1} = P(y^{1} \mid x)$
Parameter: beam width $B = 3$
Keep track of the $B$ most likely first words, not just the single best (sketch below)
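A sketch of step 1, assuming the first decoder softmax is available as a probability vector; NumPy and the Dirichlet stand-in are only there to make it runnable:

```python
import numpy as np

B = 3
y1_probs = np.random.dirichlet(np.ones(10_000))  # stand-in for P(y^1 | x)
top_B = np.argsort(y1_probs)[-B:][::-1]          # indices of the B most likely first words
print(top_B, y1_probs[top_B])
```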
Step 2:
For each of the $B$ most likely first words:
Hard-wire that word in as $\hat{y}^{1}$ at the next decoder step to get $P(y^{2} \mid x, y^{1})$, e.g. $P(y^{2} \mid x, \text{“in”})$
$P(y^{1}, y^{2} \mid x) = P(y^{1} \mid x) \cdot P(y^{2} \mid x, y^{1})$
Evaluate all 10,000 vocabulary options for each of the $B$ candidates ⇒ $3 \times 10{,}000 = 30{,}000$ pairs
Keep only the 3 most likely (first word, second word) pairs; prune the same way at every later step
$B=1$ ⇒ Greedy
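Putting steps 1 and 2 together: a self-contained beam search sketch. `cond_prob` is a hypothetical callback standing in for one decoder step (the real model would be the RNN softmax), and the toy model below exists only so the sketch runs:

```python
import math

def beam_search(cond_prob, vocab, B=3, max_len=10, eos="<EOS>"):
    """Keep the B highest log-probability prefixes at every step.
    cond_prob(prefix) -> {word: P(word | x, prefix)} is one decoder step."""
    beams = [((), 0.0)]              # (prefix, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            probs = cond_prob(prefix)
            for w in vocab:          # evaluate all |V| options per beam: B x |V| pairs
                candidates.append((prefix + (w,), lp + math.log(probs[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, lp in candidates[:B]:   # prune back down to B
            (completed if prefix[-1] == eos else beams).append((prefix, lp))
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])

# Tiny made-up 3-word "model" so the sketch runs end to end:
V = ["in", "jane", "<EOS>"]
def toy_cond_prob(prefix):
    if not prefix:
        return {"in": 0.5, "jane": 0.4, "<EOS>": 0.1}
    return {"in": 0.2, "jane": 0.2, "<EOS>": 0.6}

print(beam_search(toy_cond_prob, V, B=2, max_len=5))
```

With B = 1 the pruning keeps a single prefix per step, which reduces exactly to greedy search.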
Length normalization:
Long sentences get tiny probabilities because many factors $< 1$ are multiplied (also risks numerical underflow) ⇒ maximize a length-normalized log objective instead: $\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{t} \mid x, y^{1}, \dots, y^{t-1})$, with $\alpha \approx 0.7$ as a heuristic
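A sketch of this normalized objective; the numbers are made up to show that normalization stops longer sentences from always losing:

```python
import math

def normalized_log_prob(step_probs, alpha=0.7):
    """Sum of log-probs divided by T_y^alpha (alpha ~ 0.7 heuristic)."""
    Ty = len(step_probs)
    return sum(math.log(p) for p in step_probs) / (Ty ** alpha)

short = [0.5, 0.5]                  # raw product 0.25
long_ = [0.7, 0.7, 0.7, 0.7, 0.7]   # raw product ~0.17, yet per-word more confident
print(normalized_log_prob(short))   # ~ -0.85
print(normalized_log_prob(long_))   # ~ -0.58  -> the longer sentence now wins
```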
How to choose $B$? Large $B$: better results but slower and more memory; small $B$: faster but worse results. Production systems often use $B \approx 10$; research papers push to 100–1000. Gains diminish as $B$ grows.
Unlike exact search algorithms such as BFS or DFS, beam search is not guaranteed to find the exact maximum; it is a much faster approximate search.
Attribute an error to the RNN or to beam search?
The RNN computes $P(y \mid x)$; for a mistranslated example, compute both:
Compute $P(y^{*} \mid x)$ for a good (human) translation $y^{*}$
Compute $P(\hat{y} \mid x)$ for the algorithm's translation $\hat{y}$
If $P(y^{*} \mid x) > P(\hat{y} \mid x)$: beam search missed the higher-probability sentence ⇒ beam search at fault
If $P(y^{*} \mid x) \le P(\hat{y} \mid x)$: the RNN rates the worse sentence higher ⇒ RNN at fault
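A sketch of this bookkeeping, assuming both probabilities come from the same RNN as log-probs (the names and numbers are hypothetical):

```python
def attribute_error(log_p_human, log_p_algo):
    """Compare P(y*|x) and P(y_hat|x) for one mistranslated dev example."""
    if log_p_human > log_p_algo:
        return "beam search at fault (missed a higher-probability sentence)"
    return "RNN at fault (prefers the worse translation)"

print(attribute_error(log_p_human=-12.3, log_p_algo=-11.8))  # RNN at fault
```

Tallying these verdicts over the dev set shows whether to invest in a larger B or in improving the model.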
Multiple similar, equally good translations exist ⇒ need an automatic score for how good a translation is: BLEU (modified n-gram precision against reference translations)
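A sketch of the clipping idea behind BLEU's modified unigram precision; real BLEU also combines higher-order n-grams and a brevity penalty:

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Count each candidate word at most as often as it appears
    in the reference (clipping), then divide by candidate length."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(count, ref[w]) for w, count in cand.items())
    return clipped / sum(cand.values())

# A degenerate candidate is punished by the clipping:
print(modified_unigram_precision("the the the the",
                                 "the cat is on the mat"))  # 0.5
```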
Attention intuition: a human translator works through a long sentence part by part instead of memorizing the complete sentence first; forcing the encoder to compress everything into one vector hurts on long sentences.