====== Sequence to sequence models ======

  * Encoder network (RNN): output: an encoding of the input sequence
  * Decoder network: input: the encoding; output: the target sequence

Image captioning:
  * Encoder network: learns an encoding of the image
  * Feed the encoding into an RNN that generates the caption sequence

===== Picking the most likely sentence =====

Language model: $P(y^{1}, ..., y^{T_y})$

Machine translation has an encoder network in front, so it is a "conditional language model":

$P(y^{1}, ..., y^{T_y} \mid x^{1}, ..., x^{T_x})$, e.g. P(English sentence | French sentence)

Goal: $\arg\max_{y^{1}, ..., y^{T_y}} P(y^{1}, ..., y^{T_y} \mid x^{1}, ..., x^{T_x})$

Solution: beam search

==== Why not greedy? ====

  * "Jane is visiting ..." vs. "Jane is going ..."
  * $P(\text{Jane is going} \mid x) > P(\text{Jane is visiting} \mid x)$
  * "going" is the more probable follow-up word, but the resulting sentence is not the better translation

==== Beam search ====

Searches for the most likely output sequence (a runnable sketch is in the code section at the end of this note).

**Step 1:** Vocabulary of 10000 words. Run the encoder and compute $P(y^{1} \mid x)$ over the vocabulary for the first output word $\hat{y}^{1}$. With beam width $B = 3$, keep track of the $B$ most likely first words.

**Step 2:** For each of the $B$ most likely first words: hard-wire $\hat{y}^{1}$ to that word and run the next decoder step to get $P(y^{2} \mid x, y^{1})$, e.g. $P(y^{2} \mid x, \text{"in"})$.

$P(y^{1}, y^{2} \mid x) = P(y^{1} \mid x) \cdot P(y^{2} \mid x, y^{1})$

  * Evaluate all 10000 options for the second word of each candidate
  * Keep **only** the 3 most likely combinations of first and second word; repeat this in every step
  * $B = 1$ => greedy search

==== Beam search refinements ====

Length normalization:
  * The probabilities are products of many numbers < 1 and quickly get near zero
  * Take logs (sum of log probabilities) for numerical stability
  * Long sentences become more unlikely simply because of the additional multiplications, so normalize by $1 / T_y^\alpha$

How to choose B?
  * Large: better result, but slower
  * Small: worse result, but faster
  * Production: 10 - 100
  * Research: 1000 - 3000

Unlike BFS or DFS, beam search is not guaranteed to find the exact maximum.

==== Error analysis in beam search ====

Attribute an error to the RNN or to beam search? The RNN computes $P(y \mid x)$, so compare P(good translation | x) and P(algorithm's translation | x):

  * Case 1: the good translation has the higher value: beam search still chose the algorithm's translation => beam search is at fault
  * Case 2: the algorithm's translation has the higher value: the RNN assigns the worse sentence a higher probability => the RNN is at fault

==== Bleu score ====

There are usually multiple, similarly good translations => compute a score for how good a translation is (see the clipped n-gram precision sketch at the end of this note).

==== Attention model ====

Work through the sentence part by part instead of having to remember complete long sentences in a single encoding.

  * Bidirectional RNN over the input
  * Attention weights determine the context vector that is fed into the second (decoder) RNN at each output step
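==== Code sketches ====

A minimal beam-search sketch in Python (not the course code; ''log_prob(prefix, x)'' is a hypothetical stand-in for the decoder RNN, returning log probabilities for every word in the vocabulary). It keeps the $B$ most likely partial sentences per step and applies length normalization with $1/T_y^\alpha$ when picking the final output.

<code python>
def log_prob(prefix, x):
    # Toy stand-in for the decoder RNN: given the source x and the tokens
    # generated so far, return log P(next word | x, prefix) for the vocabulary.
    return {"jane": -1.0, "is": -0.5, "visiting": -0.9,
            "going": -0.7, "<eos>": -1.2}

def beam_search(x, B=3, max_len=10, alpha=0.7):
    beams = [([], 0.0)]                  # (tokens so far, sum of log probs)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in log_prob(tokens, x).items():
                candidates.append((tokens + [tok], score + lp))
        # keep only the B most likely partial sentences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:B]:
            (finished if tokens[-1] == "<eos>" else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)
    # length normalization: divide the summed log probability by T_y^alpha so
    # long sentences are not penalized only for having more factors < 1
    return max(finished, key=lambda c: c[1] / (len(c[0]) ** alpha))

print(beam_search("Jane visite l'Afrique en septembre"))
</code>

With $B = 1$ this degenerates to greedy search; a larger $B$ explores more candidates at a higher cost.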
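A sketch of Bleu's modified (clipped) n-gram precision, assuming one candidate sentence and one or more reference translations; the full Bleu score combines these precisions for n = 1..4 with a brevity penalty.

<code python>
from collections import Counter

def modified_precision(candidate, references, n):
    # count the candidate's n-grams
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    # for each n-gram, the maximum count observed in any single reference
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i + n])
                             for i in range(len(ref) - n + 1))
        for ng, c in ref_counts.items():
            max_ref[ng] = max(max_ref[ng], c)
    # clip each candidate count so repeating a reference word gets no extra credit
    clipped = sum(min(c, max_ref[ng]) for ng, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

cand = "the the the cat".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision(cand, refs, 1))   # 0.75: "the" is clipped to count 2
</code>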
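A sketch of the attention step: the attention weights are a softmax over alignment scores, and the context vector fed into the decoder RNN is the weighted sum of the bidirectional encoder activations (shapes and scores here are made up for illustration).

<code python>
import numpy as np

def attention_context(a, e):
    # a: (Tx, features) encoder activations, e: (Tx,) alignment scores
    # attention weights: softmax over the input positions, so they sum to 1
    alphas = np.exp(e - e.max())
    alphas = alphas / alphas.sum()
    # context vector = attention-weighted sum of the encoder activations
    return alphas @ a, alphas

a = np.random.randn(5, 8)   # Tx = 5 input steps, 8 features per step
e = np.random.randn(5)      # one alignment score per input position
context, alphas = attention_context(a, e)
print(alphas.sum(), context.shape)   # 1.0, (8,)
</code>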