
Recurrent Neural Networks

Notation

E.g. named entity recognition

A 9-word example sentence ⇒ 9 input features $x^{<1>},\dots,x^{<9>}$ and 9 output labels $y^{<1>},\dots,y^{<9>}$

$x^{(i)<t>}$: t-th element of the i-th training example; $T_x^{(i)} = 9$ is its input length

$y^{(i)<t>}$: t-th output label of the i-th training example; $T_y^{(i)} = 9$ is its output length

Representation of words in a sentence:

Vocabulary / dictionary of fixed size (e.g. 10,000 words)

One-hot representation for $x^{(i)<t>}$

Fake token (e.g. <UNK>) in the dictionary for unknown words
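
A minimal sketch of the one-hot encoding (the toy vocabulary and the <UNK> token name are assumptions for illustration, not from the notes):

```python
import numpy as np

# Toy vocabulary; a real dictionary has e.g. 10,000 entries plus an <UNK> token.
vocab = ["a", "cat", "harry", "potter", "the", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a one-hot column vector; unknown words map to the <UNK> index."""
    idx = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vec = np.zeros((len(word_to_index), 1))
    vec[idx] = 1.0
    return vec

x_t = one_hot("Harry", word_to_index)    # x^{(i)<t>} for one word of a sentence
print(x_t.ravel())                       # [0. 0. 1. 0. 0. 0.]
```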

Recurrent Neural Network

Problems with a simple fully connected input-output model: inputs and outputs can have different lengths across examples, and features learned at one text position are not shared with other positions

Problem with the (unidirectional) RNN: only words before the current word are used for the prediction (solution: bidirectional RNN)

$a^{<0>} = \vec{0}$

$a^{<1>}=g(W_{aa} a^{<0>} + W_{ax} x^{<1>} + b_a)$ using tanh or ReLU as activation function

$\hat{y}^{<1>}=g(W_{ya} a^{<1>} + b_y)$ using sigmoid as activation function
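
A minimal numpy sketch of one forward step, mirroring the equations above (the dimensions and the small random initialisation are assumptions):

```python
import numpy as np

n_a, n_x, n_y = 5, 6, 1   # assumed sizes: hidden units, vocabulary size, output size

rng = np.random.default_rng(0)
W_aa = rng.standard_normal((n_a, n_a)) * 0.01
W_ax = rng.standard_normal((n_a, n_x)) * 0.01
W_ya = rng.standard_normal((n_y, n_a)) * 0.01
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(a_prev, x_t):
    """a^{<t>} = tanh(W_aa a^{<t-1>} + W_ax x^{<t>} + b_a); y_hat^{<t>} = sigmoid(W_ya a^{<t>} + b_y)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = sigmoid(W_ya @ a_t + b_y)
    return a_t, y_hat_t

a = np.zeros((n_a, 1))                    # a^{<0>} = 0
x_1 = np.zeros((n_x, 1)); x_1[2] = 1.0    # one-hot input x^{<1>}
a, y_hat = rnn_step(a, x_1)
```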

RNN types

Example so far: many-to-many architecture with $T_x = T_y$

Other types: one-to-many (e.g. music generation), many-to-one (e.g. sentiment classification), many-to-many with $T_x \neq T_y$ (e.g. machine translation)

Language model

What is the probability of a given sentence?

Training set: Large corpus of text in the language

$x^{<t>} = y^{<t-1>}$ (the input at step t is the previous true word of the sentence)

Given the previous words, what is the probability of the next word?
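
In equation form (standard chain-rule factorisation; each factor is one softmax output of the RNN):

$P(y^{<1>}, \ldots, y^{<T_y>}) = \prod_{t=1}^{T_y} P(y^{<t>} \mid y^{<1>}, \ldots, y^{<t-1>})$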

Sampling a sequence

At each time step, sample a word from the softmax distribution $\hat{y}^{<t>}$ and feed it back as the next input $x^{<t+1>}$; stop at an <EOS> token or after a fixed length
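
A minimal sampling sketch (the helper `demo_step`, the tiny untrained weights, and the <EOS> index are assumptions for illustration):

```python
import numpy as np

def sample_sequence(step_fn, n_a, vocab_size, eos_index, max_len=50, seed=0):
    """Sample a sequence of word indices from a language model.

    step_fn(a_prev, x_t) is assumed to return (a_t, y_hat_t), where y_hat_t is
    the softmax distribution over the vocabulary with shape (vocab_size, 1).
    """
    rng = np.random.default_rng(seed)
    a = np.zeros((n_a, 1))
    x = np.zeros((vocab_size, 1))          # start with x^{<1>} = 0 vector
    indices = []
    for _ in range(max_len):
        a, y_hat = step_fn(a, x)
        idx = rng.choice(vocab_size, p=y_hat.ravel())  # sample from the softmax distribution
        indices.append(int(idx))
        if idx == eos_index:               # stop when <EOS> is sampled
            break
        x = np.zeros((vocab_size, 1))      # feed the sampled word back in as x^{<t+1>}
        x[idx] = 1.0
    return indices

# Tiny untrained demo model (random weights, so the samples are gibberish).
n_a, V = 8, 6
rng = np.random.default_rng(1)
Waa, Wax, Wya = (rng.standard_normal(s) * 0.1 for s in [(n_a, n_a), (n_a, V), (V, n_a)])

def demo_step(a_prev, x_t):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t)
    z = Wya @ a_t
    y_hat = np.exp(z - z.max())            # softmax
    return a_t, y_hat / y_hat.sum()

print(sample_sequence(demo_step, n_a, V, eos_index=0))
```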

Character level language model

Use characters instead of words as the vocabulary tokens.

But sequences become much longer, so training is more computationally expensive and long-range dependencies are harder to capture.

Vanishing gradient

Problem: basic RNNs are not very good at capturing long-term dependencies (e.g. agreement between singular/plural subjects and verbs that are far apart). Outputs are influenced much more strongly by nearby inputs.

Exploding gradients can also happen, but they are easier to spot ⇒ NaN values in the parameters.

Solution for exploding gradients: gradient clipping (if the gradient exceeds a threshold, clip or rescale it).
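
A minimal clipping sketch, assuming the gradients are stored in a dict of numpy arrays; the threshold value is arbitrary:

```python
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    """Element-wise gradient clipping: every entry is clipped to [-max_value, max_value].

    gradients is assumed to be a dict of numpy arrays (e.g. dW_aa, dW_ax, db_a).
    An alternative is norm-based clipping: rescale all gradients when the global
    L2 norm exceeds a threshold.
    """
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}

grads = {"dW_ax": np.array([[12.0, -0.3], [4.0, -9.0]])}
print(clip_gradients(grads)["dW_ax"])   # [[ 5.  -0.3]  [ 4.  -5. ]]
```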

Solutions for the vanishing gradient problem follow in the next sections (GRU, LSTM).

Gated Recurrent Unit

Improves capturing of long-term dependencies and mitigates the vanishing gradient problem.

Example sentence (with a singular/plural dependency): The cat, which already ate …, was full.

$c^{<t>}$ = memory cell

$c^{<t>}$ = $a^{<t>}$ for now

Candidate for replacing $c^{<t>}$ : $\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r * c^{<t-1>}, x^{<t>}]+ b_c)$

Update gate $\Gamma_u=\sigma(W_u[c^{<t-1>}, x^{<t>}]+ b_u)$

Relevance gate $\Gamma_r=\sigma(W_r[c^{<t-1>}, x^{<t>}]+ b_r)$

$c^{<t>}$ = $\Gamma_u * \tilde{c}^{<t>} + (1-\Gamma_u) * c^{<t-1>}$

$\Gamma_u$ determines when the candidate $\tilde{c}^{<t>}$ replaces the memory cell

When the update gate is very close to 0, the memory cell value remains nearly the same across many time steps, which counteracts the vanishing gradient
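
A minimal numpy sketch of one GRU step following the equations above (parameter shapes and the toy initialisation are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, params):
    """One GRU step following the equations above (here c^{<t>} = a^{<t>}).

    params is assumed to hold W_u, W_r, W_c and biases b_u, b_r, b_c, each acting
    on the concatenation of c^{<t-1>} (or Gamma_r * c^{<t-1>}) with x^{<t>}.
    """
    concat = np.vstack([c_prev, x_t])                            # [c^{<t-1>}, x^{<t>}]
    gamma_u = sigmoid(params["W_u"] @ concat + params["b_u"])    # update gate
    gamma_r = sigmoid(params["W_r"] @ concat + params["b_r"])    # relevance gate
    concat_r = np.vstack([gamma_r * c_prev, x_t])
    c_tilde = np.tanh(params["W_c"] @ concat_r + params["b_c"])  # candidate memory
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev           # replace or keep memory
    return c_t

# Toy initialisation (shapes only; values are arbitrary).
n_c, n_x = 4, 3
rng = np.random.default_rng(0)
params = {k: rng.standard_normal((n_c, n_c + n_x)) * 0.1 for k in ["W_u", "W_r", "W_c"]}
params.update({b: np.zeros((n_c, 1)) for b in ["b_u", "b_r", "b_c"]})
c = gru_step(np.zeros((n_c, 1)), rng.standard_normal((n_x, 1)), params)
```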

LSTM

$c^{<t>} \neq a^{<t>}$ (memory cell and activation are separate in the LSTM)

Candidate for replacing $c^{<t>}$ : $\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}]+ b_c)$

Update gate $\Gamma_u=\sigma(W_u[a^{<t-1>}, x^{<t>}]+ b_u)$

Forget gate $\Gamma_f=\sigma(W_f[a^{<t-1>}, x^{<t>}]+ b_f)$

Output gate $\Gamma_o=\sigma(W_o[a^{<t-1>}, x^{<t>}]+ b_o)$

$c^{<t>}$ = $\Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$

(separate forget gate instead of the GRU's $1-\Gamma_u$: the cell can be forgotten and updated independently)

$a^{<t>}$ = $\Gamma_o * \tanh(c^{<t>})$

Peephole connection: $c^{<t-1>}$ is also fed into the gate computations
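
A minimal numpy sketch of one LSTM step following the equations above (parameter shapes and the toy initialisation are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, params):
    """One LSTM step following the equations above.

    params is assumed to hold W_u, W_f, W_o, W_c and biases b_u, b_f, b_o, b_c,
    each acting on the concatenation [a^{<t-1>}, x^{<t>}].
    """
    concat = np.vstack([a_prev, x_t])                           # [a^{<t-1>}, x^{<t>}]
    gamma_u = sigmoid(params["W_u"] @ concat + params["b_u"])   # update gate
    gamma_f = sigmoid(params["W_f"] @ concat + params["b_f"])   # forget gate
    gamma_o = sigmoid(params["W_o"] @ concat + params["b_o"])   # output gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate memory
    c_t = gamma_u * c_tilde + gamma_f * c_prev                  # forget or update
    a_t = gamma_o * np.tanh(c_t)                                # hidden activation
    return a_t, c_t

# Toy initialisation (shapes only; values are arbitrary).
n_a, n_x = 4, 3
rng = np.random.default_rng(0)
params = {k: rng.standard_normal((n_a, n_a + n_x)) * 0.1 for k in ["W_u", "W_f", "W_o", "W_c"]}
params.update({b: np.zeros((n_a, 1)) for b in ["b_u", "b_f", "b_o", "b_c"]})
a, c = lstm_step(np.zeros((n_a, 1)), np.zeros((n_a, 1)), rng.standard_normal((n_x, 1)), params)
```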

Bidirectional RNN

Takes information from later parts of the sequence (to the right of the current word) into account as well

Forward pass to the last unit and a backward pass from there; the network is an acyclic graph with two activations per time step ($\overrightarrow{a}^{<t>}$ and $\overleftarrow{a}^{<t>}$).
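
The prediction at each step then uses both activations (standard bidirectional formulation, not written out in the original notes): $\hat{y}^{<t>} = g(W_y[\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_y)$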

Activation blocks can be GRU or LSTM

Deep RNN