====== Recurrent Neural Networks ======

===== Notation =====

E.g. named entity recognition: 9 words => 9 features.

$x^{(i)<t>}$: t-th element of the i-th training example; $T_x^{(i)} = 9$ is the input length.

$y^{(i)<t>}$: t-th output of the i-th training example; $T_y^{(i)} = 9$ is the output length.

Representation of words in a sentence:

  * Vocabulary / dictionary vector (e.g. of size 10000)
  * One-hot representation for $x^{(i)<t>}$
  * Fake word in the dictionary for unknown words

===== Recurrent Neural Network =====

Problems with a simple input-output FCN model:

  * Inputs and outputs can have different lengths in different examples
  * Features learned at one position of the text are not shared across other positions

{{ :data_mining:neural_network:rnn.png?nolink&400 |}}

Problem with the basic RNN: only the words before the current word are used for prediction (solution: bidirectional RNN).

$a^{<0>} = 0$

$a^{<1>} = g(W_{aa} a^{<0>} + W_{ax} x^{<1>} + b_a)$ using tanh or ReLU as activation function

$\hat{y}^{<1>} = g(W_{ya} a^{<1>} + b_y)$ using sigmoid as activation function

===== RNN types =====

Example so far: many-to-many architecture with $T_x = T_y$.

Other types:

  * Many-to-one: sentiment classification, x = text, y = 1 or 0
  * One-to-one: like a normal ANN
  * One-to-many: music generation; $\hat{y}^{<t>}$ is used as input for the next time step
  * Many-to-many with $T_x \neq T_y$: machine translation; the encoder reads the input, the decoder generates the output
  * Attention-based architectures

===== Language model =====

What is the probability of a given sentence?

Training set: large corpus of text in the language.

  * Tokenize the sentence
  * One-hot encode the tokens
  * EOS token (end of sentence)
  * UNK token (unknown word)

$x^{<t>} = y^{<t-1>}$: given the known previous words, what is the probability of the next word?

==== Sampling a sequence ====

Sample a word from the softmax distribution $\hat{y}^{<t>}$ and feed it as input for the next time step.

===== Character level language model =====

Uses characters instead of words, but is computationally more expensive.

===== Vanishing gradient =====

Problem: basic RNNs are not very good at capturing long-term dependencies (e.g. agreement between singular/plural subjects and their verbs). **Outputs are influenced mostly by nearby inputs**.

Exploding gradients can also happen, but they are easier to spot (=> NaN values). Solution: **gradient clipping** (if the gradient exceeds a threshold, clip it). Solutions for the vanishing gradient problem follow in the next sections.

==== Gated Recurrent Unit ====

Improves **capturing of long-term dependencies** and mitigates the **vanishing gradient** problem.

Example sentence (with a singular/plural dependency): The cat, which already ate ..., was full.

$c^{<t>}$ = memory cell; in the GRU, $c^{<t>} = a^{<t>}$.

Candidate for replacing $c^{<t>}$: $\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r * c^{<t-1>}, x^{<t>}] + b_c)$

Update gate: $\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$

Relevance gate: $\Gamma_r = \sigma(W_r[c^{<t-1>}, x^{<t>}] + b_r)$

$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$

$\Gamma_u$ determines when the candidate replaces the memory cell; when the update gate is very close to 0, the value stays nearly unchanged, which lets information survive over many time steps.

==== LSTM ====

In the LSTM, $c^{<t>} \neq a^{<t>}$.

Candidate for replacing $c^{<t>}$: $\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$

Update gate: $\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$

Forget gate: $\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$

Output gate: $\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$

$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$ (forget or update)

$a^{<t>} = \Gamma_o * \tanh(c^{<t>})$

Peephole connection: the previous memory cell $c^{<t-1>}$ also affects the gate values.
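A minimal NumPy sketch of a single LSTM forward step following the equations above; the parameter names (''Wc'', ''Wu'', ''Wf'', ''Wo'' and the biases) and the dimensions ''n_a'', ''n_x'' are illustrative assumptions, not part of the original notes.

<code python>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, params):
    """One LSTM time step.

    a_prev: (n_a, 1) previous hidden state a^<t-1>
    c_prev: (n_a, 1) previous memory cell c^<t-1>
    x_t:    (n_x, 1) current input x^<t>
    params: weights W* of shape (n_a, n_a + n_x), biases b* of shape (n_a, 1)
    """
    concat = np.concatenate([a_prev, x_t], axis=0)           # [a^<t-1>, x^<t>]

    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])  # candidate memory
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])  # update gate
    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])  # forget gate
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])  # output gate

    c_t = gamma_u * c_tilde + gamma_f * c_prev                # forget or update
    a_t = gamma_o * np.tanh(c_t)                              # new hidden state
    return a_t, c_t

# Tiny usage example with random parameters (n_a = 4 hidden units, n_x = 3 inputs)
n_a, n_x = 4, 3
rng = np.random.default_rng(0)
params = {name: 0.1 * rng.standard_normal((n_a, n_a + n_x))
          for name in ("Wc", "Wu", "Wf", "Wo")}
params.update({name: np.zeros((n_a, 1)) for name in ("bc", "bu", "bf", "bo")})
a, c = np.zeros((n_a, 1)), np.zeros((n_a, 1))
a, c = lstm_step(a, c, rng.standard_normal((n_x, 1)), params)
</code>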
===== Bidirectional RNN =====

Takes information from the sequence to the right of the current word as well: a forward pass runs to the last unit, then a backward pass runs from there back to the start. This is an acyclic graph with two activations per step (forward and backward), and the prediction at each step uses both.

Activation blocks can be GRU or LSTM units.

===== Deep RNN =====
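Deep RNNs stack several recurrent layers: the activation sequence of one layer is the input sequence of the next. A minimal sketch with basic RNN cells (the layer sizes, parameter names and the helper ''make_layer'' are illustrative assumptions):

<code python>
import numpy as np

def rnn_step(a_prev, x_t, Waa, Wax, ba):
    """Basic RNN step: a^<t> = tanh(Waa a^<t-1> + Wax x^<t> + ba)."""
    return np.tanh(Waa @ a_prev + Wax @ x_t + ba)

def deep_rnn_forward(x_seq, layers):
    """Run a stack of RNN layers over a sequence.

    x_seq:  list of T inputs, each of shape (n_x, 1)
    layers: list of dicts with keys Waa, Wax, ba (one dict per layer)
    The activation sequence of layer l is the input sequence of layer l+1.
    """
    seq = x_seq
    for layer in layers:
        a = np.zeros_like(layer["ba"])   # a^<0> = 0 for this layer
        out = []
        for x_t in seq:
            a = rnn_step(a, x_t, layer["Waa"], layer["Wax"], layer["ba"])
            out.append(a)
        seq = out                        # feed activations to the next layer
    return seq                           # activations of the top layer

# Usage: 3 stacked layers, hidden size 5, input size 4, sequence length T = 6
rng = np.random.default_rng(1)
def make_layer(n_a, n_in):
    return {"Waa": 0.1 * rng.standard_normal((n_a, n_a)),
            "Wax": 0.1 * rng.standard_normal((n_a, n_in)),
            "ba": np.zeros((n_a, 1))}
layers = [make_layer(5, 4), make_layer(5, 5), make_layer(5, 5)]
x_seq = [rng.standard_normal((4, 1)) for _ in range(6)]
top_activations = deep_rnn_forward(x_seq, layers)
</code>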