====== Word embeddings ======

===== Basics =====

==== Analogies ====

Man -> Woman is like King -> ?

Example: 4-dim embedding (gender, royal, age, food):
  * $e_{man} - e_{woman} \approx (-2, 0, 0, 0)^T$
  * $e_{king} - e_{queen} \approx (-2, 0, 0, 0)^T$

Goal: Find the word $w$ that maximizes $sim(e_w, e_{king} - e_{man} + e_{woman})$.

Cosine similarity is often used as the similarity function: $sim(u,v) = \frac{u^T v}{||u||_2 ||v||_2}$

==== Embedding matrix ====

Dimensions: 10000 x 300
  * Dictionary with 10000 entries
  * 300 features per word

Embedding vector obtained from the one-hot encoding $o_j$: $e_j = E^T o_j$, i.e. row $j$ of $E$.

Goal: Learn the embedding matrix $E$. In practice the matrix multiplication is replaced by a direct lookup, e.g. an embedding layer in Keras.

===== Algorithms =====

==== Neural language model ====

Given 4 words in sequence, predict the next word (using $E$ as a parameter). Maximize the likelihood with gradient descent.

Other contexts can also be used to learn a **word embedding**:
  * 4 words on the left and right
  * Last 1 word
  * Nearby 1 word ("skip-gram")

==== Word2Vec ====

Context and target: "I want a glass of orange juice to go along with my cereal."

Context: orange. Pick the target by chance within a window: juice, glass, ...

Model:
  * Vocab size = 10000
  * Learn context c ("orange") => target t ("juice")
  * $o_c \rightarrow E \rightarrow e_c \rightarrow softmax \rightarrow \hat{y}$
  * Softmax with parameter $\Theta_t$: $p(t|c) = \frac{e^{\Theta_t^T e_c}}{\sum_{j=1}^{10000} e^{\Theta_j^T e_c}}$
  * $L(\hat{y},y) = - \sum^{10000}_{i=1} y_i \log \hat{y}_i$
  * $y$ is a one-hot vector (10000-dim)

Problem with softmax classification: slow, because of the sum over the whole vocabulary.

Solution: Hierarchical softmax: a tree of binary classifiers with cost $O(\log |V|)$. Common words sit near the top, so it is not a balanced tree.

=== How to sample the context c? ===

When sampling uniformly at random, frequent words like "the, of, a, ..." dominate. Heuristics are used for sampling instead.

==== Negative sampling ====

Generate a data set:
  * Pick 1 positive example (context word and an actual target from the window): label = 1
  * Pick k negative examples: random words from the dictionary that are not associated with the context word: label = 0
  * Sampling heuristic between the uniform and the observed distribution: $P(w_i) = \frac{f(w_i)^{3/4}}{\sum_j f(w_j)^{3/4}}$

This turns the softmax into 10000 binary classification problems, of which only $k+1$ are trained per example.

==== GloVe word vectors ====

Global vectors for word representation.

$x_{ij}$: number of times word $i$ appears in the context of word $j$.

Minimize $\sum_{i=1}^{10000} \sum_{j=1}^{10000} f(x_{ij}) (\Theta_i^{T} e_j + b_i + b'_j - \log x_{ij})^2$

Weighting term $f(x_{ij})$: $f(0) = 0$ (pairs that never co-occur are skipped) and it balances the weight of frequent and infrequent words.

$e^{final}_w = \frac{e_w + \Theta_w}{2}$

===== Application =====

==== Sentiment classification ====

=== Simple model ===

  * Extract the embedding vector for each word
  * Sum or average those vectors
  * Pass to a softmax to get the output (1-5 stars)

Problem: doesn't take the order/sequence of words into account.

=== RNN for sentiment classification ===

  * Extract the embedding vector for each word
  * Feed into an RNN with a softmax output

===== Debiasing word embeddings =====

Word embeddings pick up the bias present in the text they are trained on.

Addressing bias in word embeddings:
  - Identify the bias direction (e.g. gender): average difference vectors such as $e_{he} - e_{she}$
  - Neutralize: for every word that is not definitional (has no legitimate gender component), project out the bias component
  - Equalize pairs: the only remaining difference should be gender (e.g. grandfather vs. grandmother); make both equidistant from the neutral axis
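
===== Code sketches =====

Minimal, illustrative Python/numpy sketches of the ideas above; function names, toy values and hyperparameters are chosen for illustration and are not authoritative implementations.

**Analogies.** A sketch of the analogy search with cosine similarity; the 4-dim toy embeddings and the helper names (`cosine_sim`, `complete_analogy`) are made up for illustration.

<code python>
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: u^T v / (||u||_2 ||v||_2)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, embeddings):
    """Find the word w maximizing sim(e_w, e_c - e_a + e_b), e.g. man -> woman, king -> ?"""
    target = embeddings[word_c] - embeddings[word_a] + embeddings[word_b]
    best_word, best_sim = None, -np.inf
    for w, e_w in embeddings.items():
        if w in (word_a, word_b, word_c):   # skip the input words themselves
            continue
        s = cosine_sim(e_w, target)
        if s > best_sim:
            best_word, best_sim = w, s
    return best_word

# Toy 4-dim embeddings (gender, royal, age, food), roughly matching the example above.
embeddings = {
    "man":   np.array([-1.00, 0.01, 0.03, 0.09]),
    "woman": np.array([ 1.00, 0.02, 0.02, 0.01]),
    "king":  np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen": np.array([ 0.97, 0.95, 0.69, 0.01]),
}
print(complete_analogy("man", "woman", "king", embeddings))  # -> "queen"
</code>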
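**Embedding matrix.** A sketch showing that multiplying the one-hot vector with $E^T$ is the same as looking up row $j$ of $E$; the index `j` is arbitrary. In Keras this lookup is what `keras.layers.Embedding(input_dim=10000, output_dim=300)` performs.

<code python>
import numpy as np

vocab_size, embed_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))   # embedding matrix E, 10000 x 300

j = 123                                        # arbitrary word index
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                                   # one-hot encoding o_j

e_j_matmul = E.T @ o_j    # formal definition: e_j = E^T o_j
e_j_lookup = E[j]         # what an embedding layer actually does: row lookup

assert np.allclose(e_j_matmul, e_j_lookup)
</code>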
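**Word2Vec / negative sampling.** A sketch of generating (context, target) pairs within a window and drawing $k$ negative examples with the $f(w)^{3/4}$ sampling heuristic; the tiny vocabulary from the example sentence and the helper names (`skipgram_pairs`, `negative_samples`) are illustrative only.

<code python>
import numpy as np

rng = np.random.default_rng(0)
sentence = "I want a glass of orange juice to go along with my cereal".lower().split()
vocab = sorted(set(sentence))

def skipgram_pairs(tokens, window=4):
    """(context, target) pairs: targets come from a +/- `window` word neighbourhood."""
    pairs = []
    for i, context in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((context, tokens[j]))
    return pairs

def negative_samples(context, positive_target, tokens, k=4):
    """1 positive example (label 1) plus k negatives (label 0), sampled with
    P(w) proportional to f(w)^(3/4), between uniform and observed frequencies."""
    counts = np.array([tokens.count(w) for w in vocab], dtype=float)
    p = counts ** 0.75
    p /= p.sum()
    examples = [(context, positive_target, 1)]
    while len(examples) < k + 1:
        neg = vocab[rng.choice(len(vocab), p=p)]
        if neg != positive_target:
            examples.append((context, neg, 0))
    return examples

pairs = skipgram_pairs(sentence)
print(negative_samples("orange", "juice", sentence, k=4))
</code>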
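**GloVe.** A direct, unoptimized evaluation of the weighted least-squares objective; the weighting function shown is the capped $(x/x_{max})^{3/4}$ form from the original GloVe paper, and the co-occurrence counts are random toy data.

<code python>
import numpy as np

def glove_loss(X, theta, e, b, b_prime, x_max=100.0, alpha=0.75):
    """sum_ij f(X_ij) (theta_i^T e_j + b_i + b'_j - log X_ij)^2,
    with f(0) = 0 so pairs that never co-occur are skipped."""
    loss = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:
                continue                                   # f(0) = 0
            f = min((X[i, j] / x_max) ** alpha, 1.0)       # caps very frequent pairs
            err = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            loss += f * err ** 2
    return loss

# Toy setup; after training one would use e_w_final = (e_w + theta_w) / 2,
# since the two roles are symmetric.
V, d = 50, 10
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(V, V)).astype(float)          # toy co-occurrence counts
theta, e = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_prime = np.zeros(V), np.zeros(V)
print(glove_loss(X, theta, e, b, b_prime))
</code>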
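**Sentiment classification.** A sketch of the two models in Keras (embedding + average + softmax, and embedding + LSTM + softmax); layer sizes such as the 128 LSTM units are arbitrary choices.

<code python>
from tensorflow import keras

vocab_size, embed_dim, num_classes = 10000, 300, 5   # output: 1-5 stars

# Simple model: average the word embeddings, then a softmax classifier.
# Ignores word order, so many positive words in a negative review can fool it.
simple_model = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(num_classes, activation="softmax"),
])

# RNN model: feed the embedding sequence into an LSTM, softmax on the last state.
rnn_model = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    keras.layers.LSTM(128),
    keras.layers.Dense(num_classes, activation="softmax"),
])

simple_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
rnn_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
</code>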
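**Debiasing.** A sketch of the three steps, assuming a unit-length bias direction; the equalize step here is a simplified version that only enforces "same bias-free part, equal and opposite components along the bias direction", not the full published formula.

<code python>
import numpy as np

def bias_direction(embeddings, pairs=(("he", "she"), ("male", "female"))):
    """Average the difference vectors of a few gendered pairs, e.g. e_he - e_she."""
    diffs = [embeddings[a] - embeddings[b] for a, b in pairs]
    g = np.mean(diffs, axis=0)
    return g / np.linalg.norm(g)

def neutralize(e, g):
    """Project out the bias component from a non-definitional word (e.g. "doctor")."""
    return e - (e @ g) * g

def equalize(e1, e2, g):
    """Make a definitional pair (e.g. grandmother / grandfather) differ only along
    the bias direction and be equidistant from the neutral axis."""
    mu = (e1 + e2) / 2
    mu_orth = mu - (mu @ g) * g       # shared, bias-free part
    r = ((e1 - e2) @ g) / 2           # signed bias magnitude to keep
    return mu_orth + r * g, mu_orth - r * g

# Usage with random toy embeddings.
rng = np.random.default_rng(1)
words = ["he", "she", "male", "female", "doctor", "grandmother", "grandfather"]
emb = {w: rng.normal(size=50) for w in words}
g = bias_direction(emb)
emb["doctor"] = neutralize(emb["doctor"], g)
emb["grandmother"], emb["grandfather"] = equalize(emb["grandmother"], emb["grandfather"], g)
</code>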