Word embeddings

Man → Woman is like King → ?

Example: 4-dimensional embedding (gender, royal, age, food):

  • $e_{man} - e_{woman} \approx (-2, 0, 0, 0)^T$
  • $e_{king} - e_{queen} \approx (-2, 0, 0, 0)^T$

Goal: Find the word $w$ that maximizes $sim(e_w, e_{king} - e_{man} + e_{woman})$

Cosine similarity is often used as the similarity function:

$sim(u,v) = \frac{u^T v}{||u||_2 ||v||_2}$
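The analogy search above can be sketched in a few lines of plain Python. The embedding values below are made up for illustration (4 dimensions: gender, royal, age, food), and the helper names `cosine_similarity` and `analogy` are hypothetical. The input words are excluded from the candidates, which is a common convention rather than something stated in these notes.

```python
import math

# Toy 4-dimensional embeddings (gender, royal, age, food); values are invented.
embeddings = {
    "man":   [-1.00, 0.01, 0.03, 0.09],
    "woman": [1.00, 0.02, 0.02, 0.01],
    "king":  [-0.95, 0.93, 0.70, 0.02],
    "queen": [0.97, 0.95, 0.69, 0.01],
    "apple": [0.00, -0.01, 0.03, 0.95],
}

def cosine_similarity(u, v):
    """sim(u, v) = u^T v / (||u||_2 * ||v||_2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, embeddings):
    """Return the word w maximizing sim(e_w, e_c - e_a + e_b)."""
    target = [ec - ea + eb
              for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Assumption: the three input words are not valid answers.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

result = analogy("man", "woman", "king", embeddings)
print(result)
```

With these toy values the search returns "queen", because the target vector differs from $e_{queen}$ almost only by noise.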

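Embedding matrix $E$: dimensions 300 x 10000 (so that $E \cdot o_j$ is defined for a 10000-dimensional $o_j$)

  • Dictionary with 10000 entries, one column per word
  • 300 features per word

The embedding vector of word $j$ is obtained by multiplying $E$ with the one-hot vector $o_j$: $E \cdot o_j = e_j$. In practice a column lookup replaces the multiplication, since $o_j$ is zero everywhere except at position $j$.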
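A minimal sketch of why the one-hot product is just a column lookup. The sizes are shrunk to 3 features and 5 words (instead of 300 and 10000), the numbers are arbitrary, and the helper names are hypothetical.

```python
# Toy embedding matrix E: 3 features (rows) x 5 vocabulary words (columns).
E = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.1, 1.2, 1.3, 1.4, 1.5],
    [2.1, 2.2, 2.3, 2.4, 2.5],
]

def one_hot(j, size):
    """One-hot vector o_j with a 1 at position j."""
    return [1.0 if i == j else 0.0 for i in range(size)]

def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lookup(matrix, j):
    """Read column j directly; same result as E * o_j without the multiplications."""
    return [row[j] for row in matrix]

j = 3
e_j_product = matvec(E, one_hot(j, 5))
e_j_lookup = lookup(E, j)
print(e_j_product, e_j_lookup)
```

Both paths give the same vector; the lookup avoids multiplying by thousands of zeros, which is why embedding layers are implemented as table lookups.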

Goal: Learn embedding matrix $E$.

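In Keras, $E$ corresponds to the weights of an Embedding layer.

Neural language model: given 4 words in sequence, predict the next word. $E$ is a parameter of the model and is learned together with the other weights.

Train by maximizing the likelihood with gradient descent.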

Other contexts can also be used to learn a word embedding:

  • 4 words on the left and right
  • Last 1 word
  • Nearby 1 word (“skip-gram”)

Context and Target

“I want a glass of orange juice to go along with my cereal.”

Context: orange

Pick the target at random within a window around the context word: juice or glass or …
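A rough sketch of how such (context, target) pairs could be drawn from the example sentence. The window size of 2 and the function names are assumptions for illustration; the notes do not fix a window size. The context word is picked uniformly here, which is the naive choice.

```python
import random

sentence = "I want a glass of orange juice to go along with my cereal".split()

def window_indices(tokens, c, window):
    """Positions within +/- window of position c, excluding c itself."""
    lo = max(0, c - window)
    hi = min(len(tokens) - 1, c + window)
    return [i for i in range(lo, hi + 1) if i != c]

def sample_pair(tokens, window, rng):
    """Pick a context position uniformly, then a target uniformly inside its window."""
    c = rng.randrange(len(tokens))
    t = rng.choice(window_indices(tokens, c, window))
    return tokens[c], tokens[t]

# All targets that could be drawn for the context word "orange" (window = 2).
c_orange = sentence.index("orange")
orange_targets = [sentence[i] for i in window_indices(sentence, c_orange, 2)]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
pairs = [sample_pair(sentence, 2, rng) for _ in range(5)]
print(orange_targets)
print(pairs)
```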


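  • Vocabulary size = 10,000
  • Learn a mapping from context c (“orange”) to target t (“juice”)
  • $o_c \rightarrow E \rightarrow e_c \rightarrow \text{softmax} \rightarrow \hat{y}$
  • The softmax has one parameter vector $\Theta_t$ per target word: $p(t|c) = \frac{e^{\Theta_t^T e_c}}{\sum_{j=1}^{10000} e^{\Theta_j^T e_c}}$
  • $L(\hat{y},y) = - \sum^{10000}_{i=1} y_i \log \hat{y}_i$
  • $y$ is a one-hot vector (10000-dim)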
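The forward pass and loss from the list above, sketched with a 4-word vocabulary and 2-dimensional embeddings. All numbers are arbitrary and untrained, and the function names are hypothetical. Because $y$ is one-hot, the sum in the loss collapses to a single term.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def skipgram_probs(theta, e_c):
    """p(t | c) for every target t: softmax over the scores theta_t^T e_c."""
    scores = [sum(a * b for a, b in zip(theta_t, e_c)) for theta_t in theta]
    return softmax(scores)

def cross_entropy(y_hat, target_index):
    """-sum_i y_i log y_hat_i; with one-hot y only the target term survives."""
    return -math.log(y_hat[target_index])

# One parameter vector theta_t per vocabulary word (toy values).
theta = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8], [0.0, 0.0]]
e_c = [0.5, 1.0]  # embedding of the context word

y_hat = skipgram_probs(theta, e_c)
loss = cross_entropy(y_hat, 1)
print(y_hat, loss)
```

Note that `softmax` touches every vocabulary entry, which is the cost the notes refer to when the vocabulary has 10000 words.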

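Problem with softmax classification: slow, because the denominator sums over the entire vocabulary for every prediction.

Solution: Hierarchical softmax, a tree of binary classifiers, so the cost per prediction scales with $\log |v|$ instead of the vocabulary size $|v|$. Common words sit near the top; the tree is not balanced.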

How to sample context c?

When sampling uniformly at random from the corpus, frequent words like “the, of, a, …” dominate.

Heuristics are used to balance the sampling between frequent and less frequent words.

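Generate data set (negative sampling)

  • Pick 1 positive example: a context word and a true target (target = 1)
  • Pick k negative examples
    • Choose random words from the dictionary that are not associated with the context word: target = 0
    • Sampling heuristic: a distribution between the uniform and the observed (empirical frequency) distribution

This yields 10000 binary classification problems instead of one 10000-way softmax; only k + 1 of them are updated per training example.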
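A sketch of the data generation step. The notes only say the sampling heuristic lies between the uniform and the observed distribution; the exponent 0.75 used below is the value from the word2vec paper and is an assumption here. The word counts and function names are invented for illustration.

```python
import random

def negative_sampling_distribution(counts, power=0.75):
    """P(w) proportional to count(w) ** power (power = 0.75 is assumed, see lead-in)."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}

def make_examples(context, target, counts, k, rng):
    """One positive pair (label 1) followed by k sampled negatives (label 0)."""
    dist = negative_sampling_distribution(counts)
    # Simplification: exclude the context and the true target from the negatives.
    words = [w for w in dist if w not in (context, target)]
    probs = [dist[w] for w in words]
    negatives = rng.choices(words, weights=probs, k=k)
    return [(context, target, 1)] + [(context, w, 0) for w in negatives]

# Invented corpus counts.
counts = {"the": 1000, "of": 600, "orange": 20, "juice": 15, "king": 10, "book": 5}
dist = negative_sampling_distribution(counts)
examples = make_examples("orange", "juice", counts, k=4, rng=random.Random(0))
print(examples)
```

With these counts, "the" is sampled less often than its observed frequency but still far more often than under a uniform distribution, which is the intended compromise.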

Global vectors for word representation (GloVe)

$x_{ij}$: Number of times i appears in the context of j

Minimize $\sum_{i=1}^{10000} \sum_{j=1}^{10000} f(x_{ij}) (\Theta_i^{T} e_j + b_i + b'_j - \log x_{ij})^2$

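Weighting term $f(x_{ij})$: $f(x_{ij}) = 0$ if $x_{ij} = 0$ (so $\log 0$ never contributes); it also keeps very frequent words from dominating and infrequent words from being ignored.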

$e^{final}_w = \frac{e_w + \Theta_w}{2}$
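A small sketch of the objective and the final averaging step. Both bias terms are added, as in the published GloVe objective. The concrete weighting function (exponent 0.75, cap at $x_{max} = 100$) follows the GloVe paper and is an assumption here, since the notes do not spell out $f$. The co-occurrence counts and function names are invented.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Assumed weighting f: 0 for x = 0, (x / x_max) ** alpha below x_max, else 1."""
    if x == 0:
        return 0.0
    return min(1.0, (x / x_max) ** alpha)

def glove_loss(X, theta, e, b, b_prime):
    """sum_ij f(x_ij) * (theta_i^T e_j + b_i + b'_j - log x_ij) ** 2."""
    total = 0.0
    n = len(X)
    for i in range(n):
        for j in range(n):
            if X[i][j] == 0:
                continue  # f(0) = 0, so the term vanishes and log 0 is never evaluated
            dot = sum(t * v for t, v in zip(theta[i], e[j]))
            diff = dot + b[i] + b_prime[j] - math.log(X[i][j])
            total += weight(X[i][j]) * diff ** 2
    return total

def final_embedding(e_w, theta_w):
    """Average of the two learned vectors for a word."""
    return [(a + t) / 2 for a, t in zip(e_w, theta_w)]

# Two-word toy problem with all parameters at zero.
X = [[0, 2], [2, 0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
loss = glove_loss(X, zeros, zeros, [0.0, 0.0], [0.0, 0.0])
print(loss, final_embedding([1.0, 3.0], [3.0, 5.0]))
```

Averaging $e_w$ and $\Theta_w$ is reasonable because the two play symmetric roles in the objective.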

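Simple model (sentiment classification)

  • Extract the embedding vector for each word
  • Sum or average those vectors
  • Pass the result to a softmax to obtain the output (1-5 stars)

Problem: Ignores the order/sequence of words, so a negation such as “not good” is scored like its words in any other order.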
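The averaging model can be sketched as follows. The embeddings and the softmax weights `W`, `b` are invented and untrained, so the predicted probabilities are meaningless; the point is only the data flow and the fact that word order has no effect. Function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def average_embedding(tokens, embeddings, dim):
    """Mean of the embedding vectors of all known tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def predict_stars(tokens, embeddings, W, b):
    """Probabilities for 1 to 5 stars from the averaged embedding."""
    avg = average_embedding(tokens, embeddings, len(W[0]))
    scores = [sum(w * a for w, a in zip(row, avg)) + bias for row, bias in zip(W, b)]
    return softmax(scores)

# Toy 2-dimensional embeddings and untrained softmax parameters (5 classes).
embeddings = {"not": [-0.5, 0.1], "good": [0.9, 0.3], "food": [0.0, 0.8]}
W = [[-1.0, 0.0], [-0.5, 0.1], [0.0, 0.2], [0.5, 0.1], [1.0, 0.0]]
b = [0.0] * 5

p1 = predict_stars(["not", "good", "food"], embeddings, W, b)
p2 = predict_stars(["good", "food", "not"], embeddings, W, b)
print(p1)
print(p2)
```

Both orderings produce the same prediction, which illustrates the limitation of the model.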

RNN for sentiment classification

  • Extract embedding vector for each word
  • Feed into RNN with softmax output

Bias in text

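Addressing bias in word embeddings:

  1. Identify the bias direction (e.g. gender)
    • Compute differences such as $e_{he} - e_{she}$ for several pairs and average them
  2. Neutralize: For every word that is not definitional (i.e. has no legitimate gender component), project it to remove the component along the bias direction
  3. Equalize pairs: The only difference should be gender (e.g. grandfather vs. grandmother); both words should be equidistant from the neutralized words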
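The three steps above can be sketched with plain vector arithmetic. The embeddings are invented 3-dimensional toy values, and all function names are hypothetical. The `equalize` step is a simplified variant: it makes the pair symmetric around their shared bias-free component and omits the rescaling to unit length used in the published debiasing method.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def scale(u, s):
    return [a * s for a in u]

def project(e, g):
    """Component of e along the bias direction g."""
    return scale(g, dot(e, g) / dot(g, g))

def bias_direction(pairs, embeddings):
    """Step 1: average of e_a - e_b over the given word pairs."""
    diffs = [sub(embeddings[a], embeddings[b]) for a, b in pairs]
    return [sum(col) / len(diffs) for col in zip(*diffs)]

def neutralize(e, g):
    """Step 2: remove the component of e along g."""
    return sub(e, project(e, g))

def equalize(e_a, e_b, g):
    """Step 3 (simplified): same bias-free part, opposite bias components."""
    mu_orth = neutralize(scale(add(e_a, e_b), 0.5), g)
    half_gap = scale(sub(project(e_a, g), project(e_b, g)), 0.5)
    return add(mu_orth, half_gap), sub(mu_orth, half_gap)

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

embeddings = {
    "he": [-1.0, 0.1, 0.2], "she": [1.0, 0.1, 0.3],
    "man": [-0.9, 0.0, 0.5], "woman": [1.1, 0.0, 0.4],
    "grandfather": [-0.8, 0.6, 0.9], "grandmother": [1.0, 0.5, 0.8],
    "doctor": [-0.3, 0.7, 0.1],
}

g = bias_direction([("he", "she"), ("man", "woman")], embeddings)
doctor = neutralize(embeddings["doctor"], g)
gf, gm = equalize(embeddings["grandfather"], embeddings["grandmother"], g)
print(distance(gf, doctor), distance(gm, doctor))
```

After these steps the neutralized word has no component along the bias direction, and the equalized pair is equidistant from it.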
  • Last modified: 2018/06/09 18:40
  • by phreazer