3 Word Representation

Language Model

$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \dots w_{i-1})$$

N-gram: the distribution of the next word is a categorical distribution conditioned on the previous N-1 words, $P(w_i \mid w_1 w_2 \dots w_{i-1}) = P(w_i \mid w_{i-N+1} \dots w_{i-1})$

I visited San ____

  • Unigram: words are mutually independent, P(w)
  • Bigram: P(w|San)
  • 3-gram: P(w|visited San)
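
The n-gram estimates above can be computed from raw counts. The following is a minimal sketch, assuming a tiny made-up corpus and maximum-likelihood estimation with no smoothing; the helper names `ngram_counts` and `mle_prob` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(tokens, context, word):
    """Maximum-likelihood estimate of P(word | context) from raw counts."""
    context = tuple(context)
    counts = ngram_counts(tokens, len(context) + 1)
    numer = counts[context + (word,)]
    # Count the context only where it is followed by some word.
    denom = sum(c for gram, c in counts.items() if gram[:-1] == context)
    return numer / denom if denom else 0.0

# Toy corpus, invented for this example.
corpus = ("i visited san francisco . i visited san diego . "
          "we left san francisco .").split()

p_bigram = mle_prob(corpus, ["san"], "francisco")              # P(w | San)
p_trigram = mle_prob(corpus, ["visited", "san"], "francisco")  # P(w | visited San)
print(p_bigram, p_trigram)
```

With an empty context the same function reduces to the unigram relative frequency. Any word never seen after the context gets probability zero, which is the problem smoothing addresses.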


  • Interpolation: $P(w \mid \text{visited San}) = (1 - \lambda) \frac{\text{count}(\text{visited San}, w)}{\text{count}(\text{visited San})} + \lambda \frac{\text{count}(\text{San}, w)}{\text{count}(\text{San})}$
  • Discounting: $P(w \mid \text{visited San}) = \frac{\text{count}(\text{visited San}, w) - k}{\text{count}(\text{visited San})} + \lambda \frac{\text{count}(\text{San}, w)}{\text{count}(\text{San})}$
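
The two smoothing formulas can be evaluated directly from n-gram counts. This is a rough sketch under stated assumptions: the corpus is invented, `lam` and `k` are arbitrary illustrative values rather than tuned ones, and the discounted count is clipped at zero (a common convention that the formula in these notes does not spell out).

```python
from collections import Counter

# Toy corpus, invented for this example.
corpus = ("i visited san francisco . i visited san diego . "
          "we left san francisco .").split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

def ratio(numer, denom):
    """Safe division: an unseen context contributes zero."""
    return numer / denom if denom else 0.0

def interpolated(w, lam=0.3):
    """(1 - lam) * trigram estimate + lam * bigram estimate."""
    tri_term = ratio(tri[("visited", "san", w)], bi[("visited", "san")])
    bi_term = ratio(bi[("san", w)], uni["san"])
    return (1 - lam) * tri_term + lam * bi_term

def discounted(w, k=0.5, lam=0.3):
    """Subtract k from the trigram count, then add a weighted bigram term."""
    # Assumption: clip at zero so unseen trigrams do not go negative.
    tri_count = max(tri[("visited", "san", w)] - k, 0.0)
    tri_term = ratio(tri_count, bi[("visited", "san")])
    bi_term = ratio(bi[("san", w)], uni["san"])
    return tri_term + lam * bi_term

p_interp = interpolated("francisco")
p_disc = discounted("francisco")
print(p_interp, p_disc)
```

In this sketch `lam` is a free parameter in both functions. For the discounted estimate to sum to one over the vocabulary, `lam` would have to be set from the total mass removed by `k`, which the sketch does not do.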

Word Embedding

Problems with WordNet

  • Requires human labor to create and adapt
  • Impossible to keep up-to-date
  • Can’t be used to accurately compute word similarity




Hierarchical Softmax

  • A matmul plus a softmax over |V| (the number of words) is very slow to compute for CBOW and skip-gram (SG)
  • Huffman-encode the vocabulary and use a binary classifier at each inner node to decide which branch to take: O(log |V|) decisions per word instead of |V| scores
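
The idea above can be sketched as follows: build a Huffman tree over word frequencies, then score a word as the product of sigmoid branch decisions along its root-to-leaf path. This is an illustrative sketch only. The frequencies, the vector dimension, and the random vectors are made up, and the names `build_huffman` and `word_prob` are hypothetical rather than taken from any word2vec implementation.

```python
import heapq
import itertools
import math
import random

def build_huffman(freqs):
    """Return {word: [(inner_node_id, bit), ...]}, the root-to-leaf path per word."""
    tie = itertools.count()  # tie-breaker so the heap never compares node payloads
    heap = [(f, next(tie), ("leaf", w)) for w, f in freqs.items()]
    heapq.heapify(heap)
    inner_id = itertools.count()
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        node = ("inner", next(inner_id), left, right)
        heapq.heappush(heap, (f1 + f2, next(tie), node))
    paths = {}

    def walk(node, path):
        if node[0] == "leaf":
            paths[node[1]] = path
        else:
            _, nid, left, right = node
            walk(left, path + [(nid, 0)])
            walk(right, path + [(nid, 1)])

    walk(heap[0][2], [])
    return paths

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_prob(word, hidden, paths, node_vecs):
    """P(word | hidden) as a product of binary branch probabilities."""
    p = 1.0
    for nid, bit in paths[word]:
        score = sum(h * v for h, v in zip(hidden, node_vecs[nid]))
        p_right = sigmoid(score)
        p *= p_right if bit == 1 else 1.0 - p_right
    return p

# Made-up frequencies and random parameters, for illustration only.
freqs = {"the": 50, "san": 20, "francisco": 10, "diego": 5, "visited": 15}
paths = build_huffman(freqs)
rng = random.Random(0)
dim = 4
node_vecs = {nid: [rng.uniform(-1, 1) for _ in range(dim)]
             for nid in range(len(freqs) - 1)}  # one vector per inner node
hidden = [rng.uniform(-1, 1) for _ in range(dim)]

total = sum(word_prob(w, hidden, paths, node_vecs) for w in freqs)
print(total)
```

Because each inner node splits its probability mass between its two children, the word probabilities sum to one without ever computing a score for every word. Frequent words sit closer to the root, so they need fewer binary decisions.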



GloVe: Global Vectors for Word Representation


  • Pros: Efficient for large corpora
  • Cons: Relatively slow for small or medium corpora


  • It is a kind of aggregated word2vec/CBOW
    • word2vec mainly focuses on local sliding windows
    • GloVe is able to combine global and local features
  • More flexible with the values in the matrix
    • log, PMI variants, ... many tricks can be played!
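
To make the "values in the matrix" point concrete, here is a small sketch that aggregates local sliding windows into a global co-occurrence table and then applies two possible re-weightings. It assumes a toy corpus, a symmetric window, `log(1 + x)` as one illustrative log scaling, and positive PMI as one PMI variant. It does not implement the GloVe training objective itself, and the function names are hypothetical.

```python
import math
from collections import Counter

def cooccurrence(tokens, window=2):
    """Symmetric counts X[(w, c)]: how often c appears within `window` of w."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

def log_transform(counts):
    """Dampen large counts with log(1 + x), one possible choice of scaling."""
    return {pair: math.log(1 + x) for pair, x in counts.items()}

def ppmi(counts):
    """Positive PMI: max(0, log(P(w, c) / (P(w) * P(c))))."""
    total = sum(counts.values())
    row = Counter()
    for (w, _), x in counts.items():
        row[w] += x
    # The table is symmetric, so column sums equal row sums.
    return {(w, c): max(0.0, math.log(x * total / (row[w] * row[c])))
            for (w, c), x in counts.items()}

# Toy corpus, invented for this example.
corpus = "i visited san francisco . i visited san diego .".split()
X = cooccurrence(corpus, window=1)
X_log = log_transform(X)
X_ppmi = ppmi(X)
print(X[("visited", "san")], X_log[("visited", "san")], X_ppmi[("visited", "san")])
```

Only pairs that actually co-occur are stored, so the table stays sparse. Swapping `log_transform` for `ppmi` changes the values a downstream factorization would see without changing how the counts are collected.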