
3 NLP basics

Language Model

$$P(w_{1:n}) = \prod_{i=1}^{n} P(w_i \mid w_{1:i-1})$$

N-gram: the distribution of the next word is a categorical conditioned on the previous N-1 words, $P(w_i \mid w_{1:i-1}) = P(w_i \mid w_{i-N+1:i-1})$

I visited San ____

  • Unigram: words are mutually independent, $P(w)$
  • Bigram: $P(w \mid \text{San})$
  • 3-gram: $P(w \mid \text{visited San})$
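
As a minimal sketch of these estimates, here is the bigram case with plain MLE counts on a toy corpus (the corpus is made up for illustration):

```python
from collections import Counter

# Tiny toy corpus; in practice this would be a large tokenized corpus.
corpus = "i visited san francisco i visited san diego i visited new york".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("san", "francisco"))  # 0.5
print(bigram_prob("san", "diego"))      # 0.5
```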

Smoothing

  • Add-1 estimate (Laplace smoothing): $P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{\sum_w (C(w_{n-1}w) + 1)} = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V}$
  • Backoff: use trigram if you have good evidence, otherwise bigram, otherwise unigram
  • Interpolation: mix unigram, bigram, and trigram estimates (see the sketch after this list)
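
A minimal sketch of add-1 smoothing and bigram/unigram interpolation on the same kind of counts; the corpus and the interpolation weights are illustrative, not tuned on held-out data:

```python
from collections import Counter

corpus = "i visited san francisco i visited san diego".split()
V = len(set(corpus))                        # vocabulary size
N = len(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_laplace(prev, word):
    """Add-1 (Laplace) smoothed bigram: (C(prev word) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def p_interp(prev, word, lambdas=(0.7, 0.3)):
    """Interpolate the MLE bigram and unigram estimates (illustrative weights)."""
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[word] / N
    return lambdas[0] * p_bi + lambdas[1] * p_uni

print(p_laplace("san", "visited"))   # unseen bigram still gets probability mass
print(p_interp("san", "francisco"))
```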

Perplexity: inverse probability of the test set, normalized by the number of words

$$PP(W) = P(w_{1:N})^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_{1:N})}}$$
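
A self-contained sketch of this computation for an MLE bigram model, working in log space to avoid underflow; the toy corpus is illustrative, and an unsmoothed model like this would fail on any unseen bigram in the test set:

```python
import math
from collections import Counter

train = "i visited san francisco i visited san diego i visited new york".split()
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

def perplexity(tokens):
    """PP = exp(-1/N * sum of log P(w_i | w_{i-1})) over the N-1 bigram events."""
    logp = sum(math.log(bigram_prob(p, w)) for p, w in zip(tokens, tokens[1:]))
    return math.exp(-logp / (len(tokens) - 1))

print(perplexity("i visited san francisco".split()))  # lower is better
```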

Word Embedding

Problems with tf-idf (Term Frequency-Inverse Document Frequency)

  • long (length $|V|$ = 20,000 to 50,000)
  • sparse (most elements are zero)

$t_{ij} = \text{tf}_{ij} \times \text{idf}_{j} = \left( \frac{f_{ij}}{m_i} \right) \times \left( \log_2 \left( \frac{n}{n_j} \right) + 1 \right)$ where $m_i = \max(f_{ij})$
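
A minimal numpy sketch of this tf-idf variant (term frequency normalized by each document's maximum count, idf as $\log_2(n / n_j) + 1$); the count matrix is made up for illustration:

```python
import numpy as np

# Rows = documents i, columns = terms j; toy raw counts f_ij (illustrative).
f = np.array([
    [3, 0, 1],
    [0, 2, 2],
    [1, 1, 0],
], dtype=float)

n = f.shape[0]                            # number of documents
n_j = (f > 0).sum(axis=0)                 # number of documents containing term j
tf = f / f.max(axis=1, keepdims=True)     # f_ij / m_i, with m_i = max over row i
idf = np.log2(n / n_j) + 1                # log2(n / n_j) + 1
print(tf * idf)                           # t_ij
```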

Word2Vec

Dense vectors

  • Short vectors may be easier to use as features in machine learning (fewer weights to tune)
  • Dense vectors may generalize better than explicit counts
  • Dense vectors may do better at capturing synonymy (car and automobile)


Negative Sampling

  • Start with V random d-dimensional vectors as initial embeddings
  • Train a classifier based on embedding similarity
    • Take a corpus and take pairs of words that co-occur as positive examples
    • Grab k negative examples for each target word
    • Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance


$$\mathcal{L}_{\text{neg}} = -\log \left[ P(+ \mid w_t, c_{\text{pos}}) \prod_{k=1}^{K} P(- \mid w_t, c_{\text{neg}_k}) \right] = -\log \sigma(\mathbf{u}_{c_{\text{pos}}}^\top \mathbf{v}_{w_t}) - \sum_{k=1}^{K} \log \sigma(-\mathbf{u}_{c_{\text{neg}_k}}^\top \mathbf{v}_{w_t})$$

Where:

  • $K$: Number of negative samples.
  • $\mathbf{v}_{w_t}$: Input embedding of the target word $w_t$.
  • $\mathbf{u}_{c_{\text{pos}}}$: Output embedding of the positive context word $c_{\text{pos}}$.
  • $\mathbf{u}_{c_{\text{neg}_k}}$: Output embedding of a negative context word $c_{\text{neg}_k}$.
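
A minimal numpy sketch of one SGD step on the loss above for a single (target, positive context, negative samples) triple; the dimensions, learning rate, and uniform noise distribution are illustrative (word2vec samples negatives from a unigram distribution raised to the 0.75 power):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d, K = 1000, 50, 5                    # vocab size, embedding dim, negatives (illustrative)
v = rng.normal(scale=0.1, size=(V, d))   # input embeddings (target words)
u = rng.normal(scale=0.1, size=(V, d))   # output embeddings (context words)

def sgns_step(t, c_pos, c_negs, lr=0.05):
    """One SGD step on L = -log sigma(u_pos . v_t) - sum_k log sigma(-u_neg_k . v_t)."""
    g_pos = sigmoid(u[c_pos] @ v[t]) - 1.0     # gradient of L w.r.t. (u_pos . v_t)
    g_neg = sigmoid(u[c_negs] @ v[t])          # gradients w.r.t. (u_neg_k . v_t), shape (K,)
    grad_v = g_pos * u[c_pos] + g_neg @ u[c_negs]
    u[c_pos]  -= lr * g_pos * v[t]
    u[c_negs] -= lr * g_neg[:, None] * v[t]    # duplicate negatives are not accumulated in this sketch
    v[t]      -= lr * grad_v

c_negs = rng.integers(0, V, size=K)            # noise words (uniform here for simplicity)
sgns_step(t=1, c_pos=2, c_negs=c_negs)
```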

Hierarchical Softmax

  • Matmul + softmax over $|V|$ (the number of words) is very slow to compute for CBOW and SG
  • Huffman-encode the vocabulary and use binary classifiers to decide which branch to take at each inner node, reducing the cost to $O(\log |V|)$ (sketched below)
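
A minimal sketch of how one word's probability is scored under hierarchical softmax: a product of sigmoid decisions at the inner nodes on the word's path; the tiny tree, path, and dimensions are hand-built for illustration rather than a real Huffman coding:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8
node_vecs = rng.normal(scale=0.1, size=(3, d))   # one vector per inner node (3 inner nodes here)

# A word's path from the root: (inner_node_index, branch) pairs, branch = +1 (left) or -1 (right).
# A real Huffman tree gives frequent words shorter paths.
word_path = [(0, +1), (2, -1)]

def hs_prob(h, path):
    """P(word | h) = product of sigma(branch * node_vec . h) along the word's path."""
    p = 1.0
    for node, branch in path:
        p *= sigmoid(branch * (node_vecs[node] @ h))
    return p

h = rng.normal(size=d)      # e.g. the averaged context embedding in CBOW
print(hs_prob(h, word_path))
```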


GloVe

Constructs Global Vectors for Word Representation from global word-word co-occurrence statistics.
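
For reference (the objective itself is not stated above), GloVe fits the dot products of word vectors to log co-occurrence counts with a weighted least-squares objective:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the global co-occurrence count of words $i$ and $j$, and $f$ is a weighting function that caps the influence of very frequent pairs.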

Efficiency

  • Pros: Efficient for large corpora
  • Cons: Relatively slow for small or medium corpora

Effectiveness

  • It is a kind of aggregated word2vec/CBOW
    • word2vec mainly focuses on local sliding windows
    • GloVe is able to combine global and local features
  • More flexible with the values in the co-occurrence matrix
    • log, PMI variants, ... many tricks can be played!

Sequence labeling

Part of Speech Tagging


Named Entity Recognition

Most common entity types

  • PER (Person): "Marie Curie"
  • LOC (Location): "New York City"
  • ORG (Organization): "Stanford University"
  • GPE (Geo-Political Entity): "Boulder, Colorado"

IO vs BIO vs BIOES

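A small hand-labeled illustration of the three schemes on one sentence:

```python
tokens = ["Marie", "Curie", "visited", "New", "York", "City"]

io_tags    = ["I-PER", "I-PER", "O", "I-LOC", "I-LOC", "I-LOC"]
bio_tags   = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
bioes_tags = ["B-PER", "E-PER", "O", "B-LOC", "I-LOC", "E-LOC"]

# IO:    only Inside/Outside tags -- adjacent entities of the same type cannot be separated.
# BIO:   B marks the Beginning of each entity span.
# BIOES: adds E (End) and S (Single-token entity) for explicit span boundaries.
```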

HMM (Hidden Markov Models)

Tag assignment: $\mathbf{w} = w_{1..n}$, $\mathbf{t} = t_{1..n}$

Bigram Assumption:

$$P(t_1, t_2, \ldots, t_n, w_1, w_2, \ldots, w_n) = P(t_1) \cdot \prod_{k} P(t_k \mid t_{k-1}) \cdot \prod_{k} P(w_k \mid t_k)$$

where

  • Transition probabilities $P(t_k \mid t_{k-1})$: the probability of transitioning from one state to another.
  • Emission probabilities $P(w_k \mid t_k)$: the probability of observing a specific observation given a state.
  • Decoding: $\mathbf{t}^* = \text{argmax}_{\mathbf{t}} P(\mathbf{w}, \mathbf{t})$, computed with the Viterbi algorithm (sketched below).
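
A minimal Viterbi sketch for the decoding step $\mathbf{t}^* = \text{argmax}_{\mathbf{t}} P(\mathbf{w}, \mathbf{t})$ under the bigram HMM above; the tag set and probability tables are toy values for illustration:

```python
import math

tags = ["DET", "NOUN", "VERB"]

# Toy parameters (illustrative, not estimated from a corpus).
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}                    # P(t_1)
trans = {"DET":  {"DET": 0.1, "NOUN": 0.8, "VERB": 0.1},          # P(t_k | t_{k-1})
         "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
         "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
emit = {"DET":  {"the": 0.9,  "dog": 0.05, "barks": 0.05},        # P(w_k | t_k)
        "NOUN": {"the": 0.05, "dog": 0.8,  "barks": 0.15},
        "VERB": {"the": 0.05, "dog": 0.15, "barks": 0.8}}

def viterbi(words):
    """Return the most probable tag sequence under the bigram HMM (log space)."""
    V = [{t: (math.log(start[t]) + math.log(emit[t][words[0]]), [t]) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            score, path = max(
                (V[-1][prev][0] + math.log(trans[prev][t]) + math.log(emit[t][w]),
                 V[-1][prev][1] + [t])
                for prev in tags)
            col[t] = (score, path)
        V.append(col)
    return max(V[-1].values())[1]

print(viterbi(["the", "dog", "barks"]))   # expected: ['DET', 'NOUN', 'VERB']
```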