Probabilistic language models

Introduction

Probabilistic language models can predict future words given a history of words.

This can be used for predictive text. For example, if a user types "Did you call your", we may want to estimate the probability that the next word is "child".

We can state this problem as:

\(P(child|did\ you\ call\ your)\)

By definition this is:

\(P(child|did\ you\ call\ your)= \dfrac{P(did\ you\ call\ your\ child)}{P(did\ you\ call\ your)}\)

We can estimate each of these from corpus counts:

\(P(did\ you\ call\ your\ child)=\dfrac{|did\ you\ call\ your\ child|}{|5\ word\ sentences|}\)

\(P(did\ you\ call\ your)=\dfrac{|did\ you\ call\ your|}{|4\ word\ sentences|}\)
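
As a sketch, these estimates can be computed by counting word sequences in a corpus. The toy corpus and the ngram_counts helper below are illustrative assumptions, following the counting scheme above:

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count each contiguous n-word sequence, plus the total number of such sequences."""
    counts, total = Counter(), 0
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
            total += 1
    return counts, total

# Toy corpus, purely illustrative.
corpus = [
    "did you call your child",
    "did you call your mother",
    "did you see your child",
]

five_grams, n5 = ngram_counts(corpus, 5)
four_grams, n4 = ngram_counts(corpus, 4)

p_sentence = five_grams[("did", "you", "call", "your", "child")] / n5  # estimate of P(did you call your child)
p_history = four_grams[("did", "you", "call", "your")] / n4            # estimate of P(did you call your)
print(p_sentence / p_history)  # estimate of P(child | did you call your)
```

With a corpus this small the estimates are unreliable, which is the problem described next.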

Data requirements

Estimating these counts directly requires a very large corpus, because most long word sequences occur rarely or never. This may not be practical.

Additionally, the words must be indexed in order, not simply stored as a bag of words.

Decomposition

We can decompose the probabilities using the chain rule.

\(P(did\ you\ call\ your\ child)=P(did)P(you|did)...P(child|did\ you\ call\ your)\)

\(P(w_1,...,w_k)= \prod_{i=1}^{k} P(w_i|w_1,...,w_{i-1})\)
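
As a minimal sketch of this decomposition, the snippet below multiplies conditional probabilities along a sentence. The cond_prob values are hypothetical stand-ins for estimates that would normally come from a corpus:

```python
# Hypothetical conditional-probability estimates: (word, history) -> probability.
cond_prob = {
    ("did", ()): 0.1,
    ("you", ("did",)): 0.4,
    ("call", ("did", "you")): 0.2,
    ("your", ("did", "you", "call")): 0.5,
    ("child", ("did", "you", "call", "your")): 0.3,
}

def sentence_probability(words):
    """Chain rule: P(w_1,...,w_k) = product over i of P(w_i | w_1,...,w_{i-1})."""
    p = 1.0
    for i, word in enumerate(words):
        p *= cond_prob[(word, tuple(words[:i]))]
    return p

print(sentence_probability(["did", "you", "call", "your", "child"]))
# 0.1 * 0.4 * 0.2 * 0.5 * 0.3 = 0.0012
```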

N-grams

We can simplify the decomposition using the Markov assumption:

\(P(w_k|w_1,...,w_{k-1})=P(w_k|w_{k-1})\)

Conditioning on only the previous word gives a bigram (\(2\)-gram) model.

More generally, conditioning on the previous \(n-1\) words gives an \(n\)-gram model:

\(P(w_k|w_1,...,w_{k-1})\approx P(w_k|w_{k-n+1},...,w_{k-1})\)
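
Under this assumption a bigram model only needs counts of single words and word pairs. Below is a minimal sketch; the toy corpus and the "<s>" start-of-sentence marker are illustrative assumptions:

```python
from collections import Counter

# Toy corpus, purely illustrative.
corpus = [
    "did you call your child",
    "did you call your mother",
    "you call your child",
]

context_counts = Counter()  # how often each word appears as a context (previous word)
bigram_counts = Counter()   # how often each (previous word, word) pair appears
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # "<s>" marks the start of a sentence
    for prev, cur in zip(words, words[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, cur)] += 1

def bigram_prob(prev, cur):
    """Maximum-likelihood estimate of P(cur | prev)."""
    return bigram_counts[(prev, cur)] / context_counts[prev]

def sentence_prob(sentence):
    """Chain rule under the Markov assumption: multiply bigram probabilities."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sentence_prob("did you call your child"))  # 4/9 with this toy corpus
```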

Smoothing

We can use smoothing to address small corpora, where many valid word sequences never appear and would otherwise be given zero probability.

\(P(did\ you\ call\ your\ child)=\dfrac{|did\ you\ call\ your\ child|+1}{|5\ word\ sentences|+V}\)

\(P(did\ you\ call\ your)=\dfrac{|did\ you\ call\ your|+1}{|4\ word\ sentences|+V}\)

where \(V\) is the vocabulary size. This is add-one (Laplace) smoothing.
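
A minimal sketch of add-one smoothing applied to bigram counts, assuming \(V\) is the vocabulary size; the toy corpus is illustrative:

```python
from collections import Counter

# Toy corpus, purely illustrative.
corpus = [
    "did you call your child",
    "did you call your mother",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

V = len(unigram_counts)  # vocabulary size (number of distinct word types)

def smoothed_bigram_prob(prev, cur):
    """Add one to every bigram count so unseen pairs get a non-zero probability."""
    return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + V)

print(smoothed_bigram_prob("your", "child"))  # seen bigram: (1 + 1) / (2 + 6) = 0.25
print(smoothed_bigram_prob("call", "child"))  # unseen bigram, still non-zero: 1 / 8
```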

Perplexity

We can compare probabilistic language models using perplexity.

We can then choose the model with the lowest perplexity on held-out text.

\(perplexity(w_1, w_2, ..., w_n)=P(w_1, w_2, ..., w_n)^{-\dfrac{1}{n}}\)

We can expand this:

\(perplexity(w_1, w_2, ..., w_n)=\prod_{i=1}^{n} P(w_i| w_1, ..., w_{i-1})^{-\dfrac{1}{n}}\)

Depending on which \(n\)-gram model we use, we can then simplify each conditional probability accordingly.
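
A minimal sketch of perplexity for an add-one-smoothed bigram model, computed in log space to avoid numerical underflow; the toy training corpus and the "<s>" marker are illustrative assumptions:

```python
import math
from collections import Counter

# Toy training corpus, purely illustrative.
train = [
    "did you call your child",
    "did you call your mother",
]

context_counts = Counter()
bigram_counts = Counter()
for sentence in train:
    words = ["<s>"] + sentence.split()
    context_counts.update(words[:-1])            # each word except the last acts as a context
    bigram_counts.update(zip(words, words[1:]))

V = len(set(word for sentence in train for word in sentence.split())) + 1  # vocabulary plus "<s>"

def prob(prev, cur):
    """Add-one smoothed bigram probability, so the perplexity is always finite."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + V)

def perplexity(sentence):
    """perplexity(w_1,...,w_n) = P(w_1,...,w_n)^(-1/n)."""
    words = sentence.split()
    log_p = sum(math.log(prob(prev, cur)) for prev, cur in zip(["<s>"] + words, words))
    return math.exp(-log_p / len(words))

print(perplexity("did you call your child"))  # lower perplexity: a likely sentence
print(perplexity("child your call you did"))  # higher perplexity: an unlikely word order
```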
