In general, language modeling is the task of predicting what word comes next.
Types of LMs
- Statistical LM (N-gram models)
- Neural LM (e.g. RNNs)
- Pre-trained LM (e.g. BERT)
- Large LM (LLM) (e.g. ChatGPT, Claude)
N-gram Language Models
Given a sequence of words $x_1, x_2, \dots, x_t$, compute the probability distribution of the next word: $P(x_{t+1} = w_j \mid x_1, x_2, \dots, x_t)$, where $w_j$ is a word in the vocabulary $V = \{w_1, \dots, w_{|V|}\}$.
“An n-gram is a sequence of n words: a 2-gram (which we’ll call bigram) is a two-word sequence of words like The water, or water of, and a 3-gram (a trigram) is a three-word sequence of words like The water of, or water of Walden. But we also (in a bit of terminological ambiguity) use the word ‘n-gram’ to mean a probabilistic model that can estimate the probability of a word given the n-1 previous words, and thereby also to assign probabilities to entire sequences.” Speech and Language Processing, Chapter 3 - N-gram Language Models
Finding the probability of the next word:
- Numerator: Probability of an n-gram
- Denominator: Probability of an (n-1)-gram
Can we simply count and divide the number of occurrences in text? NO! There are too many possible sentences; we’ll never see enough data to estimate the probabilities.
“If we had a large enough corpus, we could compute these two counts and estimate the probability from Eq. 3.2. But even the entire web isn’t big enough to give us good estimates for counts of entire sentences. This is because language is creative; new sentences are invented all the time, and we can’t expect to get accurate counts for such large objects as entire sentences. For this reason, we’ll need more clever ways to estimate the probability of a word w given a history h, or the probability of an entire word sequence W.” Speech and Language Processing, Chapter 3 - N-gram Language Models
Markov Assumption
The probability of moving to the next state depends only on the present state and not on the previous states. Refer to the Chain Rule section in 1. Probability.
Because of this assumption, the previous equation can be rewritten as follows (use the chain rule to see how terms cancel out; in the end, the equation reduces to a ratio of counts):
- ==Let’s say we have 10 words in our vocabulary and Count(I want) = 20. Does this mean P(I want) = 20 / (10 · 10 · 10) = 20/1000? No: the MLE estimate divides the bigram count by the total number of observed bigram tokens (or, for the conditional P(want | I), by Count(I)), not by the number of possible word combinations.==
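To make this concrete, here is a minimal sketch (toy corpus assumed for illustration) of how bigram probabilities are estimated by counting — the denominator is the number of observed bigram tokens (for the joint probability), or the count of the first word (for the conditional), not a power of the vocabulary size:

```python
from collections import Counter

# Toy corpus; "i want" appears twice.
corpus = "i want food i want water you want food".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Joint probability: count of the bigram over total bigram tokens.
p_joint = bigrams[("i", "want")] / sum(bigrams.values())  # 2/8 = 0.25

# Conditional probability: count of the bigram over the count of its first word.
p_cond = bigrams[("i", "want")] / unigrams["i"]           # 2/2 = 1.0

print(p_joint, p_cond)
```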
Intrinsic evaluation of N-gram models: Perplexity
Intrinsic evaluation is usually a bad approximation of downstream performance unless the test data looks just like the training data. Generally only useful in pilot experiments, but still helpful to think about.
“The perplexity (sometimes abbreviated as PP or PPL) of a language model on a test set is the inverse probability of the test set (one over the probability of the test set), normalized by the number of words (or tokens).” Speech and Language Processing, Chapter 3 - N-gram Language Models
- Lower perplexity = better model. A perplexity of 1 means the model always assigns probability 1 to the correct word.
(Use Chain Rule to find $P(W)$)
Example: W = “the cat sat” where $P(\text{“the”}) = 0.5$, $P(\text{“cat”} \mid \text{“the”}) = 0.25$, and $P(\text{“sat”} \mid \text{“the cat”}) = 0.5$.
$N$ is the number of words in the test sentence, not the number of words in vocabulary.
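The perplexity for this example can be worked out directly (a small sketch using the probabilities given above):

```python
# Perplexity of W = "the cat sat".
probs = [0.5, 0.25, 0.5]   # P("the"), P("cat" | "the"), P("sat" | "the cat")

p_W = 1.0
for p in probs:
    p_W *= p               # chain rule: P(W) = product of the conditionals
# p_W = 0.0625

N = len(probs)             # number of words in the test sentence
perplexity = (1 / p_W) ** (1 / N)
print(perplexity)          # 16^(1/3) ≈ 2.52
```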
Dealing with unseen N-grams: Smoothing
N-grams don’t work well for predicting words that were not seen in the training data. For example, if you don’t have “cat jumped” in your training data, that n-gram will never be predicted.
Suppose Count(“cat”) = 15, with Count(“cat sat”) = 10, Count(“cat ran”) = 5, and Count(“cat jumped”) = 0:
- P(“sat” | “cat”) = 10/15 = 0.67
- P(“ran” | “cat”) = 5/15 = 0.33
- **P(“jumped” | “cat”) = 0/15 = 0 ← This case is impossible for the model, making perplexity infinite**
Laplace smoothing adds one to each count (hence its alternate name add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations.
- N = Total number of word tokens (all words in the corpus, counts repeats)
- V = Vocabulary size (number of unique words)
The previous example becomes:
- P(“sat” | “cat”) = $\frac{11}{15 + V}$
- P(“ran” | “cat”) = $\frac{6}{15 + V}$
- P(“jumped” | “cat”) = $\frac{1}{15 + V}$
“Laplace smoothing does not perform well enough to be used in modern n-gram models, but it usefully introduces many of the concepts that we see in other smoothing algorithms, gives a useful baseline, and is also a practical smoothing algorithm for other tasks like text classification.” Speech and Language Processing, Chapter 3 - N-gram Language Models
In practice, add-one gives far too much probability mass to unseen events (overestimation) and too little to seen events (their probabilities become tiny). A side effect is loss of discrimination: as probabilities shrink, events that are normally more likely become hard to distinguish from low-probability events. Add-one smoothing is also sensitive to vocabulary size: if V is large, probabilities become even smaller.
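A minimal sketch of add-one smoothing on the counts from the “cat” example above (the vocabulary size V = 10,000 is an assumed value for illustration):

```python
from collections import Counter

V = 10_000  # assumed vocabulary size for illustration

# Counts from the example: Count("cat") = 15, Count("cat sat") = 10,
# Count("cat ran") = 5, Count("cat jumped") = 0.
unigram = Counter({"cat": 15})
bigram = Counter({("cat", "sat"): 10, ("cat", "ran"): 5})

def p_laplace(w, prev):
    # Add-one smoothing: (Count(prev w) + 1) / (Count(prev) + V)
    return (bigram[(prev, w)] + 1) / (unigram[prev] + V)

print(p_laplace("sat", "cat"))     # 11 / (15 + V)
print(p_laplace("jumped", "cat"))  # 1 / (15 + V) — no longer zero
```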
Other smoothing methods:
- Add-k smoothing (add a smaller value $k<1$ instead of 1)
- Witten-Bell smoothing
- ==Kneser-Ney smoothing==
- ==Good Turing smoothing==
Interpolation
If we are trying to compute $P(w_n|w_{n−2}w_{n−1})$ but we have no examples of a particular trigram $w_{n−2}w_{n−1}w_n$ , we can instead estimate its probability by using the bigram probability $P(w_n|w_{n−1})$. Similarly, if we don’t have counts to compute $P(w_n|w_{n−1})$, we can look to the unigram $P(w_n)$. In other words, sometimes using less context can help us generalize more for contexts that the model hasn’t learned much about.
$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2} w_{n-1})$, where the λs sum to 1.
How to learn the λs? Find the values that maximize the probability of the held-out (validation) set.
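A minimal sketch of linear interpolation over unigram, bigram, and trigram estimates (the λ values here are assumed for illustration, not learned):

```python
# Assumed interpolation weights; in practice they are tuned on held-out data.
lambdas = (0.1, 0.3, 0.6)
assert abs(sum(lambdas) - 1) < 1e-12   # the λs must sum to 1

def interp_prob(p_uni, p_bi, p_tri, lambdas=lambdas):
    # Weighted mixture of the unigram, bigram, and trigram estimates.
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even if the trigram is unseen (p_tri = 0), the bigram and unigram
# estimates still contribute, so the result is nonzero:
print(interp_prob(p_uni=0.01, p_bi=0.2, p_tri=0.0))  # 0.061
```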
Alternative interpolation technique: Absolute Discounting Interpolation
Stupid Backoff
Instead of assigning weights to different n-gram orders, we can use a “backoff” strategy. Backoff gives up on making the language model a true probability distribution: if the higher-order n-gram has a zero count, the model simply falls back to the (scaled) count of a lower-order n-gram.
Limitations of N-grams
- Data Sparsity: As N increases, the number of possible N-grams grows exponentially (this also introduces storage issues). This means many valid N-grams will never appear in your training data, making it impossible to assign them meaningful probabilities. Even moderately-sized corpora will have limited coverage for N>3.
- 1-grams: 3 possibilities (“cat”, “dog”, “ran”)
- 2-grams: 3 × 3 = 9 possibilities (“cat cat”, “cat dog”, “cat ran”, “dog cat”, “dog dog”, …)
- 3-grams: 3 × 3 × 3 = 27 possibilities (“cat cat cat”, “cat cat dog”, “cat cat ran”, …)
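The exponential growth above can be verified directly with the 3-word toy vocabulary:

```python
from itertools import product

vocab = ["cat", "dog", "ran"]

# Number of possible n-grams grows as |V|^n.
for n in (1, 2, 3):
    print(n, len(list(product(vocab, repeat=n))))  # 3, 9, 27
```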
What does “most valid N-grams will never appear in your training data” mean? Let’s say you have a vocabulary of 50,000 words and you’re building a 3-gram model.
- Theoretically possible 3-grams: 50,000³ = 125 trillion different combinations
- Your training corpus: Let’s say 1 billion words total
- That means roughly 1 billion 3-grams in your data
- Many will be duplicates (“the cat sat” appears many times)
- Unique 3-grams you actually see: maybe 100 million. So you’ve observed 100 million out of 125 trillion possible 3-grams (~0.00008%)!
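The coverage arithmetic above, checked directly (the 100 million unique-trigram figure is the assumed estimate from the example):

```python
V = 50_000
possible_trigrams = V ** 3            # 125 trillion combinations
observed_unique = 100_000_000         # assumed: ~100M unique trigrams seen

coverage = observed_unique / possible_trigrams
print(f"{coverage:.5%}")              # ≈ 0.00008%
```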
- Context Window: N-grams only capture local context within a fixed window of N words. They can’t model long-range dependencies—for example, a 3-gram can only look at 2 preceding words, missing important relationships between words further apart in a sentence.
- No Semantic Understanding: N-grams are purely statistical patterns—they don’t understand meaning. “dog” and “canine” are treated as completely different tokens, even though they’re synonyms. Similarly, “not good” and “bad” have no recognized relationship.
- Out-of-Vocabulary Words: N-grams can’t handle words they’ve never seen during training. There’s no mechanism to reason about unknown words based on morphology or context. They also do not generalize well compared to other types of LMs.
- Rigid Word Order: They’re extremely sensitive to exact word sequences. “The cat sat” and “The feline sat” are completely different 3-grams, even though they express similar ideas.