In general, language modeling is the task of predicting what word comes next.

Types of LMs

  • Statistical LM (N-gram models)
  • Neural LM (e.g. RNNs)
  • Pre-trained LM (e.g. BERT)
  • Large LM (LLM) (e.g. ChatGPT, Claude)

N-gram Language Models

Given a sequence of words compute the probability distribution of the next word where is a word in the vocabulary .

“An n-gram is a sequence of n words: a 2-gram (which we’ll call bigram) is a two-word sequence of words like The water, or water of, and a 3-gram (a trigram) is a three-word sequence of words like The water of, or water of Walden. But we also (in a bit of terminological ambiguity) use the word ‘n-gram’ to mean a probabilistic model that can estimate the probability of a word given the n-1 previous words, and thereby also to assign probabilities to entire sequences.” Speech and Language Processing, Chapter 3 - N-gram Language Models

Finding the probability of the next word:

  • Numerator: Probability of an n-gram
  • Denominator: Probability of an (n-1)-gram

Can we just simply count and divide number of occurrences in text? NO! There are too many possible sentences. We’ll never see enough data for estimating the probabilities.

“If we had a large enough corpus, we could compute these two counts and estimate the probability from Eq. 3.2. But even the entire web isn’t big enough to give us good estimates for counts of entire sentences. This is because language is creative; new sentences are invented all the time, and we can’t expect to get accurate counts for such large objects as entire sentences. For this reason, we’ll need more clever ways to estimate the probability of a word w given a history h, or the probability of an entire word sequence W.” Speech and Language Processing, Chapter 3 - N-gram Language Models

Markov Assumption

The probability of moving to the next state depends only on the present state and not on the previous states. Refer to the Chain Rule section in 1. Probability.

Because of this assumption, previous equation can be written as follows (Use the chain rule to see how terms cancel out. In the end, the equation can be written as a division of counts):

Simplifying assumption:

  • P(“the” | “its water is so transparent that”) = P(“the”|”that”) OR
  • P(“the” | “its water is so transparent that”) = P(“the”|”transparent that”) depending on the model you choose

Practical issues: For computing efficiency, we do everything in log space to avoid underflow.

Why use Markov Assumption?

  1. Because it solves the sparsity issue. Without the assumption probability of the next word would almost always be 0.
  2. Computational efficiency. It takes less time to compute for words on a limited frame.
  3. Storage efficiency. It’s easier to store smaller n-grams.

Evaluation of N-gram Models

Extrinsic evaluation of N-gram models
  • Actually put the models (A and B) in a downstream task (spelling correction, translation) and compare accuracies.
  • Limitation: Time consuming
Intrinsic evaluation of N-gram models: Perplexity

Usually a bad approximation unless the test data looks just like the training data. Generally, only useful in pilot experiments, but still helpful to think about.

“The perplexity (sometimes abbreviated as PP or PPL) of a language model on a test set is the inverse probability of the test set (one over the probability of the test set), normalized by the number of words (or tokens).” Speech and Language Processing, Chapter 3 - N-gram Language Models

  • Lower perplexity = better model. Perplexity of “1” always assigns probability 1 to the correct word).

(Use Chain Rule to find )

Example: W = “the cat sat” where , and

is the number of words in the test sentence, not the number of words in vocabulary.

What is the perplexity of a sentence if it contains a word that is not in the vocabulary? Technically, infinite. This is also called the “zero probability” problem since the it is division by zero. Two ways to deal with this: <UNK> tokens or smoothing. You can assign rare or unseen words to special <UNK> tokens so that they also get a probability assignment. Or you can use both at the same time: create the tokens, apply smoothing and assign any unknown word in the test set to the unknown token.

Because perplexity depends entirely on the vocabulary size and how out-of-vocabulary words are handled, you cannot compare the perplexity of two models unless they use the exact same vocabulary. A model with a tiny vocabulary might have a “better” (lower) perplexity simply because it’s ignoring the complexity of rare words by dumping them all into a single <UNK> bucket.

Dealing with unseen N-grams: Smoothing

N-grams don’t work well for predicting words that were not seen in the training data. For example, if you don’t have “cat jumped” in your training data, that n-gram will never be predicted.

  • P(“sat” | “cat”) = 10/15 = 0.67
  • P(“ran” | “cat”) = 5/15 = 0.33
  • P(“jumped” | “cat”) = 0/15 = 0 ← This case is impossible for the model, making perplexity infinite

Laplace smoothing adds one to each count (hence its alternate name add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations.

  • N = Total number of word tokens (all words in the corpus, counts repeats)
  • V = Vocabulary size (number of unique words)

Previous example becomes,

  • P(“sat” | “cat”) =
  • P(“ran” | “cat”) =
  • P(“jumped” | “cat”) =

“Laplace smoothing does not perform well enough to be used in modern n-gram models, but it usefully introduces many of the concepts that we see in other smoothing algorithms, gives a useful baseline, and is also a practical smoothing algorithm for other tasks like text classification.” Speech and Language Processing, Chapter 3 - N-gram Language Models

In practice, add-one gives way too much probability mass to unseen events (overestimation) and too little to seen events (probabilities for seen events become tiny). A side effect of this is loss of discrimination. As probabilities get smaller, events that are normally more likely are not represented well (they get similar to low probability events). Add-one smoothing is also sensitive to vocabulary size, if V is large probabilities become even smaller.

Other smoothing methods:

  • Add-k smoothing (add a smaller value instead of 1)
  • Good Turing smoothing/estimation
  • Witten-Bell smoothing
  • Kneser-Ney smoothing (state-of-the-art for n-grams)

Add-k Smoothing

Unigram Prior Smoothing

where

Good-Turing Estimation A more sophisticated way compared Add-1 Smoothing is Good-Turing Estimation. The idea: reallocate the probability distribution of n-grams that were seen once to n-grams that were never seen.

where

Example: We have; 10 carps, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel

Probably of and where

Kneser-Ney Smoothing Instead of P(w): “How likely is w” ask, P(continuation(w)): “How likely is w to appear as a novel continuation?“. Example, “Francisco” is more common than “glasses” but “Francisco” always follows “San”. Kneser-Ney Smoothing calculates how many times w appears as a novel continuation normalized by the total number of bigram types.

Key Concepts

  • Absolute Discounting: The model subtracts a small fixed value from the counts of common phrases. This “leftover” probability is saved to help guess words it hasn’t seen in that specific context.
  • Continuation Probability: This is the “secret sauce.” Instead of asking “How common is this word?”, it asks “In how many different contexts does this word appear?”
  • Versatility vs. Frequency: A word like “apple” is versatile because it follows many words (eat, red, big). A word like “Francisco” is not versatile because it usually only follows one word (San). Kneser-Ney gives higher odds to versatile words when it’s unsure.

Interpolation

If we are trying to compute but we have no examples of a particular trigram , we can instead estimate its probability by using the bigram probability . Similarly, if we don’t have counts to compute , we can look to the unigram . In other words, sometimes using less context can help us generalize more for contexts that the model hasn’t learned much about.

Sum of λs should equal to 1.

How to learn λs? Find optimal parameters that maximizes the probabilities of the held-out (validation) set.

Absolute Discounting Interpolation

Stupid Backoff

Instead of assigning weights to different n-grams, we can use a “backoff” strategy. Backoff strategy gives up on trying to make the language model a probability distribution. If the higher order n-gram count has a zero count, this strategy simply uses the count of a lower order n-gram.

Dealing with Huge Web Scale N-grams

  • Pruning: Only store n-grams with count > threshold OR apply entropy based pruning

Limitations of N-grams

  • Data Sparsity: As N increases, the number of possible N-grams grows exponentially (this also introduces storage issues). This means many valid N-grams will never appear in your training data, making it impossible to assign them meaningful probabilities. Even moderately-sized corpora will have limited coverage for N>3.
    • 1-grams: 3 possibilities (“cat”, “dog”, “ran”)
    • 2-grams: 3 × 3 = 9 possibilities (“cat cat”, “cat dog”, “cat ran”, “dog cat”, “dog dog”, …)
    • 3-grams: 3 × 3 × 3 = 27 possibilities (“cat cat cat”, “cat cat dog”, “cat cat ran”, …)

What does “many valid N-grams will never appear in your training data” mean? Let’s say you have a vocabulary of 50,000 words and you’re building a 3-gram model.

  • Theoretically possible 3-grams: 50,000³ = 125 trillion different combinations
  • Your training corpus: Let’s say 1 billion words total
    • That means roughly 1 billion (meaningful) 3-grams in your data
    • Many will be duplicates (“the cat sat” appears many times)
    • Unique 3-grams you actually see: Maybe 100 million So you’ve observed 100 million out of 125 trillion possible 3-grams (~0.00008%)!
  • Context Window: N-grams only capture local context within a fixed window of N words. They can’t model long-range dependencies—for example, a 3-gram can only look at 2 preceding words, missing important relationships between words further apart in a sentence.
  • No Semantic Understanding: N-grams are purely statistical patterns—they don’t understand meaning. “dog” and “canine” are treated as completely different tokens, even though they’re synonyms. Similarly, “not good” and “bad” have no recognized relationship.
  • Out-of-Vocabulary Words: N-grams can’t handle words they’ve never seen during training. There’s no mechanism to reason about unknown words based on morphology or context. They also do not generalize well compared to other types of LMs.
  • Rigid Word Order: They’re extremely sensitive to exact word sequences. “The cat sat” and “The feline sat” are completely different 3-grams, even though they express similar ideas.