5. Tokenization

Tokenization Strategies

Character Tokenization: The simplest tokenization scheme is to feed each character individually to the model.
- Ignores any structure in the text and treats the whole string as a stream of characters.
- Helps deal with misspellings and rare words.
- Main drawback: linguistic structures such as words must be learned from the data, which requires significant compute, memory, and training data.
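
As a minimal sketch of the character-level scheme described above, the snippet below maps each unique character to an integer ID and encodes a string one character at a time. The sample text and the names `vocab`, `input_ids`, and `id_to_char` are illustrative assumptions, not part of any particular library.

```python
# Sample input; any string works here.
text = "Tokenizing text"

# Build the vocabulary from the unique characters, sorted for a stable mapping.
vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}

# Encode: one integer ID per character, so the sequence is as long as the string.
input_ids = [vocab[ch] for ch in text]

# Decode: invert the mapping to recover the original string.
id_to_char = {idx: ch for ch, idx in vocab.items()}
decoded = "".join(id_to_char[i] for i in input_ids)

print(len(vocab), len(input_ids))
print(decoded)
```

The vocabulary stays tiny (one entry per distinct character), but the encoded sequence has one ID per character, which is where the extra burden on the model comes from.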
Word Tokenization: Split the text into words and map each word to an integer.
- One simple way to do this is to split sentences on whitespace.
- Because words come with declensions, conjugations, and misspellings, the vocabulary can easily grow into the millions. This results in models with many parameters, which are expensive to train.
- A common approach is to limit the vocabulary by keeping only, say, the 100,000 most common words in the corpus and mapping rare or unknown words to an <UNK> token.
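
The following sketch illustrates whitespace word tokenization with a capped vocabulary and an <UNK> fallback. The two-sentence corpus, the tiny `max_vocab_size`, and the `encode` helper are hypothetical choices made for illustration; a real setup would use a far larger corpus and limit.

```python
from collections import Counter

# Toy corpus (an assumption for illustration).
corpus = ["the cat sat on the mat", "the dog sat on the log"]
max_vocab_size = 4  # hypothetical cap, including the <UNK> entry

# Count words after splitting each sentence on whitespace.
counts = Counter(word for sentence in corpus for word in sentence.split())

# Reserve ID 0 for unknown words, then keep only the most frequent words.
vocab = {"<UNK>": 0}
for word, _ in counts.most_common(max_vocab_size - 1):
    vocab[word] = len(vocab)

def encode(sentence):
    """Map each whitespace-separated word to its ID, or to <UNK> if missing."""
    return [vocab.get(word, vocab["<UNK>"]) for word in sentence.split()]

print(vocab)
print(encode("the zebra sat on the mat"))
```

Every word outside the capped vocabulary collapses to the same ID, so the model cannot distinguish "zebra" from "mat" here; that loss of information is the cost of keeping the vocabulary small.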
Subword Tokenization: Combine the best aspects of character and word tokenization.
- Idea: Split rare words into smaller units so the model can deal with complex words and misspellings.
- Keep frequent words as single tokens so that input lengths stay manageable.
- Advantage for unknown words: Instead of mapping a new word like “unhappily” to a single <UNK> token because it wasn’t in the training set, the tokenizer can decompose it into un-, happi, and -ly. The model can still assign a meaningful probability to the sequence because it has learned how those individual sub-units behave. The result is a fixed-size vocabulary that can still represent an open-ended set of words.
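
To make the decomposition concrete, here is a minimal sketch that splits a word by greedy longest-match against a fixed subword vocabulary. It is an assumption-laden illustration: the hand-picked `toy_vocab` and the `subword_tokenize` helper are hypothetical, and real subword tokenizers learn their vocabulary from a corpus rather than using a hand-written one.

```python
def subword_tokenize(word, vocab, unk="<UNK>"):
    """Split `word` into the longest matching pieces found in `vocab`."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position; give up on the whole word.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens

# Hand-picked vocabulary for illustration only.
toy_vocab = {"un", "happi", "ly", "the", "cat"}

print(subword_tokenize("unhappily", toy_vocab))
print(subword_tokenize("cat", toy_vocab))
print(subword_tokenize("xyz", toy_vocab))
```

A frequent word such as "cat" stays a single token, while "unhappily" is assembled from known pieces. Practical vocabularies usually also include individual characters, so the <UNK> fallback is needed far less often than in this toy setup.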
