Tokenizers convert text into numerical data that models can process.
Two objectives of a good tokenizer:
meaningful representation (i.e., effectiveness)
smallest representation (i.e., efficiency)
Types of tokenizers:
| Type | Definition | Pros | Cons |
| --- | --- | --- | --- |
| Word-based | Split on spaces/punctuation | Easy to implement & use; high interpretability | High memory cost (huge vocabulary); risk of too many unknown tokens |
| Character-based | Split into individual characters | Low memory cost (small vocabulary); few unknown tokens | Low performance (for languages where single characters are not meaningful); long sequences |
| Subword | Keep frequent words; split rare words into meaningful subwords; add a word-separator token at the end of each word | Combines the pros of both methods above | N/A |
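A minimal sketch contrasting the first two granularities (the sentence and the regex are illustrative choices, not from the original notes):

```python
import re

text = "Tokenizers convert texts."

# Word-based: split on spaces, keeping punctuation as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-based: every character (including spaces) becomes a token.
char_tokens = list(text)

print(word_tokens)       # ['Tokenizers', 'convert', 'texts', '.']
print(len(char_tokens))  # 25 tokens for the same sentence
```

Note how the character-based split produces a much longer sequence for the same text, which is the "large sequence length" cost in the table.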
How to build a tokenizer:
Normalization: clean text
Pre-tokenization: split text into words
Modeling: convert words into a token sequence
Post-processing: add special tokens, generate attention mask & token type IDs
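The four steps above can be sketched end to end. This is a toy word-level pipeline, assuming a hypothetical hand-built `vocab` and BERT-style `[CLS]`/`[SEP]`/`[UNK]` special tokens; real tokenizers implement each stage with far more sophistication:

```python
import re

def tokenize(text, vocab):
    # Normalization: lowercase and collapse extra whitespace (a minimal cleanup).
    text = " ".join(text.lower().split())
    # Pre-tokenization: split into words and punctuation.
    words = re.findall(r"\w+|[^\w\s]", text)
    # Modeling: map each word to an ID, falling back to [UNK].
    ids = [vocab.get(w, vocab["[UNK]"]) for w in words]
    # Post-processing: add special tokens and build the attention mask.
    ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    attention_mask = [1] * len(ids)
    return ids, attention_mask

vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4, "!": 5}
print(tokenize("Hello  WORLD!", vocab))  # ([1, 3, 4, 5, 2], [1, 1, 1, 1, 1])
```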
BPE (Byte-Pair Encoding)
Idea:
Get a unique set of words & word counts from corpus.
Build a base vocabulary of all characters in the corpus (byte-level BPE instead starts from all 256 bytes, so any text can be encoded).
Repeatedly merge the most frequent pair of adjacent tokens, until the desired vocabulary size is reached.
Usage: RoBERTa, DeBERTa, BART, GPT, GPT-2
Pros:
No [UNK] tokens (guaranteed with byte-level BPE, since every byte is in the base vocabulary)
Flexible vocabulary management
Balance words and characters
Cons:
Ignores context: a word is split the same way everywhere, which can be suboptimal for words with different meanings in different contexts
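The BPE training loop above can be sketched in a few lines. This toy version (word counts and merge count are illustrative) uses a `</w>` end-of-word marker as the word separator the notes mention:

```python
from collections import Counter

def train_bpe(word_counts, num_merges):
    # Represent each word as a tuple of characters plus an end-of-word marker.
    words = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        words = merged
    return merges

print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3))
```

The learned merge list is the model: at inference time, the same merges are replayed in order on new text.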
WordPiece
Idea:
Get a unique set of words & word counts from corpus.
Build a base vocabulary of all characters in the corpus, adding the prefix “##” to every character that does not start a word.
Repeatedly merge the pair of tokens with the highest score under the formula below, until the desired vocabulary size is reached.
$$
\text{score}=\frac{\text{freq}(\text{pair})}{\text{freq}(\text{part}_1)\times\text{freq}(\text{part}_2)}
$$
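The score above can be computed directly from word counts. A minimal sketch with made-up counts (the words and frequencies are illustrative, not from the notes); it shows how WordPiece, unlike BPE, favors pairs whose individual parts are rare:

```python
from collections import Counter

def pair_scores(words):
    # words: mapping from a tuple of current tokens to its corpus count.
    pair_freq, token_freq = Counter(), Counter()
    for symbols, count in words.items():
        for tok in symbols:
            token_freq[tok] += count
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += count
    # WordPiece score: freq(pair) / (freq(part1) * freq(part2)).
    return {p: f / (token_freq[p[0]] * token_freq[p[1]])
            for p, f in pair_freq.items()}

# "hug" x10 and "hot" x5, pre-split with the "##" continuation prefix.
words = {("h", "##u", "##g"): 10, ("h", "##o", "##t"): 5}
scores = pair_scores(words)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('##o', '##t') 0.2
```

Here `("##o", "##t")` wins even though it is less frequent than `("h", "##u")`, because both of its parts occur rarely on their own; a raw-frequency criterion (BPE) would have ranked the pairs differently.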