Tokenizers convert text into numerical data that models can process.
Two objectives of a good tokenizer:
meaningful representation (i.e., effectiveness)
smallest representation (i.e., efficiency)
Types of tokenizers:
| Type | Definition | Pros | Cons |
| --- | --- | --- | --- |
| Word-based | Split on spaces/punctuation | Easy to implement & use; high interpretability | High memory cost (huge vocabulary); risk of too many unknown tokens |
| Character-based | Split into individual characters | Low memory cost (small vocabulary); few unknown tokens | Low performance (for languages where single characters are not meaningful); long sequences |
| Subword | Keep frequent words; split rare words into meaningful subwords; add a word-separator token at the end of each word | Combines the pros of both methods above | N/A |
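A minimal sketch contrasting the first two granularities (the sentence and the regex are illustrative choices, not from the original notes):

```python
import re

text = "Tokenizers convert texts."

# Word-based: split on spaces, keeping punctuation as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-based: every character (including spaces) becomes a token.
char_tokens = list(text)

print(word_tokens)       # ['Tokenizers', 'convert', 'texts', '.']
print(len(char_tokens))  # 25 tokens for the same sentence
```

Note how the character-based split produces a much longer sequence for the same text, which is the "large sequence length" cost in the table.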
How to build a tokenizer:
Normalization: clean text
Pre-tokenization: split text into words
Modeling: convert words into a token sequence
Post-processing: add special tokens, generate attention mask & token type IDs
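The four steps above can be sketched end to end. This is a toy word-level pipeline, assuming a hypothetical hand-built `vocab` and BERT-style `[CLS]`/`[SEP]`/`[UNK]` special tokens; real tokenizers implement each stage with far more sophistication:

```python
import re

def tokenize(text, vocab):
    # Normalization: lowercase and collapse extra whitespace (a minimal cleanup).
    text = " ".join(text.lower().split())
    # Pre-tokenization: split into words and punctuation.
    words = re.findall(r"\w+|[^\w\s]", text)
    # Modeling: map each word to an ID, falling back to [UNK].
    ids = [vocab.get(w, vocab["[UNK]"]) for w in words]
    # Post-processing: add special tokens and build the attention mask.
    ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    attention_mask = [1] * len(ids)
    return ids, attention_mask

vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4, "!": 5}
print(tokenize("Hello  WORLD!", vocab))  # ([1, 3, 4, 5, 2], [1, 1, 1, 1, 1])
```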
BPE (Byte-Pair Encoding)
Idea:
Get a unique set of words & word counts from corpus.
Build a base vocabulary of all characters in the corpus (byte-level BPE instead starts from all 256 bytes, so any text can be encoded).
Repeatedly merge the most frequent pair of adjacent tokens, until the desired vocabulary size is reached.
Usage: RoBERTa, DeBERTa, BART, GPT, GPT-2
Pros:
No [UNK] tokens (guaranteed with byte-level BPE, since every byte is in the base vocabulary)
Flexible vocabulary management
Balance words and characters
Cons:
Ignores context: a word is split the same way everywhere, which can be suboptimal for words with different meanings in different contexts
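The BPE training loop above can be sketched in a few lines. This toy version (word counts and merge count are illustrative) uses a `</w>` end-of-word marker as the word separator the notes mention:

```python
from collections import Counter

def train_bpe(word_counts, num_merges):
    # Represent each word as a tuple of characters plus an end-of-word marker.
    words = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        words = merged
    return merges

print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3))
```

The learned merge list is the model: at inference time, the same merges are replayed in order on new text.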
WordPiece
Idea:
Get a unique set of words & word counts from corpus.
Build a base vocabulary of all characters in the corpus, adding the prefix “##” to every character that does not start a word.
Repeatedly merge the pair of tokens with the highest score under the formula below, until the desired vocabulary size is reached.
$$
\text{score}=\frac{\text{freq}(\text{pair})}{\text{freq}(\text{part}_1)\times\text{freq}(\text{part}_2)}
$$
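The score above can be computed directly from word counts. A minimal sketch with made-up counts (the words and frequencies are illustrative, not from the notes); it shows how WordPiece, unlike BPE, favors pairs whose individual parts are rare:

```python
from collections import Counter

def pair_scores(words):
    # words: mapping from a tuple of current tokens to its corpus count.
    pair_freq, token_freq = Counter(), Counter()
    for symbols, count in words.items():
        for tok in symbols:
            token_freq[tok] += count
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += count
    # WordPiece score: freq(pair) / (freq(part1) * freq(part2)).
    return {p: f / (token_freq[p[0]] * token_freq[p[1]])
            for p, f in pair_freq.items()}

# "hug" x10 and "hot" x5, pre-split with the "##" continuation prefix.
words = {("h", "##u", "##g"): 10, ("h", "##o", "##t"): 5}
scores = pair_scores(words)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('##o', '##t') 0.2
```

Here `("##o", "##t")` wins even though it is less frequent than `("h", "##u")`, because both of its parts occur rarely on their own; a raw-frequency criterion (BPE) would have ranked the pairs differently.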