Tokenizer

Tokenizers convert text into numerical data that models can process.

Two objectives of a good tokenizer:

  • meaningful representation (i.e., effectiveness)
  • smallest representation (i.e., efficiency)

Types of tokenizers:

Word-based
  • Definition: split on spaces/punctuation
  • Pros: easy to implement & use; high interpretability
  • Cons: high memory cost (huge vocabulary); risk of too many unknown tokens

Character-based
  • Definition: split into characters
  • Pros: low memory cost (small vocabulary); few unknown tokens
  • Cons: low performance (for languages where characters are not meaningful); long sequences

Subword
  • Definition: keep frequent words; split rare words into meaningful subwords; add a word separator token at the end of each word
  • Pros: combines the pros of both methods above
  • Cons: N/A

How to build a tokenizer:

  1. Normalization: clean text
  2. Pre-tokenization: split text into words
  3. Modeling: convert words into a token sequence
  4. Post-processing: add special tokens, generate attention mask & token type IDs
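The four stages above can be sketched end-to-end with a toy whitespace tokenizer. The vocabulary, function names, and normalization rules here are illustrative assumptions, not a real library's API:

```python
# Toy sketch of the 4-stage tokenizer pipeline; the vocabulary and
# normalization rules are hypothetical, chosen only for illustration.
import re

VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
         "hello": 4, "world": 5, "!": 6}

def normalize(text):
    # 1. Normalization: lowercase and collapse whitespace
    return re.sub(r"\s+", " ", text.lower()).strip()

def pre_tokenize(text):
    # 2. Pre-tokenization: split into words and punctuation
    return re.findall(r"\w+|[^\w\s]", text)

def model(words):
    # 3. Modeling: map each word to a token ID ([UNK] if unseen)
    return [VOCAB.get(w, VOCAB["[UNK]"]) for w in words]

def post_process(ids):
    # 4. Post-processing: add special tokens, build attention mask
    #    and token type IDs
    ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return {"input_ids": ids,
            "attention_mask": [1] * len(ids),
            "token_type_ids": [0] * len(ids)}

def tokenize(text):
    return post_process(model(pre_tokenize(normalize(text))))

print(tokenize("Hello   World!"))
# input_ids: [1, 4, 5, 6, 2]
```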

 

BPE (Byte-Pair Encoding)

Idea:

  1. Get a unique set of words & word counts from the corpus.
  2. Build a base vocabulary of all characters in the corpus (preferably covering all ASCII characters).
  3. Repeatedly merge the pair of adjacent tokens with the max count, until the desired vocabulary size is reached.
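The merge loop above can be sketched in a few lines. The toy word counts below are illustrative, and `train_bpe` is a hypothetical helper, not a real library function:

```python
# Minimal BPE training sketch over toy word counts (step 1 is assumed done).
from collections import Counter

def train_bpe(word_counts, num_merges):
    # Step 2: each word starts as a sequence of base characters
    words = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    # Step 3: repeatedly merge the adjacent pair with max count
    for _ in range(num_merges):
        pairs = Counter()
        for toks, c in words.items():
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge inside every word
        new_words = {}
        for toks, c in words.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(toks[i]); i += 1
            new_words[tuple(out)] = new_words.get(tuple(out), 0) + c
        words = new_words
    return merges

counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
print(train_bpe(counts, 3))
# learned merges: [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```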

Usage: RoBERTa, DeBERTa, BART, GPT, GPT-2

Pros:

  • Guarantee no [UNK]
  • Flexible vocabulary management
  • Balance words and characters

Cons:

  • Ignore context → suboptimal splits for words with different meanings in different contexts

 

WordPiece

Idea:

  1. Get a unique set of words & word counts from the corpus.
  2. Build a base vocabulary of all characters in the corpus, but add a prefix “##” to every character that is not word-initial.
  3. Repeatedly merge the pair of adjacent tokens with the highest score, until the desired vocabulary size is reached: $$ \text{score}=\frac{\text{freq}(pair)}{\text{freq}(part_1)\times\text{freq}(part_2)} $$
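The score formula can be computed directly; the frequencies below are made-up numbers just to show the effect of the denominator:

```python
# WordPiece merge score: score = freq(pair) / (freq(part1) * freq(part2))
# All frequencies here are hypothetical, for illustration only.
def wordpiece_score(pair_freq, left_freq, right_freq):
    return pair_freq / (left_freq * right_freq)

# A frequent pair of very frequent parts can score LOWER than a rarer
# pair whose parts are themselves rare -- this is how WordPiece differs
# from BPE's raw pair counts.
common = wordpiece_score(pair_freq=20, left_freq=50, right_freq=60)
rare = wordpiece_score(pair_freq=5, left_freq=6, right_freq=7)
print(common, rare)
assert rare > common
```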

Usage: BERT, DistilBERT, MobileBERT, Funnel Transformers, MPNET

Pros:

  • Prioritize merging parts that are individually rare in the corpus, rather than just the most frequent pairs

Cons:

  • Only save the final vocabulary, not the merge rules → label an entire word as [UNK] when any part is not in the vocabulary
  • Ignore context → suboptimal splits for words with different meanings in different contexts

 

Unigram

Idea:

  1. Get a large vocabulary (via most common substrings in pre-tokenized words, or BPE on the initial corpus with a large vocabulary size).
  2. Compute a loss over the corpus, treating each token as independent (a unigram language model).
  3. Repeatedly remove the tokens whose removal increases the loss the least, until the desired vocabulary size is reached (base characters are never removed, so any word stays tokenizable).

Usage: T5, ALBERT, mBART, XLNet