Language Models

For consistency, notations strictly follow the original papers.

PLMs

This section covers basic transformer-based LM architectures that are still widely used when LLMs are unavailable.

GPT

Ref: Generative Pre-Training

Ideas:

  • transformer decoders only
  • unsupervised pretraining on a diverse unlabeled corpus
  • supervised finetuning on each specific task
  • (in practice) [EOS] token as separator

Unsupervised pretraining:

  • Objective: Autoregressive MLE on tokens $$ L_1(\mathcal{U})=\sum_i\log P(u_i|u_{i-k},\cdots,u_{i-1};\Theta) $$
    • $\mathcal{U}$ : unlabeled corpus
    • $k$ : context window size
    • $\Theta$ : param set
  • Structure: Transformer decoders $$ h_0=UW_e+W_p,\quad h_l=\mathrm{transformer\_block}(h_{l-1})\ \forall l\in[1,n],\quad P(u)=\mathrm{softmax}(h_nW_e^T) $$
    • $W_e$ : token embedding matrix
    • $W_p$ : position embedding matrix
    • $U=(u_{-k},\cdots,u_{-1})$ : context vector of tokens
    • $n$ : #layers
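The pretraining objective and decoder structure above can be sketched in a toy numpy example. All dimensions and weights here are hypothetical, and the $n$ transformer blocks are elided so the embedding/unembedding algebra and the autoregressive log-likelihood stay visible:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, k = 50, 16, 8                 # toy sizes (hypothetical)

W_e = rng.normal(0, 0.02, (vocab, d_model))   # token embedding matrix W_e
W_p = rng.normal(0, 0.02, (k, d_model))       # position embedding matrix W_p

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def forward(tokens):
    """h_0 = U W_e + W_p; transformer blocks elided; P(u) = softmax(h_n W_e^T)."""
    U = np.eye(vocab)[tokens]                  # one-hot context vectors
    h = U @ W_e + W_p[: len(tokens)]           # h_0
    # ... n applications of transformer_block would go here ...
    return softmax(h @ W_e.T)                  # next-token distribution per position

def l1(tokens):
    """Autoregressive MLE: sum_i log P(u_i | u_{i-k}, ..., u_{i-1})."""
    probs = forward(tokens[:-1])               # position i predicts token i+1
    return sum(np.log(probs[i, tokens[i + 1]]) for i in range(len(tokens) - 1))

toks = rng.integers(0, vocab, size=k).tolist()
print(l1(toks))                                # a (negative) total log-likelihood
```

Training would maximize `l1` (equivalently, minimize its negation) over the corpus by gradient ascent on $\Theta = \{W_e, W_p, \ldots\}$.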

 

Supervised finetuning:

  • Objective (basic): depends on the specific task, generally MLE if discriminative $$ L_2(\mathcal{C})=\sum_{(x,y)}\log P(y|x^1,\cdots,x^m) $$
    • $\mathcal{C}$ : labeled dataset
    • $x^1,\cdots,x^m$ : seq of input tokens
    • $y$ : label
  • Objective (hybrid): include LM as auxiliary objective $$ L_3(\mathcal{C})=L_2(\mathcal{C})+\lambda L_1(\mathcal{C}) $$
    • $\lambda$ : weight
    • Pros:
      • improve generalization of supervised model
      • accelerate convergence
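A minimal sketch of the hybrid objective, assuming precomputed logits for the supervised head and for each LM position (function names and inputs are illustrative, not from the paper):

```python
import numpy as np

def log_softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def hybrid_loss(task_logits, y, lm_logits, next_tokens, lam=0.5):
    """L3 = L2 + lambda * L1: supervised log-likelihood of the label plus a
    weighted auxiliary LM log-likelihood (both terms are to be maximized)."""
    l2 = log_softmax(task_logits)[y]                                      # L2 term
    l1 = sum(log_softmax(z)[t] for z, t in zip(lm_logits, next_tokens))   # L1 term
    return l2 + lam * l1

# one toy example: 2-class head, one LM position over a 3-token vocab
v = hybrid_loss([2.0, 0.5], 0, [[1.0, 0.0, 0.0]], [0], lam=0.5)
```

Setting `lam=0` recovers the basic objective $L_2$; the paper's reported benefit of the auxiliary term is better generalization and faster convergence.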

 

GPT-2

Ref: GPT-2

Ideas:

  • Larger (#params: 1.5B vs. 117M in GPT)
  • More diverse data
  • Task-agnostic: learn supervised downstream tasks without explicit supervision
  • Zero-shot capability

 

GPT-3

Ref: GPT-3

Ideas:

  • Even larger (#params: 175B vs. 1.5B in GPT-2)
  • Even more diverse data
  • Even more task-agnostic
  • Few-shot capability
  • Prompting

 

BERT

Ref: Bidirectional Encoder Representations from Transformers

Ideas:

  • transformer encoders only
  • bidirectional context in pretraining
  • (in practice) [CLS] token first, [SEP] token as separator

Input representation:

  • Each input token is represented as the sum of its token embedding, segment embedding, and position embedding.

 

Unsupervised pretraining:

  • Masked Language Modeling (MLM): randomly mask some words (15% in original experiment) and predict them.
    • Problem: [MASK] token does not exist in downstream tasks
    • Solution: further randomness - if a token is chosen to be masked, replace with
      • [MASK] 80% of the time
      • random token 10% of the time
      • itself 10% of the time
  • Next Sentence Prediction (NSP): predict whether sentence $B$ immediately follows sentence $A$, in order to understand relationships between sentences.
    • When selecting training samples,
      • 50% of the time $B$ is $A$’s actual next sentence, labeled as IsNext.
      • 50% of the time $B$ is a random sentence from the corpus, labeled as NotNext.
      • Predict on the [CLS] head.
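The 80/10/10 masking rule above can be sketched as follows (the function name, vocabulary handling, and `[MASK]` string are illustrative assumptions, not the original implementation):

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, p_mask=0.15, seed=None):
    """Pick ~15% of positions as prediction targets; at each chosen position,
    replace with [MASK] 80% of the time, a random token 10%, keep it 10%."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p_mask:
            targets[i] = tok              # model must predict the original token
            r = rng.random()
            if r < 0.8:
                out[i] = MASK             # 80%: [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: random token
            # else: 10%: keep the token unchanged
    return out, targets
```

Note that a kept-unchanged position is still a prediction target; only the input is left intact, which is what closes the pretrain/finetune mismatch around `[MASK]`.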

 

RoBERTa

Ref: Robustly Optimized BERT Approach

Ideas:

  • Enhanced BERT pretraining:
    • Dynamic masking: generate a new masking pattern each time a seq is fed to the model, instead of reusing the same masking fixed during data preprocessing.
    • Remove NSP: remove NSP loss + use FULL-SENTENCES (packing seqs from multiple docs)
    • Large mini-batches: train with much larger mini-batches (up to 8K sequences), improving both perplexity and end-task performance.
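The dynamic-masking idea can be sketched minimally: instead of one mask pattern fixed at preprocessing time, draw a fresh set of masked positions on every pass over a sequence (names here are illustrative):

```python
import random

def dynamic_masks(tokens, epochs, p_mask=0.15):
    """Static masking fixes masked positions once during preprocessing; dynamic
    masking draws a fresh pattern every time the sequence is fed to the model."""
    rng = random.Random()   # unseeded on purpose: each pass differs
    return [
        {i for i in range(len(tokens)) if rng.random() < p_mask}
        for _ in range(epochs)
    ]
```

Over many epochs the model therefore sees many different corruptions of the same sequence, rather than memorizing one fixed pattern.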

 

 

LLMs

This section covers widely used LLMs on the market.

GPT-3.5

GPT-4

Google

LaMDA

PaLM

Meta

OPT

LLaMA