Language Models

For consistency, notations strictly follow the original papers.

PLMs

This section includes basic transformer-based LM structures that are still widely used when LLMs are unavailable.

Ideas:

Unsupervised pretraining:

Objective: Autoregressive MLE on tokens $$ L_1(\mathcal{U})=\sum_i\log P(u_i|u_{i-k},\cdots,u_{i-1};\Theta) $$
- : unlabeled corpus
- : context window size
- : param set
Structure: Transformer decoders
- : token embedding matrix
- : position embedding matrix
- : context vector of tokens
- : #layers

Supervised finetuning:

Objective (basic): depends on the specific task, generally MLE if discriminative $$ L_2(\mathcal{C})=\sum_{(x,y)}\log P(y|x^1,\cdots,x^m) $$
- : labeled dataset
- : seq of input tokens
- : label
Objective (hybrid): include LM as auxiliary objective $$ L_3(\mathcal{C})=L_2(\mathcal{C})+\lambda L_1(\mathcal{C}) $$
- : weight
- Pros:
  - improve generalization of supervised model
  - accelerate convergence

Ideas:

Larger (#params: 1.5B > 117M)
More diverse data
Task-agnostic: learn supervised downstream tasks without explicit supervision
Zero-shot capability

Ideas:

Ideas:

Input representation:

Unsupervised pretraining:

Masked Language Modeling (MLM): randomly mask some words (15% in original experiment) and predict them.
- Problem: [MASK] token does not exist in downstream tasks
- Solution: further randomness - if a token is chosen to be masked, replace with
  - [MASK] 80% of the time
  - random token 10% of the time
  - itself 10% of the time
Next Sentence Prediction (NSP): predict whether sentence , in order to understand relationships between sentences.
- When selecting training samples,
  - 50% of the time ’s next sentence, labeled as IsNext.
  - 50% of the time ’s next sentence, labeled as NotNext.
  - Predict on [CLS] head.

Ideas:

Enhanced BERT pretraining:
- Dynamic masking: generate masking when feeding a seq to the model, instead of using same masking from data processing.
- Remove NSP: remove NSP loss + use FULL-SENTENCES (packing seqs from multiple docs)
- Large mini-batches: