Available/Unavailable: What is available/unavailable? How much is available/unavailable?
Annotation: Is the data annotated? How good are the annotations? How expensive is it to annotate the rest? How do we resolve annotators' disagreements? Is auto-annotation feasible (e.g., ChatGPT, rule-based generation, etc.)?
Privacy: What user data can we access? How do we obtain it? Can we use online/periodic data? Do we need anonymization?
Logistics: Where is the data (local/cloud)? What data structures? How big? What biases does it contain?
higher interpretability, fewer computational resources, faster training, and easier debugging/retraining
Why does it matter?
Model Selection: If a simpler model reaches performance comparable to a complex model, the simpler model wins.
Feature Selection: If a smaller feature set reaches performance comparable to a larger feature set, the smaller set wins.
Generalization: Overfitting is a core issue in ML. Simpler hyperparameter configurations are the best starting points: they reduce overfitting and generalize better to unseen data.
Ensemble Methods: In many cases, an aggregate of simpler models outperforms a single complex model.
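As a quick sketch of the generalization point, the snippet below fits a simple (degree-1) and a complex (degree-9) polynomial to noisy linear data and compares error on held-out points. The data, seed, and degrees are illustrative assumptions, not from any real system:

```python
import numpy as np

# Occam's Razor sketch: simple vs. complex fit on noisy linear data.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.size)

x_tr, y_tr = x[::2], y[::2]    # even indices -> training split
x_te, y_te = x[1::2], y[1::2]  # odd indices  -> held-out split

heldout_mse = {}
for degree in (1, 9):
    coefs = np.polyfit(x_tr, y_tr, degree)
    heldout_mse[degree] = float(np.mean((np.polyval(coefs, x_te) - y_te) ** 2))
    print(f"degree {degree}: held-out MSE = {heldout_mse[degree]:.4f}")
```

The degree-9 polynomial can nearly interpolate the 10 noisy training points, so it typically does worse on the held-out split than the simple linear fit.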
Why deep learning if Occam’s Razor holds?
Big Data: Complex architectures benefit more from larger datasets
Computational Resources: GPU hardware made large-scale training practical (thanks, Nvidia)
Transformer: a highly scalable architecture for sequence modeling (thanks, Google)
Transfer Learning: We can pretrain a gigantic general-purpose model and finetune it on specialized tasks to reach significantly better performance
Define Objective: improving click-through rates, increasing sign-up rates, etc.
Significance level α: the threshold for deciding whether the observed difference between control & treatment is statistically significant
α (i.e., the Type I error rate): probability of rejecting H₀ when it is true
Common values: 0.05 and 0.01
A lower α makes it more challenging to detect a difference
Power (1 − β): probability of rejecting H₀ when it is false (i.e., the ability to detect a meaningful difference when it exists)
β (i.e., the Type II error rate): probability of failing to reject H₀ when it is false
Common value: 80%
Higher power requires a larger sample size to achieve
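The meaning of α can be sanity-checked with a small A/A simulation (all numbers illustrative, stdlib only): both groups share the same true conversion rate, so every "significant" result is a Type I error, and the false-positive rate should land near α:

```python
import random
from statistics import NormalDist

# A/A simulation: run many experiments where control and "treatment"
# are drawn from the same distribution, and count how often a
# two-proportion z-test rejects H0. Expect roughly ALPHA of the runs.
random.seed(42)
ALPHA = 0.05
N = 2000          # users per group (illustrative)
TRUE_RATE = 0.10  # same conversion rate in both groups
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

false_positives = 0
trials = 500
for _ in range(trials):
    a = sum(random.random() < TRUE_RATE for _ in range(N))
    b = sum(random.random() < TRUE_RATE for _ in range(N))
    p_pool = (a + b) / (2 * N)
    se = (p_pool * (1 - p_pool) * (2 / N)) ** 0.5
    z = abs(a / N - b / N) / se
    if z > z_crit:
        false_positives += 1

print(f"false positive rate: {false_positives / trials:.3f}")  # near ALPHA
```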
Create Variations: generate two or more versions of the element we want to test
e.g., 2 different designs of a button: blue round button → control group A; green square button → treatment group B
Calculate Traffic: calculate required sample size per variation
p (baseline conversion rate): the occurrence rate of the desired event in the control group
MDE (Minimum Detectable Effect): the smallest difference in conversion rate that we want to be able to detect as statistically significant
e.g., if the baseline conversion rate is p and we want to detect a minimum relative improvement of r, then the absolute MDE is p × r (for instance, p = 5% with r = 10% gives MDE = 0.5%)
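A minimal sketch of the sample-size calculation, using the standard normal-approximation formula for a two-proportion test; the 5% baseline and 0.5% absolute MDE below are illustrative, and dedicated tools (e.g., statsmodels or online calculators) may give slightly different numbers:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variation for a two-proportion test.

    p_baseline: baseline conversion rate in the control group
    mde_abs:    minimum detectable effect as an ABSOLUTE rate difference
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p_treat = p_baseline + mde_abs
    variance = p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)

# Illustrative: 5% baseline, 0.5% absolute MDE, alpha = 0.05, power = 0.8
n = sample_size_per_group(0.05, 0.005)
print(f"required sample size per group: {n}")  # on the order of ~31k
```

Note how quickly the requirement grows: halving the MDE roughly quadruples the sample size, since n scales with 1/MDE².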
Both groups should be statistically similar in terms of characteristics for a fair comparison
Splitting: randomly assign users into control & treatment groups
User-level: The user consistently experiences the same variation throughout their entire interaction.
Pros: useful when the tested variations are expected to have a long-lasting impact on user experience, reducing potential biases or confusion due to inconsistency
Request-level: The system randomly determines the variation to be shown at each request made by the user, regardless of previous assignments.
Pros: useful for measuring immediate or session-specific effects of the tested variations, allowing us to capture any potential context-dependent or short-term effects
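The two splitting strategies can be sketched as follows; the experiment name, user ID, and variant labels are hypothetical, and a hash-based bucket is one common way to get stable user-level assignment:

```python
import hashlib
import random

def user_level_assignment(user_id: str, experiment: str, variants=("A", "B")):
    """User-level split: hash the (experiment, user) pair into a bucket,
    so the same user always sees the same variant for this experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def request_level_assignment(variants=("A", "B")):
    """Request-level split: re-randomize on every request."""
    return random.choice(variants)

print(user_level_assignment("user_42", "button_color"))  # stable across calls
print(request_level_assignment())                        # may change per call
```

Keying the hash on both the experiment name and the user ID keeps assignments independent across concurrent experiments.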
Measurement & Analysis:
Track & record user interactions, events, or conversions for both the control and treatment groups.
Compare the performance of control & treatment groups using statistical analysis.
Determine if there is a statistically significant difference in the metrics we are measuring between the tested variations.
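A minimal sketch of the analysis step, assuming a two-sided two-proportion z-test on conversion counts (the counts below are illustrative, not real data):

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test comparing control (A) vs treatment (B).

    conv_*: number of conversions; n_*: number of users in each group.
    Returns (z, p_value); reject H0 at level alpha if p_value < alpha.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative numbers: 5.0% vs 5.8% conversion on 10k users per group
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=580, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # significant at alpha = 0.05 if p < 0.05
```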