Data
Data pipelines in ML follow one of the two patterns below:
- ETL: Extract → Transform → Load
- ELT: Extract → Load → Transform
This page follows ETL and only covers the basics because I’m not a data engineer.
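For intuition, here is a minimal ETL sketch in Python; the file names, table name, and cleaning rules are made-up assumptions, not a prescribed implementation.

```python
# Minimal ETL sketch: extract from a CSV, transform with pandas, load into SQLite.
# The file names, table name, and cleaning rules are illustrative assumptions.
import sqlite3

import pandas as pd


def extract(path: str) -> pd.DataFrame:
    return pd.read_csv(path)                      # Extract: read raw data


def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                     # Transform: remove duplicate samples
    df["amount"] = df["amount"].fillna(0.0)       # ...and impute a (hypothetical) numeric column
    return df


def load(df: pd.DataFrame, db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:        # Load: write into a database table
        df.to_sql("transactions", conn, if_exists="replace", index=False)


# load(transform(extract("raw_transactions.csv")), "warehouse.db")
```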
Data Collection
Sources:
- Internal: transaction logs, customer databases, sensor data, operational data, etc.
- External: public datasets, third-party purchases, social media feeds, open-source platforms, user reviews, etc.
- Synthetic: auto-generated data that simulate the real world; used when real data are unavailable due to privacy concerns or rarity
Procedure:
- Define requirements
- Establish scalable infra
- Quality assurance
- Continuous monitoring
Methods:
- Direct: surveys, observations, experiments, purchases, etc.; tailored for ML tasks; commonly used in R&D
- Indirect: scraping, database access, APIs, etc.; require preprocessing
- Crowdsourcing: gather a large group of people to generate data; commonly used for annotations
Challenges: volume, variety, veracity, velocity, ethics, etc.
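As a toy illustration of indirect collection through an API, here is a hedged sketch; the URL, pagination scheme, and response format are assumptions about a hypothetical service, not a real one.

```python
# Hypothetical sketch of indirect collection via a paginated REST API.
# The URL, pagination scheme, and JSON schema are assumptions, not a real service.
import requests


def collect(url: str, pages: int = 3) -> list:
    records = []
    for page in range(1, pages + 1):
        resp = requests.get(url, params={"page": page}, timeout=10)
        resp.raise_for_status()          # basic quality assurance: fail loudly on bad responses
        records.extend(resp.json())      # assume each page returns a JSON list of records
    return records


# raw = collect("https://example.com/api/reviews")
```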
Data Cleaning
Procedure (unordered):
- Handle missing data: values may be missing at random or not at random
- Deletion (only when the proportion of missing values is negligible)
- Imputation (only applicable when values are missing at random; imputed values are likely non-factual)
- Remove unwanted samples/features: duplicates, irrelevant samples/features, etc.
- Fix structural errors: typos, mislabels, inconsistency, etc.
- Filter unwanted outliers: statistical methods (Z-score, IQR [interquartile range], etc.), domain-specific filters
- remove them if the model is non-robust; keep them if the model is robust
- Handle text data: lowercasing, punctuation removal, typo correction, stopword removal, lemmatization (reduce words to their root form), etc. (see the sketch at the end of this section)
- Handle image data:
- Size: resizing, cropping, padding, etc.
- Color: grayscale conversion, histogram equalization (enhance contrast), color space transformation (e.g., RGB → HSV)
- Noise: Gaussian blur, median blur, denoising, artifact removal (e.g., text overlays, watermarks, etc.)
- File: ensure uniform file format
Challenges: scalability, unstructured data, info loss due to overcleaning, etc.
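Below is a minimal sketch of the text-cleaning steps above, assuming NLTK is installed and its stopwords/wordnet corpora can be downloaded; the sample sentence and expected output are only illustrative.

```python
# Minimal sketch of text cleaning: lowercasing, punctuation removal,
# stopword removal, lemmatization. Assumes NLTK and its corpora are available.
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)   # one-time corpus downloads
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def clean_text(text: str) -> str:
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]          # remove stopwords
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]                 # reduce words to root form
    return " ".join(tokens)


print(clean_text("The cats were chasing mice, again!"))   # e.g., "cat chasing mouse"
```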
Data Imputation
Procedure (like EM):
- Estimate the missing data.
- Estimate the params for imputation.
- Repeat.
Types:
- Simple imputation: zero, majority, mean (usually best)
- assume no multicollinearity
- Complex imputation:
- Regression: fit missing feature on other features (assume multicollinearity)
- NOT necessarily better than simple imputations because assumptions can fail
- Indicator addition: add 0-1 indicators for each feature on whether this feature is missing (0: present; 1: absent)
- treat the fact “missing” as an informative value
- Cons: doubles the number of features
- Category addition: add one more category called “missing” to represent missing values
- treat the fact “missing” as an informative value
- Pros: straightforward & much better than doubling the feature count with indicators
- Unsupervised Learning: used if there are lots of categories and/or features
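A small scikit-learn sketch contrasting simple (mean) imputation with a missing indicator against regression-style iterative imputation; the toy matrix is made up.

```python
# Sketch: simple (mean) vs. regression-style imputation on a toy matrix with NaNs.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [7.0, 8.0]])

# Simple imputation: fill with the column mean, plus 0/1 "is missing" indicators.
simple = SimpleImputer(strategy="mean", add_indicator=True)
print(simple.fit_transform(X))

# Complex imputation: iteratively regress each feature on the others
# (EM-like: estimate missing values, re-estimate the regressors, repeat).
iterative = IterativeImputer(max_iter=10, random_state=0)
print(iterative.fit_transform(X))
```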
Data Transformation
Standardization
Pros:
- remove the mean and/or scale data to unit variance (i.e., $x' = \frac{x - \mu}{\sigma}$)
Cons:
- highly sensitive to outliers (because they greatly impact empirical mean & std)
- destroy sparsity (because center is shifted)
Normalization
Pros:
- scale individual samples to their unit norms
- can choose $\ell_1$/$\ell_2$/max as the norm (i.e., $x' = x / \|x\|$)
Min-Max Scaling
Pros:
- can scale each feature into a range of your choice, default $[0, 1]$ (i.e., $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$)
Cons:
- highly sensitive to outliers (because they greatly impact empirical min & max)
- destroy sparsity (because center is shifted)
Max-Abs Scaling
Pros:
- preserve the sign of each entry
- preserve sparsity
- scale each feature into the range $[-1, 1]$ ($[-1, 0]$ for negative entries, $[0, 1]$ for positive entries) via $x' = x / \max(|x|)$
Cons:
- highly sensitive to outliers
Robust Scaling
Pros:
- robust to outliers (centers on the median & scales by the IQR)
Cons:
- destroy sparsity (because center is shifted)
Quantile Transform
- Original form: $x' = G^{-1}(F(x))$
- $G^{-1}$: quantile func of the target distribution (i.e., PPF [percent-point func], inverse of CDF)
- $F$: empirical CDF
- Uniform outputs: $x' = F(x)$
- Gaussian outputs: $x' = \Phi^{-1}(F(x))$
Pros:
- robust to outliers (literally collapse them)
Cons:
- distort linear correlations between diff features
- only work well with sufficiently large #samples
Power Transform
Yeo-Johnson Transform
$$x^{(\lambda)} = \begin{cases} \frac{(x+1)^{\lambda} - 1}{\lambda} & \lambda \neq 0,\ x \geq 0 \\ \ln(x+1) & \lambda = 0,\ x \geq 0 \\ -\frac{(1-x)^{2-\lambda} - 1}{2-\lambda} & \lambda \neq 2,\ x < 0 \\ -\ln(1-x) & \lambda = 2,\ x < 0 \end{cases}$$
- $\lambda$: determined by MLE
Box-Cox Transform
$$x^{(\lambda)} = \begin{cases} \frac{x^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln x & \lambda = 0 \end{cases}$$
- only applicable when $x > 0$
Pros:
- map data to an approximately Gaussian distribution (i.e., stabilize variance and minimize skewness)
- useful against heteroskedasticity (i.e., non-const variance).
- Sklearn’s PowerTransformer standardizes the output to zero mean & unit variance by default.
Cons:
- distort linear correlations between diff features
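A quick comparison sketch of the scalers/transforms above with scikit-learn; the toy column with a single outlier and the parameter choices are only illustrative.

```python
# Sketch comparing scalers/transforms on one feature with an outlier (1000).
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    PowerTransformer,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])   # 1000 is the outlier

for scaler in [
    StandardScaler(),                                   # (x - mean) / std
    MinMaxScaler(),                                     # map to [0, 1]
    MaxAbsScaler(),                                     # divide by max |x|
    RobustScaler(),                                     # (x - median) / IQR
    QuantileTransformer(n_quantiles=5, output_distribution="normal"),
    PowerTransformer(),                                 # Yeo-Johnson, standardized output
]:
    print(type(scaler).__name__, scaler.fit_transform(X).ravel())
```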
Categorical features
- One-Hot Encoding: convert each category into a 0-1 feature (excluding a dummy)
- better for nominal data (i.e., no inherent order among categories)
- Label Encoding: convert each category into a numerical label
- better for ordinal data (i.e., inherent order among categories)
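A sketch of both encodings with scikit-learn; the toy features are made up, and `sparse_output` is the parameter name in scikit-learn >= 1.2 (older versions call it `sparse`).

```python
# Sketch: one-hot encoding for a nominal feature, ordinal/label encoding for an ordinal one.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])   # nominal feature
sizes = np.array([["S"], ["M"], ["L"], ["M"]])                 # ordinal feature

# One-hot: one 0/1 column per category; drop="first" excludes a dummy column.
print(OneHotEncoder(drop="first", sparse_output=False).fit_transform(colors))

# Label/ordinal encoding: map categories to integers in their inherent order.
print(OrdinalEncoder(categories=[["S", "M", "L"]]).fit_transform(sizes))
```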
Data Loading
NOTE: Loading is significantly more complex IRL compared to school projects.
Procedure:
- Choose storage:
- Databases: SQL/Relational (only structured data), NoSQL (better for unstructured data)
- Data Warehouses: best for analytical tasks
- Data Lakes: raw storage of big data from various sources
- Cloud Storage
- Validate: check schema, data quality, integrity, etc.
- Format: encoding, batching (for big data), raw saving
- Load:
- Bulk loading: load data in large chunks
- minimize logging & transaction overhead
- require system downtime
- Incremental loading: load data in small increments
- use timestamps/logs to track changes
- minimize system disruption
- best for real-time data processing
- Streaming: load data continuously in real time
- Optimize (i.e., reduce the data volume scanned per query for faster execution)
- Indexing: set a primary/secondary index on features with unique/repeated values to retrieve them faster
- best for tables with low data churn (i.e., with fewer inserts/deletes/updates)
- Partitioning: divide a database/table into distinct sections/tables to query them independently
- best for storing older records
- Parallel Processing
- Handle errors
- Handle security: encryption, access control, etc.
- Verify: audit using test queries, reconcile loaded data with source data, etc.
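A sketch of incremental loading with a timestamp watermark, using SQLite; the table schema, column order, and toy rows are illustrative assumptions.

```python
# Sketch of incremental loading into SQLite with a timestamp watermark.
import sqlite3


def incremental_load(conn: sqlite3.Connection, source_rows: list) -> None:
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, ts TEXT, payload TEXT)")
    # Watermark: the latest timestamp already loaded (empty string on the first run).
    last_ts = cur.execute("SELECT MAX(ts) FROM events").fetchone()[0] or ""
    new_rows = [row for row in source_rows if row[1] > last_ts]   # load only the changes
    cur.executemany("INSERT INTO events VALUES (?, ?, ?)", new_rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
incremental_load(conn, [(1, "2024-01-01", "a"), (2, "2024-01-02", "b")])
incremental_load(conn, [(2, "2024-01-02", "b"), (3, "2024-01-03", "c")])  # only row 3 is new
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])          # 3
```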
Indexing
Why: faster retrieval
What: create quick lookup paths to access the data
How: Each index stores the values of specific columns & the locations of the corresponding rows. Indices are stored as B-trees/B+ trees.
- B-tree: balanced tree where each node contains multiple keys sorted in order & pointers to child nodes
- B+ tree: a B-tree where all records are stored at the leaf level, with leaf nodes linked to one another
Types:
- Single-column Indexes: created on only one column; used when queries frequently access/filter based on that specific column
- Composite Indexes (Multi-column Indexes): created on two/more columns; used when queries frequently filter/sort based on these columns in combination
- Unique Indexes: ensure all index values are unique; used as a primary key
- Full-text Indexes: allow complex queries on unstructured texts (e.g., search engines, document search)
- Spatial Indexes: used when queries do geospatial operations
Pros:
- fast retrieval
- auto sort (useful for fetching & presenting data)
- high time efficiency (reduce #disk accesses required during queries)
Cons:
- low space efficiency (require additional space to store indices)
- high maintenance complexity
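A sketch of single-column, composite, and unique indexes using SQLite from Python; the table and column names are made up for illustration.

```python
# Sketch: single-column, composite, and unique indexes in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id INTEGER, created_at TEXT, email TEXT)")

conn.execute("CREATE INDEX idx_orders_user ON orders(user_id)")                    # single-column
conn.execute("CREATE INDEX idx_orders_user_date ON orders(user_id, created_at)")   # composite
conn.execute("CREATE UNIQUE INDEX idx_orders_email ON orders(email)")              # unique

# EXPLAIN QUERY PLAN shows whether the optimizer would use an index for a query.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42").fetchall())
```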
Sharding (horizontal partitioning)
Why: scale databases horizontally
What: distribute data across multiple servers/locations, where each shard is an independent database and all shards form a single logical database
How: use a specific (set of) columns as a shard key that determines how the data is distributed across the shards
- Hash-based: use a hash function to determine which shard will store a given data row; useful for even data division
- Range-based: use ranges of shard key values to divide data; useful for numerical data division
- List-based: use predefined lists of shard keys to divide; useful for categorical data division
Where: web apps, real-time analytics, game/media services, etc.
Pros:
- horizontal scalability
- high availability
Cons:
- implementation complexity
- high maintenance complexity
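A toy sketch of hash-based shard routing; the shard count and the choice of shard key are illustrative assumptions.

```python
# Sketch of hash-based sharding: hash the shard key to pick one of N shards.
import hashlib

NUM_SHARDS = 4


def shard_for(user_id: str) -> int:
    digest = hashlib.md5(user_id.encode()).hexdigest()   # stable hash of the shard key
    return int(digest, 16) % NUM_SHARDS                  # map the hash to a shard number


for uid in ["alice", "bob", "carol", "dave"]:
    print(uid, "-> shard", shard_for(uid))
```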
Parallel Processing
Concepts:
- Core: A processor has multiple cores, each of which can execute instructions independently.
- Thread: A core can run multiple threads, each of which can execute sub-tasks independently.
- Memory:
- Shared memory architectures allow multiple cores to access the same memory space.
- Distributed memory architectures require communication between cores.
Types:
- Data Parallelism: divide data into smaller chunks & process each chunk simultaneously on different cores
- Task Parallelism: execute different tasks of one program in parallel
Methods:
- Multithreading: C/C++ OpenMP, Python threading libraries, etc.
- GPGPU (General-Purpose computing on Graphics Processing Units): use CUDA and OpenCL to perform highly parallelized computing more efficiently than CPUs
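A small data-parallelism sketch; it uses multiprocessing rather than threads because CPython's GIL keeps CPU-bound threads from running truly in parallel, and the per-chunk work is a made-up stand-in.

```python
# Sketch of data parallelism: split the data into chunks and process each chunk
# in a separate worker process (one per core).
from multiprocessing import Pool


def process_chunk(chunk: list) -> int:
    return sum(x * x for x in chunk)            # stand-in for per-chunk work


if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]     # 4 chunks for 4 worker processes
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    print(sum(partial_results))                 # combine the partial results
```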
Distributed Computing
NOTE: Distributed Computing ≠ Parallel Processing
- Parallel processing runs on a single computer.
- Distributed computing runs on multiple independent computers (i.e., nodes).
Concepts:
- Networks: to connect multiple computers
- Horizontal scalability: more machines yield better performance (until capacity limits are reached)
- Fault Tolerance: handle failures of individual nodes / network components without affecting the overall task
Methods:
- MapReduce: process big data with a distributed algorithm on a cluster
- Map: filter & sort data
- Reduce: perform a summary operation
- Distributed Databases: store data across multiple physical locations BUT appear as a single database to users
- Load Balancing: distribute workloads evenly across all nodes to maximize resource usage & minimize response time
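A toy single-machine word count illustrating the MapReduce idea; in a real cluster the map and reduce phases would run on different nodes, and the documents here are made up.

```python
# Toy word count: map each document to (word, 1) pairs, then reduce by summing per word.
from collections import defaultdict
from itertools import chain

docs = ["to be or not to be", "to map and to reduce"]


def map_phase(doc: str) -> list:
    return [(word, 1) for word in doc.split()]   # Map: emit key-value pairs


def reduce_phase(pairs) -> dict:
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n                        # Reduce: summary operation per key
    return dict(counts)


print(reduce_phase(chain.from_iterable(map_phase(d) for d in docs)))
```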