Memory Optimization Techniques for Training Large-Scale Models on Kaggle
Introduction
Training machine learning models on massive datasets can be challenging — especially if you’re working with limited memory resources like Kaggle’s ~30GB RAM limit. In this article, we’ll explore several practical memory optimization techniques that allowed us to train an XGBoost model on a 19GB dataset from the Jane Street competition without hitting memory errors.
1. Use Polars for Lazy Loading
Polars is an increasingly popular data manipulation library that offers high-performance processing and lazy evaluation. Instead of reading the full dataset into memory with a call like polars.read_parquet(), we use:
train = pl.scan_parquet("train.parquet")
With scan_parquet, the data remains lazy and is only loaded into memory when you explicitly call .collect(). This means you can chain multiple transformations (e.g., select columns, filter rows) without fully materializing the dataset until you actually need it.
2. Select & Transform Only Required Columns
A large portion of memory usage comes from unused columns. By specifying only the columns that you genuinely need for training (features, target…