Memory Optimization Techniques for Training Large-Scale Models on Kaggle

Abdur Rehman Khan
3 min read · Dec 24, 2024

Introduction

Training machine learning models on massive datasets can be challenging — especially if you’re working with limited memory resources like Kaggle’s ~30GB RAM limit. In this article, we’ll explore several practical memory optimization techniques that allowed us to train an XGBoost model on a 19GB dataset from the Jane Street competition without hitting memory errors.

1. Use Polars for Lazy Loading

Polars is an increasingly popular data manipulation library that offers high-performance processing and lazy evaluation. Instead of reading the full dataset into memory with a command like polars.read_parquet(), we use:

import polars as pl
train = pl.scan_parquet("train.parquet")  # lazy: no data is read yet

With scan_parquet, the data remains lazy and is only loaded into memory when you explicitly call .collect(). This means you can chain multiple transformations (e.g., select columns, filter rows) without fully materializing the dataset until you actually need it.
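
For illustration, here is a minimal sketch of such a lazy pipeline. The column names "feature_00", "feature_01", and "responder_6" are placeholders; substitute the columns from your own dataset:

import polars as pl

lazy_train = (
    pl.scan_parquet("train.parquet")                        # nothing is read into memory yet
    .select(["feature_00", "feature_01", "responder_6"])    # keep only the columns we need
    .filter(pl.col("responder_6").is_not_null())            # drop rows without a target
)

train_df = lazy_train.collect()  # the query runs and data is materialized only here

Because the transformations are recorded as a query plan, Polars can also prune columns and rows at read time instead of loading everything first.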

2. Select & Transform Only Required Columns

A large portion of memory usage comes from unused columns. By specifying only the columns that you genuinely need for training (features, target…
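
A minimal sketch of this column-selection step, again with placeholder names ("feature_00" through "feature_02" and "responder_6"); the Float32 cast is one possible transform and an assumption on my part, not something the excerpt above prescribes:

import polars as pl

needed_cols = ["feature_00", "feature_01", "feature_02", "responder_6"]

train_df = (
    pl.scan_parquet("train.parquet")
    .select(needed_cols)                                    # unused columns are never loaded
    .with_columns(pl.col(needed_cols).cast(pl.Float32))     # smaller dtype roughly halves memory vs Float64
    .collect()
)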
