Memory Optimization Techniques for Training Large-Scale Models on Kaggle
Introduction
Training machine learning models on massive datasets can be challenging — especially if you’re working with limited memory resources like Kaggle’s ~30GB RAM limit. In this article, we’ll explore several practical memory optimization techniques that allowed us to train an XGBoost model on a 19GB dataset from the Jane Street competition without hitting memory errors.
1. Use Polars for Lazy Loading
Polars is an increasingly popular data manipulation library that offers high-performance processing and lazy evaluation. Instead of reading the full dataset into memory with a call like polars.read_parquet(), we use:
train = pl.scan_parquet("train.parquet")
With scan_parquet, the data remains lazy and is only loaded into memory when you explicitly call .collect(). This means you can chain multiple transformations (e.g., select columns, filter rows) without fully materializing the dataset until you actually need it.
2. Select & Transform Only Required Columns
A large portion of memory usage comes from unused columns. By specifying only the columns that you genuinely need for training (features, target…