Cramming: Training a Language Model on a Single GPU in One Day

I noticed a recent paper that might catch the interest of some DL practitioners here. Below is the arXiv abstract, along with a link to their well-organized GitHub repository. They perform a number of experiments and explicitly note their modifications (to the Transformer architecture, optimization procedure, and data preprocessing), including ones that don’t help – these are set in gray text in the paper. I haven’t read the work in great detail, but it appears to support the empirical scaling laws found by Kaplan et al. (2020).


Cramming: Training a Language Model on a Single GPU in One Day

Jonas Geiping, Tom Goldstein

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day?
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.

[arXiv paper] - [GitHub code]
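For anyone less familiar with the objective mentioned in the abstract: masked language modeling is the BERT-style pretraining task, where a fraction of input tokens is corrupted and the model is trained to reconstruct the originals. A minimal sketch of the standard BERT masking recipe (15% of tokens selected; of those, 80% become `[MASK]`, 10% a random token, 10% unchanged – the token ids and the paper's exact choices are placeholders here):

```python
import random

MASK_ID = 103       # hypothetical [MASK] token id
VOCAB_SIZE = 30522  # BERT-base vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Return (corrupted_ids, labels) for masked language modeling.

    labels[i] holds the original token at masked positions and -100
    (the usual 'ignore' index for cross-entropy) everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok
            roll = rng.random()
            if roll < 0.8:                  # 80%: replace with [MASK]
                corrupted[i] = MASK_ID
            elif roll < 0.9:                # 10%: replace with a random token
                corrupted[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token
    return corrupted, labels

ids = [7592, 2088, 2003, 1037, 3231, 6251]
corrupted, labels = mask_tokens(ids)
```

The loss is then a cross-entropy over the vocabulary, computed only at positions where the label is not -100.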


I would suggest also looking at DeepSpeed’s ZeRO library. It offers efficient offload to CPU/NVMe, layer sharding, and a number of other techniques that push model capacity without requiring many changes to the model architecture or any reduction in precision during training.
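As a pointer, a ZeRO stage-3 setup with CPU offload looks roughly like the following config dict (a sketch following DeepSpeed's documented config schema; the batch size and precision values are illustrative placeholders, not tuned settings):

```python
# Sketch of a DeepSpeed ZeRO stage-3 config with optimizer and
# parameter offload to CPU. Values are illustrative only.
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Typical usage (requires the deepspeed package plus a model):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```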


For anyone following this, NarrowBERT was recently published which claims to speed up pre-training by only calculating self-attention on the masked tokens, which could help train a model faster here.
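My reading of the NarrowBERT idea (the paper's exact formulation may differ) is that only the masked positions act as queries in self-attention, while keys and values still cover the full sequence, so the expensive attention and feedforward work scales with the number of masked tokens rather than the sequence length. A toy numpy sketch of that restriction:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def narrow_attention(x, w_q, w_k, w_v, masked_idx):
    """Single-head attention where only masked positions act as queries.

    x:          (seq_len, d) token representations
    masked_idx: indices of the masked tokens
    Returns updated representations only for the masked positions,
    shape (len(masked_idx), d); unmasked tokens are not updated,
    which is where the compute saving comes from.
    """
    q = x[masked_idx] @ w_q   # queries: masked positions only
    k = x @ w_k               # keys/values: full sequence
    v = x @ w_v
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v

rng = np.random.default_rng(0)
seq_len, d = 8, 4
x = rng.standard_normal((seq_len, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = narrow_attention(x, w_q, w_k, w_v, masked_idx=[1, 5])
```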

Relatedly, on the decoder side, Karpathy has been working on nanoGPT, which reaches performance comparable to base GPT-2 after about one day of training on 8xA100 (and will eventually be covered in his Neural Networks: Zero to Hero course).
