Data parallel on a single GPU for small models

I have written a post where I play around with training multiple models at the same time on a single GPU. Sort of like PyTorch DDP, but for one GPU.

Everyone talks mostly about the next 1 billion+ parameter model, but I have lots of small, even tiny, models which still take a while to train due to large data volumes and inefficient use of the GPU.

I would like to be able to simply define a model and a learner and just start training:

parallel_models = 10

drm = DataParallelEnsembleModule(n=parallel_models, modelfn=RegModel)

learn = Learner(
    dls, drm,
    loss_func=partial(xpemloss, lossfn=MSELossFlat()),
    opt_func=partial(SGD, mom=0.9),
)

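To make the idea concrete, here is a minimal sketch of what a module and loss like the ones above might look like in plain PyTorch. This is my guess at the shape of the approach, not the post's actual implementation: `DataParallelEnsembleModule` and `xpemloss` here are stand-ins matching the names in the snippet, and the tiny `RegModel` is an assumed toy regressor.

```python
import torch
import torch.nn as nn

class DataParallelEnsembleModule(nn.Module):
    """Sketch: hold n independent copies of a small model and run
    them all on the same batch in a single forward pass."""
    def __init__(self, n, modelfn):
        super().__init__()
        self.models = nn.ModuleList(modelfn() for _ in range(n))

    def forward(self, x):
        # Stack per-model outputs along a new leading "model" dimension.
        return torch.stack([m(x) for m in self.models], dim=0)

def xpemloss(preds, targ, lossfn):
    """Sketch of an ensemble loss: apply lossfn to each model's
    predictions and sum, so every copy gets its own gradient signal."""
    return sum(lossfn(p, targ) for p in preds)

# Toy stand-in for RegModel (assumption, not from the post).
class RegModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 1)
    def forward(self, x):
        return self.lin(x)

drm = DataParallelEnsembleModule(n=3, modelfn=RegModel)
x = torch.randn(8, 4)
out = drm(x)  # shape (3, 8, 1): one set of predictions per model copy
loss = xpemloss(out, torch.randn(8, 1), nn.MSELoss())
loss.backward()  # each copy receives gradients independently
```

Because all copies share one forward pass over the same batch, the GPU sees one larger workload instead of ten tiny sequential ones, which is where the utilization win would come from.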

Full implementation in the post, which by the way is written as a notebook and exported using nbdev_nb2md. Not quite fastpages, but I believe it still counts. :slight_smile:


Just the other day I was thinking it would be really nice to train multiple tabular models on different folds at the same time, while watching one CPU core get pegged but GPU usage sit at only 25%, even after switching to faster dataloaders than TabularPandas.

Tabular seems like a good application for this, as the models are small and the training can be more volatile across folds compared to image training.
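One way the fold idea could ride on the same single-GPU ensemble trick is to give every model a copy of the full batch but mask its loss so each copy only trains on its own fold. A rough sketch, with made-up sizes and a plain linear model standing in for a tabular one:

```python
import torch
import torch.nn as nn

n_folds, n_samples, n_feat = 4, 12, 3
# Assign each sample to a fold; model k trains on every fold except k.
fold_id = torch.arange(n_samples) % n_folds

models = nn.ModuleList(nn.Linear(n_feat, 1) for _ in range(n_folds))
x, y = torch.randn(n_samples, n_feat), torch.randn(n_samples, 1)

per_model_losses = []
for k, m in enumerate(models):
    train_mask = fold_id != k          # hold out fold k for model k
    err = (m(x) - y)[train_mask]       # loss only over this model's training rows
    per_model_losses.append((err ** 2).mean())

loss = torch.stack(per_model_losses).sum()
loss.backward()  # one backward pass updates all fold models at once
```

All folds then train in one pass per batch instead of one full training run per fold, which is exactly the underutilized-GPU scenario described above.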


I’ll certainly have a look at using this with Tabular. Can’t see why it shouldn’t work. There are still a few loose ends I want to tidy up and verify before moving on from my specific use case, though.