Training a Language Model on multiple GPUs

I am trying to train a language model. After the first week’s lecture, I tried text classification. The first step of fine-tuning a pre-trained language model worked nicely. Then I tried to fine-tune the language model itself. This worked when I used only one GPU. Next, I wanted to experiment with 2 GPUs, so I used the following code:

df_raw_very_small = df_raw.copy().sample(frac=0.01, random_state=42)

dls_lm_very_small_stack_128 = DataBlock(
    blocks=TextBlock.from_df(text_cols='text', seq_len=72, is_lm=True),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1, seed=123)
).dataloaders(df_raw_very_small, path=path, bs=128, drop_last=True)

learn = language_model_learner(dls_lm_very_small_stack_128, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()])

learn = learn.to_fp16()

with learn.distrib_ctx():
    learn.fit_one_cycle(1, 2e-2)
             
learn.save('lm_stack_exc_small') 

I am getting the following error:

*Error when trying to collate the data into batches with fa_collate, at least two tensors in the batch are not the same size.*

*Mismatch found on axis 0 of the batch and is of type `LMTensorText`:*
*        Item at index 0 has shape: torch.Size([72])*
*        Item at index 64 has shape: torch.Size([3])*

*Please include a transform in `after_item` that ensures all data of type LMTensorText is the same size.*

When I run on one GPU, the sequence length for the last batch is 3. So I think the error might be caused by the last batch having a sequence length of 72 on the first GPU but only 3 on the second GPU. I have not been able to find a solution. I tried using drop_last=True, and I also tried using rank0_first for the dataloaders.
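
For reference, here is a rough sketch of one way to check the shape of the last training batch on a single GPU (this assumes the dls_lm_very_small_stack_128 dataloaders built above, without drop_last):

last_x, last_y = None, None
for xb, yb in dls_lm_very_small_stack_128.train:
    last_x, last_y = xb, yb
# the final batch keeps the full batch size but a shorter sequence length,
# e.g. something like torch.Size([128, 3])
print(last_x.shape, last_y.shape)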

I’m not sure I’ve ever tried doing multi-GPU training of a language model. In general, I strongly recommend avoiding multi-GPU training where possible, and instead running a different experiment on each GPU.

Thanks, I will keep that in mind. So basically, as long as the data can fit in memory, it is better to use a single GPU. Also, if my raw data for the language model contains HTML tags, should I remove them before passing the data to the fastai text dataloaders? In general, what kind of pre-processing would you recommend for a language model?
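
In case it helps, here is a minimal sketch of the kind of tag stripping I had in mind, assuming the df_raw dataframe with a 'text' column from the code above (fastai's default fix_html rule cleans up common HTML escape artifacts, but not general tags):

import re

def strip_html_tags(s):
    # crude removal of anything that looks like an HTML tag
    return re.sub(r'<[^>]+>', ' ', s)

df_raw['text'] = df_raw['text'].map(strip_html_tags)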

I found the answer after reading Chapter 10 of the book. Thanks for creating these wonderful resources.


Hello. I have also hit the same error. I’ll continue down the path but figured I would report the issue here in case it helps to get this fixed. I can also create a GH issue if that is helpful.

The issue arises because of the last batch: its size can differ across GPUs. If you have 1210 observations and your batch size on each GPU is 200, then the last batch on one GPU will have 200 observations while on the other GPU it will have only 10. So you have to make sure that the size of the last batch is the same on every GPU. An easy fix is to keep the number of observations a multiple of n * batch size, where n is the number of GPUs, and drop the rest.
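
A minimal sketch of that workaround, assuming 2 GPUs, a per-GPU batch size of 128, and the df_raw_very_small dataframe from the first post:

n_gpus, bs = 2, 128
keep = (len(df_raw_very_small) // (n_gpus * bs)) * (n_gpus * bs)
# drop the remainder so every GPU can be served full batches
df_trimmed = df_raw_very_small.iloc[:keep]

Note that for a language-model dataloader the batches are built from token chunks rather than raw rows, so trimming rows like this only approximates the idea; the exact multiple may need tuning.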