Training a language model on multiple GPUs

I am trying to train a language model. After the first week’s lecture, I tried text classification; the first step, fine-tuning a pre-trained language model, worked nicely. Then I tried fine-tuning the language model itself. This works if I use only one GPU. Next, I wanted to experiment with 2 GPUs, using the following code:

df_raw_very_small = df_raw.copy().sample(frac=0.01, random_state=42)

dls_lm_very_small_stack_128 = DataBlock(
    blocks=TextBlock.from_df(text_cols='text', seq_len=72, is_lm=True),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1, seed=123)
).dataloaders(df_raw_very_small, path=path, bs=128, drop_last=True)

learn = language_model_learner(dls_lm_very_small_stack_128, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()])

learn = learn.to_fp16()

with learn.distrib_ctx():
    learn.fit_one_cycle(1, 2e-2)
             
learn.save('lm_stack_exc_small') 

I am getting the following error:

*Error when trying to collate the data into batches with fa_collate, at least two tensors in the batch are not the same size.*

*Mismatch found on axis 0 of the batch and is of type `LMTensorText`:*
*        Item at index 0 has shape: torch.Size([72])*
*        Item at index 64 has shape: torch.Size([3])*

*Please include a transform in `after_item` that ensures all data of type LMTensorText is the same size.*

When I run on one GPU, the sequence length for the last batch is 3. So I think what might be causing the error is that on the first GPU the sequence length for the last batch is 72, whereas on the second GPU it is 3. I am not able to find a solution. I tried using drop_last=True. I also tried using rank0_first for the dataloaders.
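To convince myself of where the short batch comes from, here is a toy sketch (plain Python, nothing fastai-specific, numbers chosen to match my error message) of how chunking a token stream into fixed-length sequences leaves a ragged tail:

```python
# Toy illustration (not fastai internals): chunking a token stream into
# fixed-length sequences leaves a short tail whose length differs from
# the rest -- which is exactly what fa_collate complains about.
seq_len = 72
stream = list(range(72 * 64 + 3))   # 64 full sequences plus 3 leftover tokens
chunks = [stream[i:i + seq_len] for i in range(0, len(stream), seq_len)]
lengths = [len(c) for c in chunks]
# lengths[0] == 72 and lengths[64] == 3, mirroring the shapes in the error.
full_chunks = [c for c in chunks if len(c) == seq_len]  # drop the ragged tail
print(lengths[0], lengths[64], len(full_chunks))  # -> 72 3 64
```

So dropping (or padding) the incomplete tail sequence before the batches are split across GPUs would make all items the same size.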

I’m not sure I’ve ever tried doing multi-GPU training of a language model. In general, I strongly recommend avoiding multi-GPU training where possible, and instead run a different experiment on each GPU.

Thanks, I will keep that in mind. So basically, as long as the data can fit in memory, it is better to use a single GPU. Also, if my raw data for language models has HTML tags, should I remove those before passing the data to the fastai text dataloaders? In general, what kind of pre-processing would you recommend for a language model?
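For context, this is the kind of cleanup I had in mind — a minimal sketch with a hypothetical `strip_html` helper (I understand fastai's tokenizer applies some default rules like `fix_html` for common artifacts, but raw tags may still need removing beforehand):

```python
import html
import re

def strip_html(s: str) -> str:
    """Hypothetical pre-processing helper: drop HTML tags, unescape
    entities, and collapse whitespace before building the dataloaders."""
    s = re.sub(r'<[^>]+>', ' ', s)       # remove tags such as <p> or <br/>
    s = html.unescape(s)                 # &amp; -> &, &lt; -> <, etc.
    return re.sub(r'\s+', ' ', s).strip()

# e.g. applied to the 'text' column before DataBlock.dataloaders:
# df_raw['text'] = df_raw['text'].map(strip_html)
print(strip_html('<p>Hello &amp; welcome<br/>back</p>'))  # -> Hello & welcome back
```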

I found the answer after reading Chapter 10 of the book. Thanks for creating these wonderful resources.
