I am trying to train a language model. After the first week's lecture I tried text classification, where the first step is fine-tuning a pre-trained language model, and that worked nicely, but only as long as I use a single GPU. Next, I wanted to experiment with 2 GPUs, so I used the following code:
```python
from fastai.text.all import *
from fastai.distributed import *

# work with a 1% sample of the data for quick experiments
df_raw_very_small = df_raw.copy().sample(frac=0.01, random_state=42)

dls_lm_very_small_stack_128 = DataBlock(
    blocks=TextBlock.from_df(text_cols='text', seq_len=72, is_lm=True),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1, seed=123)
).dataloaders(df_raw_very_small, path=path, bs=128, drop_last=True)

learn = language_model_learner(dls_lm_very_small_stack_128, AWD_LSTM, drop_mult=0.3,
                               metrics=[accuracy, Perplexity()])
learn = learn.to_fp16()

with learn.distrib_ctx():
    learn.fit_one_cycle(1, 2e-2)
learn.save('lm_stack_exc_small')
```
I am getting the following error:
```
Error when trying to collate the data into batches with fa_collate, at least two tensors in the batch are not the same size.

Mismatch found on axis 0 of the batch and is of type `LMTensorText`:
    Item at index 0 has shape: torch.Size([72])
    Item at index 64 has shape: torch.Size([3])

Please include a transform in `after_item` that ensures all data of type LMTensorText is the same size.
```
When I run on one GPU, the sequence length for the last batch is 3, so I suspect the error comes from the two devices disagreeing: the first GPU gets a last batch with sequence length 72 while the second GPU gets one with sequence length 3. I have not been able to find a solution. I tried passing drop_last=True (as in the code above), and I also tried using rank0_first for the dataloaders, without success.
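This is roughly how I looked at the trailing batch on a single GPU:

```python
# Iterate the training DataLoader to reach the final batch:
# every batch is (bs, 72) except the last, where the sequence axis is 3.
for xb, yb in dls_lm_very_small_stack_128.train:
    pass
print(xb.shape, yb.shape)
```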
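For completeness, this is roughly how I wrapped the DataLoaders creation in rank0_first (reconstructed from memory, so treat it as a sketch):

```python
from fastai.distributed import rank0_first

# rank0_first runs the callable on rank 0 first, then on the remaining ranks,
# so the tokenization cache built by rank 0 gets reused instead of rebuilt.
dls_lm_very_small_stack_128 = rank0_first(lambda: DataBlock(
    blocks=TextBlock.from_df(text_cols='text', seq_len=72, is_lm=True),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1, seed=123)
).dataloaders(df_raw_very_small, path=path, bs=128, drop_last=True))
```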
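Following the hint in the error message, I also sketched an `after_item` transform that pads short `LMTensorText` chunks up to seq_len. I have not verified that item_tfms is honored by the LM DataLoader, and I am unsure padding is even correct for language modelling since it would alter the targets too, so this is just an untested idea:

```python
import torch.nn.functional as F
from fastai.text.all import LMTensorText, Transform

class PadLMChunk(Transform):
    "Pad a short LMTensorText chunk on the right up to seq_len (untested idea)."
    def __init__(self, seq_len=72, pad_idx=1):  # pad_idx=1 is fastai's default pad token
        self.seq_len, self.pad_idx = seq_len, pad_idx
    def encodes(self, x: LMTensorText):
        if x.shape[0] < self.seq_len:
            x = LMTensorText(F.pad(x, (0, self.seq_len - x.shape[0]), value=self.pad_idx))
        return x

# hypothetical usage: add item_tfms=PadLMChunk(seq_len=72) to the DataBlock above
```

Has anyone seen this mismatch with `distrib_ctx` before, or knows the right way to handle the short trailing sequence across GPUs?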