TextDataLoaders for Language Model - Incorrect Length?

Hello,
I’m training a ULMFiT model for text classification, starting with the language model - but I think the length of the data in my TextDataLoaders is incorrect.

I ran:

from fastai.text.all import TextDataLoaders
import wandb

dls_lm = TextDataLoaders.from_df(ans,
                                 text_col=wandb.config.text_col,
                                 is_lm=True,
                                 seq_len=wandb.config.seq_len)

This uses the DataFrame “ans”, which has 101,618 rows. While it was training I noticed it ran fewer batches than expected: 459 batches at a batch size of 64 (double-checked), which works out to just 29,376 samples per epoch. Am I misunderstanding the length and batch size values, or is it throwing away data?
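For anyone wanting to read the same numbers off their own loader, this is how I got them (a quick sketch; len() of a fastai DataLoader is its batch count, and the values in the comments are from my run):

print(len(dls_lm.train))   # batches per epoch -> 459 here
print(dls_lm.train.bs)     # batch size -> 64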

I checked and there are no nulls in the text field of the DataFrame I’m passing in.


For anyone else looking into this: the reason appears to be that LMDataLoader strings all the texts together into one big corpus and then steps across that, which is probably a clever way to save memory. The relevant code lives in LMDataLoader in fastai.text.data.
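In other words, the batch count comes from the total token count of the concatenated corpus, not from the row count. Here is a minimal sketch of that arithmetic (paraphrased from LMDataLoader.__init__, not the exact source; the seq_len=72 default and the ~21-token average document length are my assumptions, since my real seq_len comes from wandb.config):

# Paraphrase of how LMDataLoader sizes itself (a sketch, not the exact fastai source).
def lm_n_batches(doc_lens, bs=64, seq_len=72):
    corpus = sum(doc_lens) - 1                     # -1 leaves room for the final target token
    bl = corpus // bs                              # tokens per parallel stream; the leftover tail is dropped
    return bl // seq_len + int(bl % seq_len != 0)  # batches needed to step across one stream

# 101,618 rows averaging ~21 tokens each is only ~2.1M tokens, so:
print(lm_n_batches([21] * 101_618, bs=64, seq_len=72))  # -> 464, the same ballpark as my 459

So the 29,376 “samples” are seq_len-token sequences carved out of the joined corpus, and every row is still in there; the length just isn’t counted in rows, and only the tail that doesn’t fill a full stream is dropped.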

I’m assuming there are tokens marking the start and end of each entry, so that the language model doesn’t try to predict the first word of one entry from the last word of the previous one.
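You can see this for yourself by decoding a batch back to tokens (dls_lm is the loader from above); the special markers fastai’s tokenizer inserts, such as xxbos, show up where one text ends and the next begins:

# Decode a couple of language-model sequences to inspect the boundary tokens.
dls_lm.show_batch(max_n=2)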

Appreciate you sharing your solution.

Thanks, Ivan. I always hate finding threads where the OP just comes back and says “fixed it” without saying what they did :).
