Questions on torchtext and padding as a regularizer

Hi @jeremy, recently I’ve been working on Kaggle’s toxic comment dataset, trying out different methods using PyTorch and the fastai library. There are two things that have got me confused:

  1. What’s your rationale behind building the new fastai/text.py module without using torchtext? I saw the post here where you agreed with @Deb that there’s a problem with torchtext's sequential tokenization. Could you shed some light on this issue?

  2. I saw that one of the competitors in Kaggle’s toxic comment competition posted the following thought:

Padding as a regularizer

I built all the models in PyTorch. This gives you huge flexibility, but I struggled for a long time to replicate the results people were achieving with simple GRU models in keras. It turns out the biggest difference was sequence padding. My PyTorch code used variable length sequences (data split into buckets and then padded). Padding all sequences to the same length appears to have a significant regularising effect, so my best results were achieved by using a single or very small number of buckets.
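
To make sure I’m reading that correctly, here is a minimal sketch of how I understand the two padding strategies in PyTorch. The toy batch, lengths and `MAX_LEN` below are purely illustrative, not the competitor’s actual setup:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# A toy batch of numericalized comments with very different lengths.
batch = [torch.randint(1, 100, (n,)) for n in (12, 37, 180)]

# Strategy A: variable-length / bucketed padding -- group sequences of
# similar length into buckets so each batch is padded only up to its own
# longest sequence, keeping the amount of padding small.
bucketed = pad_sequence(batch, batch_first=True, padding_value=0)
print(bucketed.shape)  # torch.Size([3, 180])

# Strategy B: pad (or truncate) every sequence to one fixed length,
# which is roughly what the simple Keras GRU models were doing.
MAX_LEN = 300
fixed = torch.zeros(len(batch), MAX_LEN, dtype=torch.long)
for i, seq in enumerate(batch):
    n = min(len(seq), MAX_LEN)
    fixed[i, :n] = seq[:n]  # post-padding; pre-padding would put the sequence at the end
print(fixed.shape)  # torch.Size([3, 300]) -- many more padding tokens per example
```

If I understand the claim, strategy B feeds the GRU a lot of extra padding tokens per example, and that extra padding is what seems to act as a regularizer.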

Another competitor then replied with the following message:

I remember reading a few months back somewhere on Keras’s GitHub issues discussions, @jhoward commented on PyTorch vs Keras padding and the effect it has on regularization, as well as the effect of pre- vs post-padding. I wish you had seen that, it would have saved you some trouble ;-).

I couldn’t find the GitHub issue, but my Google-fu brought me to one of your tweets from January:

Turns out having lots of padding was somehow regularizing the model. It took more epochs to train, and ended with a better accuracy. I’ve now increased the dropout on the fixed model, and get the same performance.
Something interesting going on there…

Since then, have you had any additional thoughts on why padding provides a regularizing effect? Also, how exactly do you pad your text to get this effect? Did this effect lead to the rebuilding of fastai/nlp.py?

Many thanks!

Here I describe my experience; Prof. Jeremy would have the bigger picture. I used both torchtext and padding, but I pulled out and handled all tokenization/numericalization/padding outside torchtext. For tokenization I used spacy because it gave me good accuracy. Tokenization in general is a memory- and time-expensive process, and it happens on your CPU. Here were the challenges:

  1. When a text field is split into tokens, the memory requirement is huge until it is numericalized.
  2. The faster the text-to-numbers conversion happens, the faster the code runs. It was taking over 1 hour and I had to bring it down to under 10 minutes.

My way of handling the runtime/memory using torchtext (a rough sketch of these steps follows after the list):

  1. Build the vocab ahead of time on a subset (20% = 300K examples) using Fields; doing that on all 1.5 million examples was not needed.
  2. Use multiprocessing to apply the tokenization and vocab. Multithreading does not utilize all cores, and the individual DataFrame operations are already multithreaded.
  3. Keep memory low: define classes to keep track of objects in memory, and use gc.collect().
  4. Take care of padding etc.: since I have numbers instead of text, I modified the language model and final NN code so that padding is a number.
  5. Then apply the vocab to check whether the language model output makes sense.
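
Since the actual code is not in this post, here is only a rough, self-contained sketch of the general shape of steps 1–4. The helper names (`build_vocab`, `numericalize_all`, `pad_batch`), the vocab size and the worker count are made up for illustration:

```python
import gc
from collections import Counter
from functools import partial
from multiprocessing import Pool

import spacy

nlp = spacy.blank('en')      # tokenizer only -- no tagger/parser, keeps it fast
PAD, UNK = 0, 1              # padding is just another number downstream (step 4)

def tokenize(text):
    return [t.text.lower() for t in nlp.tokenizer(text)]

def build_vocab(texts, max_size=60000):
    # Step 1: build the vocab on a ~20% subset rather than all 1.5M rows.
    counts = Counter(tok for text in texts for tok in tokenize(text))
    itos = ['_pad_', '_unk_'] + [w for w, _ in counts.most_common(max_size)]
    return {w: i for i, w in enumerate(itos)}

def numericalize(text, stoi):
    return [stoi.get(tok, UNK) for tok in tokenize(text)]

def numericalize_all(texts, stoi, workers=8):
    # Step 2: multiprocessing (not threads) so every core does real work.
    with Pool(workers) as pool:
        ids = pool.map(partial(numericalize, stoi=stoi), texts)
    gc.collect()             # step 3: release the intermediate token lists
    return ids

def pad_batch(seqs, max_len):
    # Step 4: pad with the number PAD, not a string token.
    return [s[:max_len] + [PAD] * max(0, max_len - len(s)) for s in seqs]
```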

It was challenging to handle tokenization and apply the vocab outside torchtext. Hope that helps.


Thanks Deb! Many of your points make perfect sense and are quite eye-opening. I rarely see discussion on the internet about writing efficient data loaders.

By the way, would you mind sharing some of your code for how you handle the runtime/memory using torchtext? I’ve been meaning to try writing high-performance code in Python, and I’m quite new to some of the ideas that you’ve mentioned :slight_smile:


Sure Alex. The topic you opened is a deep one :slight_smile: I would be happy to share the code. This is code I was experimenting with on a public dataset (from Kaggle’s Mercari price prediction) because it was similar to what I needed to implement for production (multiple sequence fields). Unfortunately I have not added any descriptions, and it has many proof-of-concept experiments. For the code in production I just combined classes similar to ProcessDataFrame and MixedTextDataset (and of course it’s on a different dataset). The 20 minutes includes brand fixing, which takes about half of the time. Hopefully I will clean this code up over the next 1-2 weeks.


Thank you @Deb! Very useful!!!


Lots of gems! Thanks!
