DataLoader hanging while training a big dataset on Paperspace

Please confirm you have the latest versions of fastai, fastcore, fastscript, and nbdev prior to reporting a bug (delete one): YES

Screen Shot 2020-10-15 at 6 09 54 PM

Hey guys, I have a large dataset of 400k text documents that I’m doing a language model on while reading Chapter 10 of the book.

The DataLoader setup process always seems to hang at some point. When I use top at the beginning of loading, it shows 6-8 python processes. When I use top after the freeze it says there are some number of ‘zombie processes’ which I’m assuming have thrown an error but are not terminating.

Is there a correct way to debug this sort of thing, something like a try/catch but for items in the DataLoader ? Currently I can’t get a stacktrace since it never exits. Maybe the tokenizer is hitting an error?

It looks like it’s not hanging completely, but at around 40% just grinds to an absolute slugcrawl - from 200 items/second to around 3. Then it jumps back and forth, then the kernel dies.