TextDataLoaders creates invalid tensors (stack expects each tensor to be equal size but got...)

I am trying to generate a simple TextDataLoaders instance to train a language model based on a folder tree of plain text files.

I am using fastai 2.4 and the following, relatively trivial piece of code is failing:

from fastai.text.all import *

dls = TextDataLoaders.from_folder(corpus_folder,
                                  shuffle=False)

During the call of fit_one_cycle I get the following error:

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in __torch_function__(cls, func, types, args, kwargs)
   1022         with _C.DisableTorchFunction():
-> 1023             ret = func(*args, **kwargs)
   1024             return _convert(ret, cls)

RuntimeError: stack expects each tensor to be equal size, but got [72] at entry 0 and [76] at entry 58

There has been a previous discussion about such an error in the following post but without any resolution: Fastai v2 text - #431 by chess

Does anybody understand what is going wrong here? Somehow the TextDataLoaders class is not correctly splitting and padding the content but I really don’t know where this happens and how this problem should be remedied.

Has anybody been able to train a language model based on TextDataLoaders.from_folder( … ) with recent releases of fastai?

Does anyone have an idea how I could work around the problem?

Many thanks in advance!

(P.S. I have been able to work around the problem for small datasets by first loading all data into a dataframe with a single column and then using this as the basis for the language model. With the amount of data I need, however, this approach exhausts the RAM of the machines on Google Colab, because using a dataframe causes multiple copies of the entire corpus to be held in RAM.)

I found the cause and it was a bug on my side: some of the input files had a size of 0 bytes and this tripped up the batch creation.

The underlying cause was a buggy custom tokenizer that in some cases produced empty output. I then switched back to the standard spaCy tokenizer, but since TextDataLoaders.from_folder pre-tokenizes the input data into a separate _tok folder, the previously tokenized output was still in place, so the standard tokenizer was not actually applied and the problem remained. In addition, some files were only a few bytes long and simply had no meaningful content.

So the moral of the story is: check your tokenizer implementation, and make sure you clear the _tok folder whenever the tokenizer implementation changes. If there is data for which the tokenizer cannot produce meaningful output, those input files need to be deleted, because TextDataLoaders.from_folder cannot handle files that contain no tokens.
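
In case it helps anyone hitting the same error, here is a rough sketch of the two checks described above, using only the standard library. The helper names (find_empty_text_files, clear_token_cache) are made up for illustration and are not part of fastai. It also assumes the corpus consists of .txt files and that the pre-tokenized cache is a sibling folder named after the corpus folder with a _tok suffix, so check the actual location on your machine before deleting anything.

```python
from pathlib import Path
import shutil


def find_empty_text_files(folder, min_chars=1, pattern="*.txt"):
    """Return files under `folder` with fewer than `min_chars` non-whitespace characters.

    Files are read one at a time, so the whole corpus is never held in RAM.
    The "*.txt" pattern is an assumption; adjust it to your corpus.
    """
    suspicious = []
    for path in sorted(Path(folder).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Count characters after stripping all whitespace, so files that
        # contain only blanks or newlines are flagged as well.
        if len("".join(text.split())) < min_chars:
            suspicious.append(path)
    return suspicious


def clear_token_cache(corpus_folder):
    """Delete the sibling '<name>_tok' folder if it exists; return True if it was deleted.

    The '<name>_tok' location is an assumption about where the
    pre-tokenized files end up.
    """
    corpus_folder = Path(corpus_folder)
    tok_folder = corpus_folder.parent / f"{corpus_folder.name}_tok"
    if tok_folder.is_dir():
        shutil.rmtree(tok_folder)
        return True
    return False
```

Typical use would be to print the result of find_empty_text_files(corpus_folder) first, review and remove those files by hand, and then call clear_token_cache(corpus_folder) before building the TextDataLoaders again. Raising min_chars also catches the files that are only a few bytes long.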
