TextDataLoaders creates invalid tensors (stack expects each tensor to be equal size but got...)

I am trying to generate a simple TextDataLoaders instance to train a language model based on a folder tree of plain text files.

I am using fastai 2.4 and the following, relatively trivial piece of code is failing:

from fastai.text.all import *

dls = TextDataLoaders.from_folder(corpus_folder,
                                  valid_pct=0.1,
                                  is_lm=True,
                                  shuffle=False)
dls.show_batch(max_n=3)

# learn is a language-model learner built on dls, e.g. language_model_learner(dls, AWD_LSTM)
learn.fit_one_cycle(15)

During the call to fit_one_cycle I get the following error:

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in __torch_function__(cls, func, types, args, kwargs)
   1021 
   1022         with _C.DisableTorchFunction():
-> 1023             ret = func(*args, **kwargs)
   1024             return _convert(ret, cls)
   1025 

RuntimeError: stack expects each tensor to be equal size, but got [72] at entry 0 and [76] at entry 58

There has been a previous discussion of this error in the following post, but without any resolution: Fastai v2 text - #431 by chess

Does anybody understand what is going wrong here? Somehow the TextDataLoaders class is not splitting and padding the content correctly, but I don’t know where this happens or how to remedy it.

Has anybody been able to train a language model based on TextDataLoaders.from_folder( … ) with recent releases of fastai?

Does anyone have an idea how I could work around the problem?

Many thanks in advance!
Christian

(P.S. I have been able to work around the problem for small datasets by first loading all data into a dataframe with a single column and then using that as the basis for the language model. But with the amount of data I need, this approach exhausts the RAM of the machines on Google Colab: going through a dataframe keeps multiple copies of the entire corpus in memory, which is simply not feasible.)
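For completeness, the dataframe workaround looked roughly like the sketch below. The corpus_folder path and the .txt glob are just placeholders for my setup; the point is simply that everything ends up in a single text column that TextDataLoaders.from_df can consume:

import pandas as pd
from pathlib import Path
from fastai.text.all import *

# read every text file into one in-memory dataframe with a single 'text' column
texts = [f.read_text(encoding='utf-8') for f in Path(corpus_folder).rglob('*.txt')]
df = pd.DataFrame({'text': texts})

# build the language-model DataLoaders from the dataframe instead of the folder
dls = TextDataLoaders.from_df(df, text_col='text', is_lm=True, valid_pct=0.1)

This works for small corpora, but as noted above it keeps the whole corpus (plus its tokenized copy) in RAM, so it does not scale.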

I found the cause and it was a bug on my side: some of the input files had a size of 0 bytes and this tripped up the batch creation.

The original cause was a buggy custom tokenizer which in some cases produced an empty file. I then tried to go back to the standard spaCy tokenizer, but since TextDataLoaders.from_folder pre-tokenizes the input data into a separate _tok folder, the cached output of the buggy tokenizer was still being used and the problem remained. Furthermore, I had some files with only a few bytes for which there simply was no meaningful content.

So the moral of the story is: check your tokenizer implementations, make sure you clear the _tok folder whenever the tokenizer implementation changes, and delete any input files for which the tokenizer cannot produce meaningful output, because TextDataLoaders.from_folder cannot handle files that contain no tokens.
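For reference, here is a minimal clean-up sketch of what I ran before rebuilding the DataLoaders. The corpus_folder path is an example, and the _tok location assumes fastai's default of writing the pre-tokenized data into a sibling folder with a _tok suffix:

import shutil
from pathlib import Path

corpus_folder = Path('corpus')  # example path to the folder tree of .txt files

# 1. clear the cached pre-tokenized data so the current tokenizer is actually used
tok_folder = corpus_folder.parent / f'{corpus_folder.name}_tok'
if tok_folder.exists():
    shutil.rmtree(tok_folder)

# 2. delete input files with no usable content (here simply: zero bytes)
for f in corpus_folder.rglob('*.txt'):
    if f.stat().st_size == 0:
        f.unlink()

After that, the TextDataLoaders.from_folder call above built its batches without the stack error.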
