I am trying to generate a simple TextDataLoaders instance to train a language model based on a folder tree of plain text files.
I am using fastai 2.4, and the following, relatively trivial, piece of code fails:
```python
from fastai.text.all import *

dls = TextDataLoaders.from_folder(corpus_folder, valid_pct=0.1, is_lm=True, shuffle=False)
dls.show_batch(max_n=3)
learn = language_model_learner(dls, AWD_LSTM)  # learner creation elided in my first draft of this post
learn.fit_one_cycle(15)
```
During the call to fit_one_cycle I get the following error:
```
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in __torch_function__(cls, func, types, args, kwargs)
   1021
   1022         with _C.DisableTorchFunction():
-> 1023             ret = func(*args, **kwargs)
   1024         return _convert(ret, cls)
   1025

RuntimeError: stack expects each tensor to be equal size, but got  at entry 0 and  at entry 58
```
There was a previous discussion of this error in the following post, but without any resolution: Fastai v2 text - #431 by chess
Does anybody understand what is going wrong here? Somehow the TextDataLoaders class is not splitting and padding the content correctly, but I really don't know where this happens or how it should be remedied.
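For what it's worth, here is my understanding of what *should* happen with `is_lm=True`, sketched in plain Python (no fastai; `lm_chunks` is my own illustrative helper, not a fastai function): a language-model dataloader conceptually concatenates all tokenized documents into one stream and cuts it into equal-length chunks, so every tensor in a batch has the same size by construction and no padding is needed.

```python
def lm_chunks(docs, seq_len):
    """Concatenate tokenized documents into one stream and split it into
    equal seq_len chunks, dropping the ragged remainder - conceptually
    what an LM dataloader does, so torch.stack should never see
    unequal-size tensors."""
    stream = [tok for doc in docs for tok in doc]
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

docs = [["the", "cat", "sat"], ["on", "the", "mat", "today"], ["end"]]
chunks = lm_chunks(docs, seq_len=4)
assert all(len(c) == 4 for c in chunks)  # every chunk is the same length
```

If the batches with `is_lm=True` are instead being padded per document, that would explain how unequal tensors could reach `torch.stack` — but I don't know where in the pipeline that happens.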
Has anybody been able to train a language model based on TextDataLoaders.from_folder( … ) with recent releases of fastai?
Does anyone have an idea how I could work around the problem?
Many thanks in advance!
(P.S. For small datasets I have been able to work around the problem by first loading all the data into a dataframe with a single column and using that as the basis for the language model. With the amount of data I need, however, this approach is unfeasible: it exhausts the RAM of the machines on Google Colab, since using a dataframe for this purpose holds multiple copies of the entire corpus in memory.)