I am trying to generate a simple TextDataLoaders instance to train a language model based on a folder tree of plain text files.
I am using fastai 2.4 and the following, relatively trivial piece of code is failing:
from fastai.text.all import *

dls = TextDataLoaders.from_folder(corpus_folder,
                                  valid_pct=0.1,
                                  is_lm=True,
                                  shuffle=False)
dls.show_batch(max_n=3)

# (assumed) standard language-model learner setup, omitted from the original snippet
learn = language_model_learner(dls, AWD_LSTM)
learn.fit_one_cycle(15)
During the call to fit_one_cycle I get the following error:
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in __torch_function__(cls, func, types, args, kwargs)
1021
1022 with _C.DisableTorchFunction():
-> 1023 ret = func(*args, **kwargs)
1024 return _convert(ret, cls)
1025
RuntimeError: stack expects each tensor to be equal size, but got [72] at entry 0 and [76] at entry 58
This error was discussed in an earlier thread, but without any resolution: Fastai v2 text - #431 by chess
Does anybody understand what is going wrong here? The message suggests that the tensors being collated into a batch have different lengths (72 vs. 76 tokens), i.e. TextDataLoaders is somehow not splitting and padding the content into equal-length sequences, but I don't know where in the pipeline this happens or how it should be remedied.
Has anybody been able to train a language model based on TextDataLoaders.from_folder( … ) with recent releases of fastai?
Does anyone have an idea how I could work around the problem?
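One direction I have been considering is dropping down to the DataBlock API and building the loaders explicitly, in case that sidesteps the collation issue. A rough, untested sketch (the seq_len and bs values are arbitrary examples):

from fastai.text.all import *

# build language-model DataLoaders explicitly via the DataBlock API
dblock = DataBlock(blocks=TextBlock.from_folder(corpus_folder, is_lm=True),
                   get_items=get_text_files,
                   splitter=RandomSplitter(valid_pct=0.1))
dls = dblock.dataloaders(corpus_folder, bs=64, seq_len=72)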
Many thanks in advance!
Christian
(P.S. For small datasets I have been able to work around the problem by first loading all the data into a dataframe with a single column and using that as the basis for the language model. With the amount of data I need, however, this approach is infeasible: it keeps multiple copies of the entire corpus in RAM and exhausts the memory of the machines on Google Colab.)
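For completeness, that small-dataset workaround looks roughly like this (a sketch; the file-reading loop and the 'text' column name are my own choices):

import pandas as pd
from fastai.text.all import *

# read every text file into a single-column dataframe; note this duplicates
# the whole corpus in RAM, which is what breaks for large datasets
files = get_text_files(corpus_folder)
df = pd.DataFrame({'text': [f.read_text() for f in files]})
dls = TextDataLoaders.from_df(df, text_col='text', is_lm=True, valid_pct=0.1)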