You might want to check out a pull request I just made to the fastai repo.
I’m not sure it does exactly what you are trying to do, but I added two new classes to nlp.py that allow you to build a LanguageModelData object from dataframes instead of multiple text files. Take a look. I’m not sure it will get accepted, but it’s working for me on the spooky author dataset.
The two classes I’m proposing are:
ConcatTextDatasetFromDataFrames(torchtext.data.Dataset)
… and
LanguageModelDataFromDataFrames()
Works just like the lesson-4-imdb notebook but with dataframes.
@jeremy I coded things so as not to break anything, but may I suggest modifying the LanguageModelData class to simply expose two class methods, from_dfs and from_text_files, to build the ModelData object.
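To make the suggestion concrete, here is a minimal sketch of the class-method pattern. It is not the fastai code or what is in the pull request: LanguageModelDataSketch is a hypothetical stand-in, and the parameters (text_col, pattern, the directory arguments) are assumptions for illustration. Only the method names from_dfs and from_text_files come from the proposal above.

```python
from pathlib import Path


class LanguageModelDataSketch:
    """Hypothetical stand-in for LanguageModelData (not the fastai class)."""

    def __init__(self, trn_texts, val_texts):
        # Both constructors below end up here with plain lists of strings.
        self.trn_texts = list(trn_texts)
        self.val_texts = list(val_texts)

    @classmethod
    def from_dfs(cls, trn_df, val_df, text_col):
        # Assumes each argument can be indexed by column name, as a pandas
        # DataFrame can; a dict of lists works the same way.
        return cls(trn_df[text_col], val_df[text_col])

    @classmethod
    def from_text_files(cls, trn_dir, val_dir, pattern="*.txt"):
        # Assumes one document per file; files are read in sorted order.
        def read_all(folder):
            paths = sorted(Path(folder).glob(pattern))
            return [p.read_text(encoding="utf-8") for p in paths]

        return cls(read_all(trn_dir), read_all(val_dir))


# Usage with dict-of-lists stand-ins for dataframes:
trn = {"text": ["It was a dark night.", "The raven spoke."]}
val = {"text": ["A door creaked."]}
md = LanguageModelDataSketch.from_dfs(trn, val, text_col="text")
print(len(md.trn_texts), len(md.val_texts))
```

The appeal of this layout is that both entry points funnel into one __init__, so existing callers keep working and a new data source only needs one more class method.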