Using ULMFiT to train language models

I have been trying to train a language model on my own text data. In particular, I wanted to replicate the project done here, where someone used a neural network to generate Harry Potter text after training it on all of the novels. I wanted to see if I could get better results using ULMFiT.

However, I don't quite understand how to go about it. Based on the course videos, training starts from the pre-trained ULMFiT model, which is fine, but I can't seem to get my text data into a TextDataBunch.

The video assumes that your data is in CSV format, but mine is one large plain-text file. I've been looking through the documentation and can't find any method on TextDataBunch that would load a plain-text file.

Converting the .txt file to CSV fails because the resulting file is no longer readable. Putting the file into a folder and using the from_folder method doesn't work either: the folder is read, but nothing inside it gets loaded. I also tried tokenizing the whole corpus with nltk and then loading the data with from_tokens. That seems to work until I create a learner and run lr_find, at which point I get the following error:

```
Exception: [Errno 20] Not a directory: 'hp_input.txt/models'
Can't write to 'hp_input.txt/models', set learn.model_dir attribute in Learner to a full libpath path that is writable
```
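For reference, here is roughly the code I'm running (reconstructed from memory, so the argument order and names may not be exact):

```python
import nltk
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

nltk.download('punkt')  # tokenizer models, only needed once

# Tokenize the whole corpus with nltk
raw = open('hp_input.txt', encoding='utf-8').read()
tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(raw)]

# 90/10 train/validation split
cut = int(0.9 * len(tokens))

data = TextLMDataBunch.from_tokens(
    'hp_input.txt',                            # path argument
    tokens[:cut], [0] * cut,                   # train tokens + dummy labels
    tokens[cut:], [0] * (len(tokens) - cut))   # valid tokens + dummy labels

learn = language_model_learner(data, AWD_LSTM)
learn.lr_find()   # <-- the exception is raised here
```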

Looking at the error, it sounds like fastai expects the data to be laid out in an ImageNet-style directory format. Is this correct? Does anyone have any idea how I should be going about this?

Thanks so much for your help!

Did you figure this out? If so, please let me know. I am dealing with plain-text articles, and I am thinking of writing a script that puts each sentence of the text into its own cell of a CSV file.
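If it helps, here is a quick sketch of that idea (nltk for the sentence splitting; the file names and column layout are just placeholders):

```python
import nltk
import pandas as pd
from fastai.text import TextLMDataBunch

nltk.download('punkt')  # sentence-tokenizer models, only needed once

# One sentence per row of the CSV
raw = open('articles.txt', encoding='utf-8').read()
df = pd.DataFrame({'label': 0, 'text': nltk.sent_tokenize(raw)})
df.to_csv('articles.csv', index=False)

# A language model ignores the labels, so the dummy 0 column is fine
data_lm = TextLMDataBunch.from_csv('.', 'articles.csv',
                                   text_cols='text', label_cols='label')
```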

Just use pretrained=False when loading the architecture, and feed it your own data!
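For a plain-text corpus, something along these lines should work (a minimal sketch using fastai v1's data block API; the directory name, batch size, and drop_mult are placeholders):

```python
from fastai.text import TextList, language_model_learner, AWD_LSTM

# A writable *directory* containing your .txt file(s);
# fastai will save models under data/models
path = 'data/'

data_lm = (TextList.from_folder(path)   # picks up the .txt files inside
           .split_by_rand_pct(0.1)      # hold out 10% for validation
           .label_for_lm()              # a language model labels itself
           .databunch(bs=64))

learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.3)
learn.lr_find()
```

Note that the path you pass in has to be a directory, not the .txt file itself, which would also explain the 'hp_input.txt/models' error above. The same pipeline with pretrained=True (the default) gives you the usual ULMFiT fine-tuning instead of training from scratch.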
