I have been trying to train a language model on my own text data. In particular, I wanted to replicate the project done here, where this person used a neural network to generate Harry Potter text after training it on all of the novels. I wanted to see if I could get better results using ULMFiT.
However, I don’t seem to understand how to go about it. Based on the course videos, training starts from the pre-trained ULMFiT model, which is fine, but I can’t seem to successfully get my text data into a TextDataBunch.
The video assumes that your data is in CSV format, but I have a single large plaintext file. I’ve been looking through the documentation and I can’t find any method on TextDataBunch that would let me load a plaintext file directly.
Simply converting the .txt file to CSV fails because the result is no longer parsed correctly. Putting the file into a folder and using the `from_folder` method doesn’t work either: the folder itself is read, but none of its contents come through. I also tried tokenizing the whole corpus with nltk and then loading the data with `from_tokens`. That seems to work until I create a learner and run `lr_find`, at which point I get the following error:
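For what it’s worth, here is the kind of conversion I was attempting. This is only a sketch under my own assumptions (the `chunk_size` and file names are mine): split the raw text into fixed-size chunks and write them as a one-column CSV, which the `from_csv` loader should then accept. Using the `csv` module matters here, since it quotes chunks containing commas and newlines, which is presumably why my naive .txt-to-CSV conversion produced an unreadable file.

```python
import csv
from pathlib import Path

def txt_to_csv(txt_path, csv_path, chunk_size=2000):
    """Split a plaintext file into fixed-size chunks and write them as
    rows of a one-column CSV (header: "text"). chunk_size is arbitrary."""
    raw = Path(txt_path).read_text(encoding="utf-8")
    chunks = [raw[i:i + chunk_size] for i in range(0, len(raw), chunk_size)]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)          # handles quoting of commas/newlines
        writer.writerow(["text"])       # single text column, no labels
        for chunk in chunks:
            writer.writerow([chunk])
    return len(chunks)

# e.g. txt_to_csv("hp_input.txt", "hp_input.csv"), then load the CSV
# with the databunch's from_csv method.
```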
```
Exception: [Errno 20] Not a directory: 'hp_input.txt/models'
Can't write to 'hp_input.txt/models', set `learn.model_dir` attribute in Learner to a full libpath path that is writable
```
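If I read the error right, the learner is trying to create a `models` folder inside the path it was given, and that path seems to point at the file itself rather than at a directory. A small pathlib sketch of what I think is going on (the fastai attribute names at the bottom are my guess from the error message, not something I have verified):

```python
from pathlib import Path

# Why the error happens: the databunch's path ended up pointing at the
# file itself, so the learner tries to create a "models" folder *inside*
# a plain file -- hence "[Errno 20] Not a directory".
bad_model_dir = Path("hp_input.txt") / "models"

# Pointing at the file's containing directory instead gives a location
# that can actually hold a models/ folder:
good_model_dir = Path("hp_input.txt").parent / "models"

# In fastai terms (my assumption, based on the error message):
#   learn.path = Path(".")      # or pass path="." when building the databunch
#   learn.model_dir = "models"
```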
Looking at the error, it sounds like fast.ai expects the data to be laid out in an ImageNet-style folder structure. Is that correct? Does anyone have any idea how I should be going about this?
Thanks so much for your help!