Share your work here ✅

agr · November 23, 2018, 9:43pm

I built a language model on 26 works of fiction from Project Gutenberg. Here’s the notebook.

To train the model, I used paragraphs of text as observational units. Then I looked at how well the model was able to determine which work of fiction each paragraph in the validation set belonged to. There was some class imbalance, so the final accuracy score of 71% may be a bit misleading on its own. But looking at the confusion matrix, it does seem like the model performs well above chance, even for under-represented texts. The model's most frequent error was attributing a paragraph to a different text written by the same author (e.g. one Charles Dickens book sounds a lot like another).
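To make the class-imbalance caveat concrete, here is a minimal sketch of how per-class recall and a majority-class baseline can be read off a confusion matrix. The 3x3 matrix below is made up for illustration (it is not taken from the notebook), and the helper names are hypothetical; rows are assumed to be actual classes and columns predicted classes.

```python
def per_class_recall(cm):
    """Recall per class: diagonal count divided by the row total (actual examples)."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


def overall_accuracy(cm):
    """Fraction of all examples that land on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def majority_baseline(cm):
    """Accuracy of always guessing the most frequent actual class."""
    total = sum(sum(row) for row in cm)
    return max(sum(row) for row in cm) / total


# Hypothetical confusion matrix: two large classes and one under-represented class.
cm = [
    [80, 15, 5],
    [20, 70, 10],
    [2, 3, 5],
]

recalls = per_class_recall(cm)
accuracy = overall_accuracy(cm)
baseline = majority_baseline(cm)
balanced_accuracy = sum(recalls) / len(recalls)  # unweighted mean of per-class recall

print("per-class recall:", recalls)
print("accuracy:", round(accuracy, 3))
print("majority-class baseline:", round(baseline, 3))
print("balanced accuracy:", round(balanced_accuracy, 3))
```

Comparing each class's recall against the majority-class baseline (rather than only the headline accuracy) is one way to check whether the small classes are also being predicted better than chance.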

Two handy coding tips I learned:

plot_confusion_matrix() draws onto the current matplotlib figure, which can then be manipulated directly through plt (for relabeling ticks and axes, for example). This is nice because the method itself doesn't return a figure or axes object, but once you know that plt's current figure is the confusion matrix plot, you can adjust it as needed.
The ClassificationInterpretation class in fastai.vision (which includes most_confused, plot_confusion_matrix, etc.) can be used for text classifiers, too. Super handy.
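The first tip relies on matplotlib's "current figure" behavior rather than anything specific to fastai. Here is a minimal sketch using plain matplotlib only: draw_confusion_matrix is a hypothetical stand-in for a plotting method that draws and returns nothing, and the matrix and labels are made up.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def draw_confusion_matrix(cm):
    """Hypothetical stand-in: draws on a new current figure and returns None."""
    plt.figure(figsize=(4, 4))
    plt.imshow(cm, cmap="Blues")
    plt.title("Confusion matrix")


# Made-up data for illustration.
cm = [[80, 15, 5], [20, 70, 10], [2, 3, 5]]
labels = ["Book A", "Book B", "Book C"]

result = draw_confusion_matrix(cm)  # nothing is returned...

# ...but the figure it drew is still matplotlib's current figure,
# so plt calls made afterwards apply to that plot.
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.xlabel("Predicted text")
plt.ylabel("Actual text")

ax = plt.gca()  # the axes that the stand-in function drew on
print(ax.get_xlabel(), "/", ax.get_ylabel())
```

In a notebook, the adjustments need to happen in the same cell as the plotting call, since the inline backend typically closes the figure at the end of the cell.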

Apologies if these two tips were already noted in class, I'm a bit behind on my lectures.

Happy holidays to all!