Hello all! I’ve gone through some of the basic examples in the book/course, and I thought I’d try building a text classifier instead of an image classifier as my first example project - it’s brought up a few questions I was hoping people could help with! Apologies if they’re silly.
I’m trying to do sentiment analysis on tweets to large UK accounts, and pick out which accounts stand out as having traditionally “right wing” sentiment based on their bios - e.g. very pro-Brexit, anti-immigration. I’ve picked a few hundred tweets, hand-classified them, and used that as my training set.
Code here in case anybody wants to browse my mess:
training - https://colab.research.google.com/drive/1u6LGPk39ixWWqPrxSa6FP-eEYIk-aKFQ?usp=sharing
predicting and widgets - https://colab.research.google.com/drive/1Q60CXxckG96JRlvTEy1zIVV_d2F6iGHT?usp=sharing
It looks like TextDataLoaders does a lot of work “under the hood” to prepare your text - when I ask it to show_batch, I get a whole bunch of xx tokens (xxbos, xxmaj and so on), but the book’s NLP chapter seems to suggest we should be doing some of that work manually (tokenizing, building a vocab, etc.). Can someone explain how much pre-processing it does that I don’t see, and how much I need to do myself? The tutorial example doesn’t seem to do any at all, but the book chapter does loads!
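For reference, here’s roughly how I’m building my DataLoaders (the CSV filename and column names are from my own notebook, so treat them as placeholders):

```python
import pandas as pd
from fastai.text.all import *

# my hand-labelled tweets: a column of raw tweet text plus a label column
df = pd.read_csv('tweets_labelled.csv')  # placeholder filename/columns

# from_df appears to handle tokenisation and numericalisation itself -
# no manual Tokenizer/Numericalize steps like in the book chapter
dls = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                              valid_pct=0.2, seed=42)

# the xx tokens (xxbos, xxmaj, ...) are fastai's special tokens,
# added automatically during tokenisation
dls.show_batch(max_n=3)
```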
When it comes to making predictions later, are those pre-processing steps then stored in my learner? E.g. if I ask my learner to predict on plain text, will that work as anticipated?
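This is the kind of call I mean - my reading of the docs is that predict re-applies the DataLoaders transforms to the raw string, but I’d love confirmation:

```python
# passing raw, untokenised text straight in
pred_class, pred_idx, probs = learn.predict("We need to take back control of our borders")
print(pred_class, probs)  # decoded label, class index tensor, per-class probabilities
```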
In terms of training my model… I’m still not very clear on what the training output actually means. Say I’ve run:
learn.fine_tune(4, 1e-2)
every epoch, both my training loss and validation loss decrease, and my accuracy goes up. Given that, is there any downside to just continuing to train? Do I risk over-fitting on “real” data that isn’t in my validation set?
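In case it helps frame the question: I wondered whether guarding against this with callbacks (cribbed from the callback docs) is the right approach - the patience value here is just a guess on my part:

```python
# stop when valid_loss hasn't improved for two epochs in a row,
# and keep the weights from the best epoch seen so far
learn.fine_tune(10, 1e-2,
                cbs=[EarlyStoppingCallback(monitor='valid_loss', patience=2),
                     SaveModelCallback(monitor='valid_loss')])
```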
Finally, when it comes to making predictions, is there any way of varying the “threshold” at which the model makes a classification? I’d like it to be a little more flexible, but couldn’t spot anything in the docs.
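The closest I’ve got is pulling the raw probabilities out of predict and applying my own cutoff - is this the intended approach, or is there a built-in option I’ve missed? (the label name, index and 0.7 cutoff below are just from my own setup):

```python
pred_class, pred_idx, probs = learn.predict(tweet_text)

# probs has one probability per class; I checked the class order by
# printing learn.dls.vocab, and my 'right_wing' label came second
is_right_wing = probs[1] > 0.7  # 0.7 is an arbitrary cutoff, not a fastai default
```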