Text Classification and pre-formatting

Hello all! I’ve gone through some of the basic examples in the book/course, and I thought I’d try building a text classifier instead of an image classifier as my first example project. It’s brought up a few questions I was hoping people could help with! Apologies if they’re silly.

I’ve tried to do sentiment analysis on tweets sent to large UK accounts, and to pick out which authors stand out as having traditional “right wing” sentiment based on their bios, e.g. very pro-Brexit, anti-immigrant. I’ve picked a few hundred tweets, hand-classified them and used them as my training set.

Code here in case anybody wants to browse my mess:
training -
https://colab.research.google.com/drive/1u6LGPk39ixWWqPrxSa6FP-eEYIk-aKFQ?usp=sharing

predicting and widgets - https://colab.research.google.com/drive/1Q60CXxckG96JRlvTEy1zIVV_d2F6iGHT?usp=sharing

It looks like TextDataLoaders does lots of work “under the hood” to prepare your text - when I ask it to show_batch, I get a whole bunch of xs, and the book NLP example seems to suggest we should be doing some of that work manually (tokenizing, building vocabs, etc). Can someone explain how much pre-work it does that I don’t see, and how much I need to do myself? The tutorial example doesn’t seem to do any at all, but the book chapter does loads!

When it comes to doing predictions later, are those “pre-processing” steps then stored in my learner? Eg, if I ask my learner to predict on plain text, will that work as anticipated?

In terms of training my model…I’m still not very clear on what the training process output actually means. So say I’ve run:
learn.fine_tune(4, 1e-2)

Every epoch, both my training loss and valid loss are decreasing, and my accuracy goes up. Given that, is there any downside to just continuing to train? Do I risk over-fitting on “real” data that isn’t in my validation set?

Finally, when it comes to making predictions, is there any way of varying the “threshold” at which the model makes a classification? I want it to be a little more flexible, but couldn’t spot it in the docs.

Hey @AndreasThinks!

Sharing the TextDataLoaders docs here in case you haven’t seen them: Text data – fastai

The book breaks down the key pre-processing elements to show you what needs to be done; TextDataLoaders is a convenience class that does it all for you. Once you get more familiar with NLP you can break out and modify the various processing steps if you like. You can see the source code here if you’re curious about what it’s doing under the hood.
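To make those steps a bit more concrete, here is a toy sketch in plain Python of the two main things that happen before text reaches the model: tokenizing and numericalizing. It is not fastai’s implementation. The special tokens (xxbos, xxmaj, xxunk) are the ones fastai uses and are probably the “xs” showing up in show_batch, but the splitting rules, the function names and the example tweets are simplified assumptions of mine.

```python
from collections import Counter

# Special tokens in the style fastai uses; the rules below are heavily simplified.
BOS, MAJ, UNK = "xxbos", "xxmaj", "xxunk"

def tokenize(text):
    """Split on whitespace, mark the start of the text and capitalised words."""
    tokens = [BOS]
    for word in text.split():
        if word[0].isupper():
            tokens.append(MAJ)       # flag the capital, then store the word lowercased
        tokens.append(word.lower())
    return tokens

def build_vocab(token_lists, min_freq=1):
    """Collect every token seen at least min_freq times; index 0 is the unknown token."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [UNK] + sorted(t for t, c in counts.items() if c >= min_freq and t != UNK)

def numericalize(tokens, vocab):
    """Map tokens to integer ids, falling back to the unknown token's id."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, index[UNK]) for tok in tokens]

# Made-up training texts, only for illustration.
train_texts = ["Brexit is great", "I love tea"]
vocab = build_vocab([tokenize(t) for t in train_texts])

# A new text is processed with the vocab built from the training data.
new_ids = numericalize(tokenize("Hello tea"), vocab)
print(tokenize("Hello tea"), new_ids)
```

The point of the sketch is that the vocab is built once from the training data and then reused for any new text, which is why the preprocessing has to travel with the dataloaders rather than being redone from scratch at prediction time.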

The preprocessing steps are stored in your dataloaders, which you should be able to access with learner.dls.

You can pass new text documents to your learner as a test dataloader (test_dl), which you can then run inference on.

A general rule of thumb is to continue training until your validation loss starts to go up, which is a sign that the model is beginning to memorize the training set rather than learn patterns that generalize to the validation set. Having said that, if your validation set is very dissimilar to “real world” data then there is a risk that you will not get the best results. Creating a strong validation set is critical to properly assessing your model’s performance.
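As a rough illustration of that rule of thumb, here is a small plain-Python sketch that takes the validation loss from each epoch and picks the epoch to keep, stopping once the loss has failed to improve a few times in a row. The function name, the patience value and the loss numbers are all made up for the example. fastai has tracker callbacks such as EarlyStoppingCallback in the same spirit, but this sketch does not use them.

```python
def best_epoch(valid_losses, patience=2):
    """Return the index of the epoch with the lowest validation loss seen
    before the loss failed to improve for `patience` consecutive epochs."""
    best_idx, best_loss, waited = 0, float("inf"), 0
    for i, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_idx, best_loss, waited = i, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stopped improving; stop here
    return best_idx

# Made-up validation losses, one per epoch.
valid_losses = [0.69, 0.61, 0.55, 0.53, 0.56, 0.60, 0.52]
print(best_epoch(valid_losses))
```

With a small patience the late dip at the end is never reached, which is the usual trade-off: a larger patience tolerates noisy epochs but costs more training time.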

You would be looking at your metric choice here. For example, if you set the problem up as multi-label (each of left, centre, right-wing predicted independently), you can use the accuracy_multi metric, which has a thresh option. If you want your model to tell you when it’s unsure (e.g. predict as “unknown”) then you’ll find some good discussion below:

here: Handle data that belongs to classes not seen in training or testing - #38 by mrfabulous1

and here: Lesson 9 Discussion & Wiki (2019) - #511 by jcatanza
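On the threshold question more generally, one simple option is to apply your own cut-off to the predicted probabilities after inference. The sketch below is plain Python and assumes you already have one probability per class as floats (with fastai, I believe the third item returned by learn.predict is the probability tensor). The label names, function names and threshold values are hypothetical.

```python
def classify(probs, labels, thresh=0.5, fallback="unknown"):
    """Return the most likely label, or `fallback` if its probability is below thresh."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= thresh else fallback

def is_flagged(probs, labels, target, thresh=0.5):
    """True if the probability of one specific label reaches thresh,
    even when it is not the single most likely label."""
    return probs[labels.index(target)] >= thresh

labels = ["other", "right_wing"]  # hypothetical label names

print(classify([0.45, 0.55], labels, thresh=0.7))                  # not confident enough
print(classify([0.10, 0.90], labels, thresh=0.7))                  # confident
print(is_flagged([0.60, 0.40], labels, "right_wing", thresh=0.3))  # looser cut-off
```

Raising thresh in classify makes the model abstain more often, while lowering thresh in is_flagged makes it flag a class more readily; which one you want depends on whether false positives or false negatives cost you more.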

Wow, that was hugely helpful. Thank you!

Hi, I built tryramen.com to classify text using LLM AI without model training or pre-labelled data.

Got ideas on how I can make it better?