Text Classification and pre-formatting

morgan · September 7, 2020, 1:57pm

Sharing the TextDataloader docs here in case you haven’t seen them: Text data – fastai

The book breaks down the key pre-processing elements to show you what needs to be done, TextDataloaders is a convenient class that does it all for you. Once you get more familiar with NLP you can break out and modify the various processing it does if you like. You can see the source code here if you’re curious about what its doing under the hood.

The preprocessing steps would be stored in your dataloader, you should be able to access the dataloader with learner.dls

You can pass new text documents to your learner as a test dataloader (test_dl), which you can then do inference on

A general rule of thumb is to continue to train until your validation loss starts to go up, meaning that your model is really just memorizing your training set and cannot generalize at all to the validation set. Having said that, if your validation set is very dissimilar to “real world” data then there is a risk that you will not get the best results. Creating a strong validation set is critical to properly assessing your models performance.

You would be looking at your metric choice here, for example if you have multiple classes (e.g. left, centre, right-wing) you can use the accuracy_multi metric which has a thresh option. If you want your model to tell your when its unsure (e.g. predict as “unknown”) then you’ll find some good discussion below:

here: Handle data that belongs to classes not seen in training or testing - #38 by mrfabulous1

and here: Lesson 9 Discussion & Wiki (2019) - #511 by jcatanza