Classify multiple texts

DominikKu · October 15, 2020, 4:21pm

Hello,
I trained a classifier model following the NLP tutorial for chapter 10.
Now I want to predict and save the label for over 1,000,000 texts. Right now I am doing this for every single text with the load.predict(inputstring) function, and it takes too much time.
Can I optimize the performance?

stefan-ai · October 17, 2020, 10:35am

Hi Dominik,

You could use batch prediction like this:

test_dl = learn.dls.test_dl(test_df)
preds = learn.get_preds(dl=test_dl)

Note that this rearranges the order of the input texts behind the scenes, so you’ll need to reorder the predictions using .get_idxs() on the test dataloader.

Or you can have a look at fastinference, which should speed up learn.predict() quite a bit.
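
To illustrate the reordering step mentioned above, here is a minimal sketch in plain Python with no fastai dependency. It assumes that the index list from the dataloader says, for each prediction, which position the corresponding text had in the original input. The helper name `restore_original_order` and the sample data are hypothetical, chosen only for this example.

```python
def restore_original_order(preds, idxs):
    """Put predictions back into the order of the original inputs.

    Assumption: preds[k] belongs to the input that originally sat at
    position idxs[k] (the order in which the dataloader served items).
    """
    if len(preds) != len(idxs):
        raise ValueError("preds and idxs must have the same length")
    restored = [None] * len(preds)
    for pred, original_pos in zip(preds, idxs):
        restored[original_pos] = pred
    return restored


# Hypothetical stand-in data: three texts, served longest-first,
# which mimics a length-sorted dataloader.
texts = ["a much longer text here", "short", "medium text"]
idxs = sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)

# Stand-in "predictions", produced in dataloader order.
preds_in_dl_order = ["label_for:" + texts[i] for i in idxs]

restored = restore_original_order(preds_in_dl_order, idxs)
```

After this step, `restored[i]` lines up with `texts[i]` again, so the labels can be written back next to the original rows.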

DominikKu · October 26, 2020, 9:58am

Hey thanks for your answer.
Am I doing something wrong?

#Get the predictions

test_dl = learner.dls.test_dl(keys)  # keys: list of strings to classify
a, b, classificationResult = learner.get_preds(dl=test_dl, with_decoded=True)
predList = [element.item() for element in isSecRelevant.flatten()]
indices = test_dl.get_idxs()
indexRelPair = list(zip(predList, indices))

It predicts on the dataframe really fast, but when I compare some results on the validation set, they are not as good as with the learn.predict function.

stefan-ai · October 26, 2020, 11:07am

Doesn’t look like you’re doing anything wrong here. But from the code it’s not visible where isSecRelevant comes from. Can you post how it is created?

Just to make sure I understand correctly: if you use that code block on your validation set, you get worse results than when using learn.predict? If the results are much worse, there is probably still a problem with sorting somewhere.
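
One way to narrow down a sorting problem is to compare the batch output against per-text predictions on a small sample, once as returned and once after reordering. The sketch below uses plain Python and made-up labels; `agreement` is a hypothetical helper, and the lists stand in for real results from learn.predict and get_preds. It may also be worth checking whether the installed fastai version already restores the input order inside get_preds (I believe newer releases have a `reorder` argument for this, but treat that as an assumption to verify), because reordering a second time would scramble the labels.

```python
def agreement(candidate, reference):
    """Fraction of positions where two label lists agree."""
    if len(candidate) != len(reference):
        raise ValueError("label lists must have the same length")
    if not reference:
        raise ValueError("need at least one label to compare")
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / len(reference)


# Made-up stand-in data for five texts.
reference = [1, 0, 0, 1, 1]          # labels from per-text prediction
batch_as_returned = [1, 0, 0, 1, 1]  # labels from the batch call, untouched
idxs = [3, 0, 4, 1, 2]               # index list reported by the dataloader

# Apply the index list to the batch output, as in the manual reordering step.
batch_reordered = [None] * len(batch_as_returned)
for label, original_pos in zip(batch_as_returned, idxs):
    batch_reordered[original_pos] = label

score_as_returned = agreement(batch_as_returned, reference)
score_reordered = agreement(batch_reordered, reference)
```

Whichever variant agrees with the per-text labels on the sample is the ordering to use for the full run. In this made-up data the untouched output matches, which is the pattern you would see if the batch call had already reordered its results.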