Classify multiple texts

Hello,
I trained a classifier model following the NLP tutorial from chapter 10.
Now I want to predict and save the label for over 1,000,000 texts. Right now I am doing this for every single text with the load.predict(inputstring) function. It takes too much time.
Can I optimize the performance?

Hi Dominik,

You could use batch prediction like this:

test_dl = learn.dls.test_dl(test_df)
preds = learn.get_preds(dl=test_dl)

Note that this rearranges the order of the input texts behind the scenes, so you’ll need to reorder the predictions using .get_idxs() on the test dataloader.

Or you can have a look at fastinference, which should speed up learn.predict() quite a bit.
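
For the reordering step mentioned above, here is a minimal sketch in plain Python. It assumes that the i-th entry of the index list is the original position of the i-th prediction, which is how I understand .get_idxs(); restore_order and the sample data are made up for illustration and are not part of fastai. Depending on your fastai version, get_preds may already restore the order for you, so check a few known examples before applying this.

```python
def restore_order(preds, idxs):
    """Rearrange preds so that result[idxs[i]] == preds[i].

    Assumption: idxs[i] is the original position of the item whose
    prediction ended up at position i in preds.
    """
    if len(preds) != len(idxs):
        raise ValueError("preds and idxs must have the same length")
    ordered = [None] * len(preds)
    for pred, original_pos in zip(preds, idxs):
        ordered[original_pos] = pred
    return ordered


# Hypothetical data: the dataloader processed texts 2, 0, 1 in that order.
shuffled_preds = ["label_for_text2", "label_for_text0", "label_for_text1"]
idxs = [2, 0, 1]

ordered = restore_order(shuffled_preds, idxs)
print(ordered)
```

With real output you would pass the flattened prediction list and the result of test_dl.get_idxs() instead of the sample lists.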


Hey thanks for your answer.
Am I doing something wrong?

#Get the predictions

test_dl = learner.dls.test_dl(keys) #Load Stringlist
a,b, classificationResult = learner.get_preds(dl=test_dl, with_decoded=True)
predList = [element.item() for element in isSecRelevant.flatten()]
indices = test_dl.get_idxs()
indexRelPair = list(zip(predList, indices))

It predicts on the dataframe really fast, but when I compare some results on the validation set, they are not as good as with the learn.predict function.

Doesn’t look like you’re doing anything wrong here. But from the code it’s not visible where isSecRelevant comes from. Can you post how it is created?

Just to make sure I understand correctly: if you use that code block on your validation set, you get worse results than when using learn.predict? If the results are much worse, there is probably still a problem with sorting somewhere.
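
One way to test the sorting theory is to compare both sets of labels on a small sample. This is only a sketch in plain Python: compare_predictions and the sample labels are made up for illustration. The idea is that if both methods produce the same label counts but disagree at many positions, an ordering mix-up is a plausible cause, though it is a hint and not proof.

```python
from collections import Counter


def compare_predictions(batch_labels, single_labels):
    """Return (mismatch positions, whether the label counts are identical)."""
    if len(batch_labels) != len(single_labels):
        raise ValueError("both label lists must have the same length")
    mismatches = [
        i
        for i, (b, s) in enumerate(zip(batch_labels, single_labels))
        if b != s
    ]
    same_counts = Counter(batch_labels) == Counter(single_labels)
    return mismatches, same_counts


# Hypothetical labels for five validation texts.
single = [0, 1, 1, 0, 1]  # one learn.predict call per text
batch = [1, 0, 1, 1, 0]   # batch prediction, possibly in a different order

mismatches, same_counts = compare_predictions(batch, single)
print("positions that differ:", mismatches)
print("same label counts:", same_counts)
```

If the counts differ as well, the problem is more likely in how the predictions themselves are produced or decoded than in the ordering.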