NLP prediction - predict batch

Hey there,

I trained an NLP model with my own data and classes (116 classes with 94% accuracy on validation out of the box!). I’m using Fastai 1.0.39.

I want to get a dataframe for the test set with the original data (text fields before the transformations) along with the labels and the predictions. Here are the options that I tried:

Option 1
Using learn.predict on each row. This works but is extremely slow on large datasets, since it runs one forward pass per item. It would be better to load and run on batches…
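For reference, a minimal sketch of this per-row approach (learn and the 'text' column name are illustrative; adjust to your own learner and DEFAULT_TEXT_COLS):

from fastai.text import *   # fastai 1.0.x

pred_classes = []
for text in test_df['text']:                                # 'text' is a hypothetical column name
    pred_class, pred_idx, pred_probs = learn.predict(text)  # one forward pass per row
    pred_classes.append(str(pred_class))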

Option 2
Using learn.get_preds(DatasetType.Test).
Here is an excerpt of the code I tried:

# Rebuild a DataBunch with the same vocab and classes as the original one
data_pred = TextDataBunch.from_df(
  path, train_df, valid_df, test_df,
  vocab=data_clas.vocab, classes=data_clas.classes,
  text_cols=DEFAULT_TEXT_COLS, label_cols=DEFAULT_LABEL_COL,
)
classifier = text_classifier_learner(data_pred, drop_mult=0.5)
classifier.load_encoder('fine_tuned_enc')   # encoder saved from the fine-tuned language model
classifier.load('my_classifier')            # trained classifier weights

preds = classifier.get_preds(DatasetType.Test)

Where data_clas is the DataBunch used for creating the classifier in the first place.
There seems to be a bug when running on the test set: all I get from the code above is the same prediction for every line :slightly_frowning_face:
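For reference (my understanding of the v1 API), get_preds returns a list of [predictions, targets]; inspecting the predictions is what made the problem visible:

pred_probs, y = preds                      # class probabilities and targets
print(pred_probs.argmax(dim=1).unique())   # collapses to a single class here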

Option 3
Since running on the test set didn’t work for me, I tried passing my test_df as the training set and running learn.get_preds(DatasetType.Train), like so:

# Note: test_df is passed in the train_df slot; there is no real test set here
data_pred = TextDataBunch.from_df(
  path, test_df, valid_df, test_df=None,
  vocab=data_clas.vocab, classes=data_clas.classes,
  text_cols=DEFAULT_TEXT_COLS, label_cols=DEFAULT_LABEL_COL,
)
classifier = text_classifier_learner(data_pred, drop_mult=0.5)
classifier.load_encoder('fine_tuned_enc')
classifier.load('my_classifier')

preds = classifier.get_preds(DatasetType.Train, ordered=True)

Where, again, data_clas is the DataBunch used for creating the classifier in the first place.
Now I am getting what look like correct predictions, but in shuffled order; the ordered=True parameter doesn’t seem to have any effect. I’m fairly confident the predictions themselves are right: even though the labels and predictions don’t line up row by row, their histograms match.
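A hedged workaround sketch for the ordering problem (it assumes the train sampler draws its shuffle from NumPy’s global RNG, so seeding identically before prediction and before re-iterating the sampler reproduces the same order):

import numpy as np

np.random.seed(42)                          # fix the sampler's shuffle
preds, y = classifier.get_preds(DatasetType.Train)
np.random.seed(42)                          # same seed: re-derive the same order
order = [i for i in classifier.data.train_dl.sampler]
reverse = np.argsort(order)
preds, y = preds[reverse], y[reverse]       # back to original row order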

Option 4
This worked for me, but it is very cumbersome: I iterated over the DataLoader manually, used learn.pred_batch, and then combined everything together.

from fastprogress import progress_bar

orig_classes, pred_classes = [], []
for b in progress_bar(data_pred.train_dl):
    pred_probs = classifier.pred_batch(batch=b)   # probabilities for one batch
    orig_classes += [data_clas.classes[x] for x in b[1]]                  # decode true labels
    pred_classes += [data_clas.classes[x.argmax()] for x in pred_probs]   # decode predicted classes
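One caveat: the train DataLoader shuffles, so these pairs line up with each other but not with test_df’s original row order. A small illustrative follow-up (pairs_df is a made-up name) that at least lets you score the run:

import pandas as pd

# Paired per batch, but NOT in test_df's row order because of shuffling
pairs_df = pd.DataFrame({'label': orig_classes, 'prediction': pred_classes})
print((pairs_df['label'] == pairs_df['prediction']).mean())   # overall accuracy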

So, what is the right way to go? How can I go over a dataset of 1M entries efficiently?


If anyone is watching this, the solution is to use

classifier.get_preds(DatasetType.Fix)

The Fix dataset is the training dataset without shuffling. See this related post.
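To tie it back to the original goal, a minimal sketch (results_df is an illustrative name) that builds the dataframe of original text, label, and prediction, assuming data_pred was built as in Option 3 with test_df in the training slot:

pred_probs, true_labels = classifier.get_preds(DatasetType.Fix)   # fixed order: rows match test_df

results_df = test_df.copy()
results_df['label'] = [data_clas.classes[i] for i in true_labels.tolist()]
results_df['prediction'] = [data_clas.classes[i] for i in pred_probs.argmax(dim=1).tolist()]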
