Text_classifier_learner predict all rows of a TextClasDataBunch

dori · November 19, 2018, 8:47am

Hi,

I’m trying to play on Kaggle using the lesson3-imdb notebook, but I’m confused by the final prediction step.

In lesson3-imdb, we end up with a learn.predict("I really loved that movie, it was awesome!") for a text_classifier_learner trained with a TextClasDataBunch.

I want to apply predict to all the rows of the test dataset of my databunch, but I’m not sure how to do so. The test dataset (test_ds) appears to be of type fastai.data_block.LabelList.

I also noticed that the test_ds object itself has a predict method, which doesn’t make sense.

Any idea how to predict all rows of a test databunch using the text_classifier_learner predict method?

Related doc:

Code:

data = TextClasDataBunch.from_csv('data', 'train.csv', label_cols='target')
data.save()

data = TextClasDataBunch.load('data', bs=50) 
learn = text_classifier_learner(data, drop_mult=0.5)
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(1, max_lr=slice(1e-3, 1e-1), moms=(0.8,0.7))
learn.save('first')

data_clas = TextClasDataBunch.from_csv('data', 'train.csv', test='test.csv', label_cols='target')
# predict ??? following command doesn't work
learn.predict(data_clas.test_ds)

dori · November 19, 2018, 8:59am

Got it with get_preds from the RNNLearner class (text_classifier_learner returns an RNNLearner instance).

y_test = learn.get_preds(data_clas.test_ds)
y_test = y_test[0].argmax(dim=1)
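
For anyone following along, the argmax step can be sketched without fastai. This assumes, as the two lines above imply, that the first element returned by get_preds is a (rows x classes) tensor of class probabilities. The array below is made-up stand-in data, and NumPy stands in for torch.

```python
import numpy as np

# Made-up stand-in for y_test[0]: one row per test example, one column per class.
probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
])

# Index of the highest-probability class in each row
# (the same idea as argmax(dim=1) on a torch tensor).
pred_classes = probs.argmax(axis=1)
print(pred_classes.tolist())
```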

But now, I’m confused about how to save the results to csv.
Before, I’d join numpy & pandas arrays before saving them to csv.
Now, I have one pandas Series (my text column for uid), and one torch Tensor with the predictions.
It looks like tensors only work with numbers, so I cannot add my uid column in the tensor.
Is there no better way to deal with this than converting the tensor to a Python list, then converting that list to a pandas Series, so that I can build a pandas DataFrame from the two Series?

shoof · November 22, 2018, 4:52pm

Thanks for sharing this. The current docs don’t seem to cover this simple yet common use case. From what I remember, learn.get_preds() returns a list of two tensors in your case (predictions first, then targets), so once you have taken the argmax as in your snippet, you can add the resulting y_test to your test dataframe directly as a pandas Series, with no need to convert the tensor to a Python list.

Assuming the test dataframe is called df_test, then

df_test = df_test.assign(prediction=pd.Series(y_test.numpy()))

would add a new column called prediction with the results; you then just need to save df_test, with the required column names, to CSV using to_csv with index=False.
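
A minimal sketch of that last step, assuming the predictions have already been converted to a NumPy array (for a CPU torch tensor, calling .numpy() on it does this). The column names uid and prediction, the sample rows, and the file name mentioned in the comment are all hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical test dataframe with an id column.
df_test = pd.DataFrame({"uid": ["a1", "b2", "c3"]})

# Stand-in for the predicted class indices, e.g. y_test.numpy() for a CPU tensor.
preds = np.array([0, 1, 1])

# Add the predictions as a new column; a plain array is matched by position.
df_test = df_test.assign(prediction=preds)

# Write without the pandas index. A file path such as "submission.csv"
# can be passed instead of the in-memory buffer used here.
buffer = io.StringIO()
df_test.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing the array itself rather than wrapping it in pd.Series avoids index alignment: a Series is matched to df_test by index label, which only lines up if df_test has the default 0..n-1 index.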