How to predict on large unseen tabular data with a trained model?

max666666 · April 24, 2019, 6:42am

I have trained a model on a tabular training data set, and now I want to use it to make predictions on new data. How should I do that?
I read the unseen data with test = pd.read_csv(), but calling model.predict(test) doesn’t seem to work.
I also tried a for-loop: res = [learn.predict(test.iloc[i])[2] for i in range(test.shape[0])].
It runs correctly, but it is far too SLOW!
I have more than seventy thousand unseen rows to predict, and the for-loop takes more than six hours.
So how can I predict a large amount of unseen data quickly?
Please help me, thanks!

Pak · April 24, 2019, 7:33am

I assume that there is a proper way to do that, probably involving something like .add_test, and making sure results wouldn’t be shuffled, but I ended up writing my own functions for that.

You can get predictions with get_cust_preds() from there

The only major thing is that you should split the creation of the data object into 2 phases (otherwise it’s impossible to get the normalisation parameters that were used).
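
To illustrate the two-phase idea above, here is a minimal sketch that assumes plain NumPy arrays instead of fastai data objects. The helper names fit_norm_stats and apply_norm are hypothetical; they are not part of fastai or of the get_cust_preds() code mentioned above. The point is only that the mean and standard deviation are computed once from the training data and then reused unchanged on the unseen data.

```python
import numpy as np

def fit_norm_stats(train_conts):
    """Phase 1: compute per-column mean and std from the training data only."""
    mean = train_conts.mean(axis=0)
    std = train_conts.std(axis=0)
    # Replace zero std (constant columns) with 1 to avoid division by zero
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_norm(conts, mean, std):
    """Phase 2: normalise any data (train or unseen) with the stored statistics."""
    return (conts - mean) / std

# Toy data, made up for illustration
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
unseen = np.array([[3.0, 30.0], [7.0, 70.0]])

mean, std = fit_norm_stats(train)            # statistics come from train only
unseen_norm = apply_norm(unseen, mean, std)  # unseen rows reuse them as-is
```

If the statistics were recomputed on the unseen data instead, the model would see inputs on a different scale from the one it was trained on.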

max666666 · April 24, 2019, 11:25am

thank you very much!

Poltigo · November 15, 2019, 6:17am

I had the same issue; here is an elegant solution

gilf · October 29, 2020, 9:30am

Sometimes you cannot build the test dataset in advance, because you are getting the test data via an API, like in the Kaggle Riiid competition.

I ended up creating a small function which predicts batches:

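import numpy as np

def predict_batch(self, df):
    # Build a test DataLoader that reuses the training-time preprocessing
    dl = self.dls.test_dl(df)
    # Cast continuous features to float32 before they reach the model
    dl.dataset.conts = dl.dataset.conts.astype(np.float32)
    # Only preds is used; the inputs and decoded predictions are discarded
    inp,preds,_,dec_preds = self.get_preds(dl=dl, with_input=True, with_decoded=True)
    return preds.numpy()

# Attached to the instance, not the class, so learn must be passed explicitly when calling
setattr(learn, 'predict_batch', predict_batch)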

This function can be used like this:

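%%time
from sklearn.metrics import roc_auc_score

sample_size = 2_000_000
# learn is passed twice: predict_batch is a plain function stored on the instance
preds = learn.predict_batch(learn, X[features].iloc[:sample_size])
roc_auc_score(X[target][:sample_size].values, preds)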

I just wonder why such a function is not in the fast.ai API. Keras, for example, supports this functionality out of the box.
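
When the data arrives in pieces, or is too large to score in one call, the same batch idea can be sketched independently of any framework. This is only an illustration under stated assumptions: predict_in_chunks and toy_model are hypothetical names, and toy_model stands in for whatever batch prediction call is available (for example a function like predict_batch above).

```python
import numpy as np

def predict_in_chunks(predict_fn, data, chunk_size=1024):
    """Call predict_fn on consecutive row slices and concatenate the results in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    outputs = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]  # rows start .. start+chunk_size-1
        outputs.append(np.asarray(predict_fn(chunk)))
    if not outputs:
        return np.empty((0,))
    # Chunks were processed in order, so row order is preserved
    return np.concatenate(outputs)

def toy_model(rows):
    # Hypothetical stand-in for a real model: one score per row
    return rows.sum(axis=1)

X = np.arange(20, dtype=np.float32).reshape(10, 2)
preds = predict_in_chunks(toy_model, X, chunk_size=4)
```

Because each chunk is scored with one vectorised call, the per-row overhead of a Python loop over single rows is avoided, while memory use stays bounded by chunk_size.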