Learn.predict can only predict for a single row, not for the whole dataframe

Hi everyone, I am in a Kaggle competition that requires loading the test data from a Python generator, so I cannot add the test file when creating the TabularList and DataBunch.

data = (TabularList.from_df(market_train_df, cat_names=cat_names, cont_names=cont_names, procs=procs)
                       .random_split_by_pct()).label_from_df(cols=dep_var).databunch()

I use the above code to create data and learn. Then I use the following code to predict, where y1 is the pandas dataframe produced by the generator.

days = env.get_prediction_days()
for (market_obs_df, news_obs_df, predictions_template_df) in days:
    x1,y1,z1 = predictions_template_df, market_obs_df, news_obs_df
    preds = learn.predict(y1)

However, the code gives the following error:

TypeError                                 Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

TypeError: an integer is required

During handling of the above exception, another exception occurred:
KeyError: 'volume'

I also tried preds = learn.predict(y1.iloc[9:10]), and the same problem occurs.
Interestingly, when I predict a single row on its own, it works.

preds = learn.predict(y1.iloc[9])

I thought it was related to the normalization step in the preprocessing, but since a single row can be predicted on its own, I have no idea what is going wrong.

I am currently running this, but I know there must be a better way.

first=True
for i in range(4):
    pred = learn.predict(y1.iloc[i])[2]
    if first:
        preds = pred
        first = False
    else:
        preds = torch.cat((preds, pred))
preds = preds.view(-1, 2)   

Does anyone have an idea about it? Thanks a lot!
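
As a side note on the workaround loop, the `first` flag can be avoided by collecting each row's output in a list and combining once at the end. The sketch below shows the pattern in plain Python; `predict_row` is a hypothetical stand-in for `learn.predict(row)[2]`, and the sample rows are made up. With real tensors, the final step would be a single `torch.stack(per_row)`, which gives the same (n_rows, 2) layout as `torch.cat` followed by `view(-1, 2)`.

```python
def predict_row(row):
    """Hypothetical stand-in for learn.predict(row)[2]: returns two class scores."""
    score = row["volume"] / 100.0
    return [1.0 - score, score]


# Made-up rows, standing in for y1.iloc[i] for i in range(4).
rows = [{"volume": 10}, {"volume": 25}, {"volume": 50}, {"volume": 75}]

# Collect one output per row; no `first` flag is needed.
per_row = [predict_row(row) for row in rows]

# One entry per row, two scores per entry.
print(len(per_row), len(per_row[0]))
```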


I’m not quite sure I understand your question, but predict is for a single item:

predict can be used to get a single prediction from the trained learner on one specific piece of data you are interested in.

You might be looking for pred_batch which can predict many rows at a time.

Return output of the model on one batch from ds_type dataset.
Note that the number of predictions given equals to the batch size.

Hi yeldarb, thanks for your reply.

I am trying to predict on test data held in a pandas dataframe.
Normally I would use learn.get_preds(), but in this case I cannot,
because the test dataset is generated on the go inside a for loop. In other words, the test dataset changes over time.

So I used the predict function, which is for a single item, but as mentioned above, an error pops up, and after searching for it I still don’t know what goes wrong.

When I use the predict function, it only generates one prediction and I have a lot of predictions to make.

As for pred_batch, it looks like it has an upper limit of 64 predictions. Furthermore, I think it is not predicting on my test dataset but on the validation set.

Am I wrong about this, or is my code incorrect? I would really appreciate any help!

data_test = (TabularList.from_df(y1, cat_names=cat_names, cont_names=cont_names, procs=procs)
                       .random_split_by_pct()).label_from_df(cols=dep_var).databunch()
preds = learn.pred_batch(data_test.train_dl)
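
The limit of 64 is consistent with pred_batch returning the output for a single batch, as the documentation quoted earlier says; 64 is, as far as I know, the default batch size in fastai v1. A minimal sketch of the idea in plain Python, where `iter_batches` is a hypothetical helper standing in for a data loader and the 150 rows are made up: one batch yields at most `bs` rows, and covering a whole dataset means walking every batch.

```python
def iter_batches(rows, bs=64):
    """Yield consecutive slices of at most `bs` rows (hypothetical helper)."""
    for start in range(0, len(rows), bs):
        yield rows[start:start + bs]


rows = list(range(150))  # pretend these are 150 test rows

# Taking a single batch, as pred_batch does, covers only the first `bs` rows.
first_batch = next(iter_batches(rows))

# Walking every batch covers all rows; the last batch may be smaller.
all_sizes = [len(batch) for batch in iter_batches(rows)]

print(len(first_batch))
print(all_sizes)
```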

I don’t really understand the part where you said your test set is generated on the go.
Did you mean it is generated during training?
The question is: do you have a test set ready when you want to call pred?

If I understand correctly, you train on your data, and at some point during training you generate a test set that is stored as a dataframe. Before you call pred, your test set dataframe is ready.

If that’s the case, you can do learn.data.add_test(TabularList.from_df(y1,…))

Then you can call preds = learn.get_preds(ds_type=DatasetType.Test) // check my syntax in the docs, I can’t tab-complete here :frowning:


Here is what I meant:

If you want to check, this is the link to the kernel. It is written for vision, but it should be similar for tabular data:
https://www.kaggle.com/heye0507/fastai-1-0-with-customized-itemlist

Hi heye0507, thanks.
Here is what I did:

learn.data.add_test(y1)
preds = learn.get_preds(ds_type=DatasetType.Test)

But a new problem comes up:

AttributeError: Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/opt/conda/lib/python3.6/site-packages/fastai/data_block.py", line 615, in __getitem__
    if self.item is None: x,y = self.x[idxs],self.y[idxs]
  File "/opt/conda/lib/python3.6/site-packages/fastai/data_block.py", line 105, in __getitem__
    if isinstance(idxs, Integral): return self.get(idxs)
  File "/opt/conda/lib/python3.6/site-packages/fastai/tabular/data.py", line 125, in get
    codes = [] if self.codes is None else self.codes[o]
AttributeError: 'TabularList' object has no attribute 'codes'

I don’t know why. I have also tried converting the y1 dataframe into a TabularList, but still no luck.

Hi Ben

I think you’re nearly there - that error looks like a problem with categories in the test data not matching the training and validation data.

I had a similar issue a while back and I got it to work by making sure that the test data was in my databunch before creating the learner. There is a nice minimal kaggle kernel for the current santander competition that may help you: https://www.kaggle.com/schock/santander-ootb-fast-ai-tabular-implementation

It was this one that pointed me in the right direction, so I hope it helps (and thank you to whoever contributed that kernel).

-Peter

I haven’t worked much with tabular data; most of my work was done on image data.
So I can only try to point you in a direction, and I might not be right… sorry :disappointed_relieved:

But the base approach should be the same.

  1. When you add a test set, you can’t pass in a raw dataframe (at least I don’t think it will work, because the underlying library doesn’t know anything about the preprocessing, for example your cont_names and cat_names, even though it takes an iterable object).

  2. Do the following instead, assuming your learner is named learn:

learn.data.add_test(TabularList.from_df(df_test, cat_names=cat_names, cont_names=cont_names, procs=procs))

When you create the test TabularList here, don’t split / label / transform / databunch / normalize. You just want an iterable TabularList.

preds = learn.get_preds(ds_type = DatasetType.Test)

You said it threw an error. Could you please post it so we can take a look?

And make sure that your df_test has the same continuous and categorical variables as your df_train.

Thank you very much. I did exactly what you said yesterday and it didn’t work,
but I repeated it today and it works now!

Here is the code. Thanks, everyone.

datatest = TabularList.from_df(y1, cat_names=cat_names, cont_names=cont_names, procs=procs)
learn.data.add_test(datatest) 
preds, _ = learn.get_preds(ds_type = DatasetType.Test)
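
The working snippet can be folded back into the generator loop from the first post. This is a minimal sketch, assuming add_test replaces any previously attached test set each time it is called. The names `predict_each_day`, `make_test_list`, and `test_ds_type` are hypothetical and introduced here only so the helper does not import fastai itself.

```python
def predict_each_day(learn, days, make_test_list, test_ds_type):
    """Yield (predictions_template_df, preds) for each day from the generator.

    make_test_list: callable that turns a dataframe into a test item list,
        e.g. lambda df: TabularList.from_df(df, cat_names=cat_names,
                                             cont_names=cont_names, procs=procs)
    test_ds_type: the dataset selector to predict on, e.g. DatasetType.Test
    """
    for market_obs_df, news_obs_df, predictions_template_df in days:
        # Attach the current day's rows as the test set (assumed to replace
        # any test set attached on a previous iteration).
        learn.data.add_test(make_test_list(market_obs_df))
        # get_preds returns (predictions, targets); the targets are unused here.
        preds, _ = learn.get_preds(ds_type=test_ds_type)
        yield predictions_template_df, preds
```

In the kernel it would be called with `env.get_prediction_days()` as `days` and `DatasetType.Test` as `test_ds_type`.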

Good to know it worked :slight_smile:

And if you don’t mind me asking, how did you format the code in your post so that it looks like Python code?

Thanks

Thanks Hao,
you can do it by indenting with four spaces or by choosing preformatted text from the toolbar.
The hotkey on Mac is Command+Shift+C; maybe it’s Ctrl+Shift+C on Windows :grin: