Learn.predict can only predict for a single row, not for the whole dataframe

kachun1017 · February 27, 2019, 2:35pm

Hi everyone, I am in a Kaggle competition that requires loading the test data from a Python generator, so I cannot add the test file when creating the TabularList and DataBunch.

data = (TabularList.from_df(market_train_df, cat_names=cat_names, cont_names=cont_names, procs=procs)
                       .random_split_by_pct()).label_from_df(cols=dep_var).databunch()

I use the code above to create data and learn. Then I use the following code to predict, where y1 is the pandas DataFrame produced by the generator.

days = env.get_prediction_days()
for (market_obs_df, news_obs_df, predictions_template_df) in days:
    x1,y1,z1 = predictions_template_df, market_obs_df, news_obs_df
    preds = learn.predict(y1)

However, the code gives the following error:

TypeError                                 Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

TypeError: an integer is required

During handling of the above exception, another exception occurred:
KeyError: 'volume'

I also tried preds = learn.predict(y1.iloc[9:10]), and the same problem occurs.
Interestingly, when I predict a single row, it works.

preds = learn.predict(y1.iloc[9])

I thought it was caused by the Normalize step in the preprocessing, but since a single row can be predicted, I have no idea what is going wrong.

I am currently running this, but I know there must be a better way.

first=True
for i in range(4):
    pred = learn.predict(y1.iloc[i])[2]
    if first:
        preds = pred
        first = False
    else:
        preds = torch.cat((preds, pred))
preds = preds.view(-1, 2)
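
The row-by-row workaround above can be written without the first flag by collecting results in a list. The following is only a minimal sketch: collect_predictions and fake_predict are hypothetical names, and fake_predict stands in for learn.predict so the example runs without a trained model.

```python
def collect_predictions(predict_one, rows):
    """Call predict_one on each row and return the results as a list."""
    return [predict_one(row) for row in rows]


# Hypothetical stand-in for learn.predict, so the sketch runs without fastai.
def fake_predict(row):
    return [row["volume"] * 2.0, row["volume"] * 0.5]


rows = [{"volume": 1.0}, {"volume": 3.0}]
preds = collect_predictions(fake_predict, rows)
print(preds)  # [[2.0, 0.5], [6.0, 1.5]]
```

With fastai and PyTorch, the same idea would presumably be torch.stack([learn.predict(y1.iloc[i])[2] for i in range(len(y1))]), which should give one row per prediction without the final view call. It still runs the model once per row, so it stays slow for large dataframes.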

Does anyone have an idea about it? Thanks a lot!

yeldarb · February 27, 2019, 4:10pm

I’m not quite sure I understand your question, but predict is for a single item.

predict can be used to get a single prediction from the trained learner on one specific piece of data you are interested in.

You might be looking for pred_batch which can predict many rows at a time.

Return output of the model on one batch from ds_type dataset.
Note that the number of predictions given equals to the batch size.

kachun1017 · February 27, 2019, 4:47pm

Hi, thanks for your reply, yeldarb.

So I am trying to predict on test data from a pandas dataframe.
Normally I would use learn.get_preds(), but in this case I cannot,
because the test dataset is generated inside a for loop on the go. In other words, I have to change the test dataset over time.

So I used the predict function for a single item, but as mentioned above, an error pops up, and after searching for it I still don’t know what is going wrong.

When I use the predict function, it only generates one prediction and I have a lot of predictions to make.

For pred_batch, it looks like it has an upper limit of 64 predictions. Furthermore, I think it is not predicting on my test dataset but on the validation set.

Am I wrong about this, or is my code incorrect? I really appreciate the help!

data_test = (TabularList.from_df(y1, cat_names=cat_names, cont_names=cont_names, procs=procs)
                       .random_split_by_pct()).label_from_df(cols=dep_var).databunch()
preds = learn.pred_batch(data_test.train_dl)
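
The limit of 64 is consistent with pred_batch returning the model output for one batch only, so a larger dataframe spans several batches. Here is a small sketch of that arithmetic, assuming a batch size of 64 (the limit reported above); batch_slices is a hypothetical helper, not a fastai function.

```python
def batch_slices(n_rows, batch_size=64):
    """Yield (start, stop) index pairs that cover n_rows in batch-sized chunks."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


# 150 rows need three batches; a single batch covers only the first 64 rows.
slices = list(batch_slices(150))
print(slices)  # [(0, 64), (64, 128), (128, 150)]
```

Getting predictions for every row therefore needs something that walks over all batches, which is what get_preds is normally used for.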

heye0507 · February 27, 2019, 8:00pm

I don’t really understand the part where you said your test set is generated on the go.
Did you mean it is generated during training?
The question is, do you have a test set ready when you want to get predictions?

If I understand correctly, you train on your data, and somehow during training you generate a test set that is stored as a dataframe. Before you ask for predictions, your test set dataframe is ready.

If that’s the case, you can do learn.data.add_test(TabularList.from_df(y1,…))

Then you can call preds = learn.get_preds(ds_type=DatasetType.Test) (check my syntax in the docs, I can’t tab-complete here)

heye0507 · February 27, 2019, 8:04pm

Here is what I meant:

If you want to check, this is the link to the kernel. It is done for vision, but it should be similar for Tabular.
https://www.kaggle.com/heye0507/fastai-1-0-with-customized-itemlist

kachun1017 · February 28, 2019, 6:38am

Hi, thanks, heye0507.
Here is what I did:

learn.data.add_test(y1)
preds = learn.get_preds(ds_type=DatasetType.Test)

But a new problem comes up:

AttributeError: Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/opt/conda/lib/python3.6/site-packages/fastai/data_block.py", line 615, in __getitem__
    if self.item is None: x,y = self.x[idxs],self.y[idxs]
  File "/opt/conda/lib/python3.6/site-packages/fastai/data_block.py", line 105, in __getitem__
    if isinstance(idxs, Integral): return self.get(idxs)
  File "/opt/conda/lib/python3.6/site-packages/fastai/tabular/data.py", line 125, in get
    codes = [] if self.codes is None else self.codes[o]
AttributeError: 'TabularList' object has no attribute 'codes'

I don’t know why. I have also tried converting the y1 dataframe into a TabularList, but still no luck.

peterwalkley · February 28, 2019, 1:44pm

Hi Ben

I think you’re nearly there - that error looks like a problem with categories in the test data not matching the training and validation data.

I had a similar issue a while back and I got it to work by making sure that the test data was in my databunch before creating the learner. There is a nice minimal kaggle kernel for the current santander competition that may help you: https://www.kaggle.com/schock/santander-ootb-fast-ai-tabular-implementation

It was this one that pointed me in the right direction, so I hope it helps (and thank you to whoever contributed that kernel).

-Peter

heye0507 · February 28, 2019, 10:23pm

I haven’t worked much with Tabular data; most of my work was done on image data.
So I can only try to point you in a direction, and I might not be right… sorry.

But the base approach should be the same.

When you add a test set, you can’t pass in a dataframe (at least I don’t think it will work, because the underlying library doesn’t know anything about preprocessing, for example your cont_names and your cat_names, even though it takes an iterable object).
Try the following, assuming your learner is named learn:

learn.data.add_test(TabularList.from_df(df_test, cat_names=cat_names, cont_names=cont_names, procs=procs))

When you create the test TabularList here, don’t split / label / transform / databunch / normalize. You just want an iterable TabularList.

preds = learn.get_preds(ds_type = DatasetType.Test)

You said it threw an error. Could you please post it so we can take a look?

And make sure that your df_test has the same continuous and categorical variables as your df_train.

kachun1017 · March 1, 2019, 4:36am

Thank you very much. I did exactly what you said yesterday, but it didn’t work.
I repeated it today and it works now!

Here is the code. Thanks, everyone.

datatest = TabularList.from_df(y1, cat_names=cat_names, cont_names=cont_names, procs=procs)
learn.data.add_test(datatest) 
preds, _ = learn.get_preds(ds_type = DatasetType.Test)
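
Because the test dataframe changes on every iteration of the competition loop, the three lines above can be wrapped in a small helper. This is a sketch under assumptions, not a confirmed recipe: predict_on_frame is a hypothetical name, the function is duck-typed so it only relies on learn.data.add_test and learn.get_preds as used in this thread, and it assumes that calling add_test again replaces the previous test set.

```python
def predict_on_frame(learn, test_items, test_ds_type):
    """Attach test_items as the learner's test set and return predictions for it."""
    # Attach the new rows as the test set (assumed to replace any earlier one).
    learn.data.add_test(test_items)
    # get_preds returns a (predictions, targets) pair; only predictions are kept.
    preds, _ = learn.get_preds(ds_type=test_ds_type)
    return preds


# Intended use inside the competition loop (assumes fastai v1 and the names
# from this thread: env, learn, cat_names, cont_names, procs):
#
# for market_obs_df, news_obs_df, predictions_template_df in env.get_prediction_days():
#     test_items = TabularList.from_df(market_obs_df, cat_names=cat_names,
#                                      cont_names=cont_names, procs=procs)
#     preds = predict_on_frame(learn, test_items, DatasetType.Test)
```

Passing the same cat_names, cont_names and procs that were used for training keeps the test rows preprocessed the same way as the training rows.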

heye0507 · March 1, 2019, 6:26am

Good to know it worked

And if you don’t mind me asking, how did you format the code in your post like Python code?

Thanks

kachun1017 · March 1, 2019, 3:34pm

Thanks, Hao.
You can do it by indenting with four spaces or by choosing preformatted text from the tools.
The hotkey on Mac is Command+Shift+C; maybe it’s Ctrl+Shift+C on Windows.