Confused about to (TabularPandas)

Would appreciate any help. I am at my wits’ end!

I use this to transform my dataframe df

to = TabularPandas(df, procs=[Categorify, FillMissing],
                   cat_names=cat_names,
                   cont_names=cont_names,
                   y_names=dep_col,
                   splits=splits)

then

dls = to.dataloaders(bs=32)
learn = tabular_learner(dls, metrics=rmse)

This all works OK. But when I then try to make row-by-row predictions on new data (not seen by the model) using:

for i in range(len(new_df)):
    row, clas, probs = learn.predict(new_df.iloc[i])

I get wild predicted values, like -35 million, when the values in my train/valid df were within just 2.0-4.0.

So I know that my new_df needs to get transformed in order for the model to make sense of it. But I don’t know where to insert the dls from the learner??

I know of this method for doing the entire df:
dl = learn.dls.test_dl(df, bs=32)
preds, _ = learn.get_preds(dl=dl)

But how on earth do I make that work for my original row-by-row snippet? I can’t find anywhere to put the dls into the learn.predict method.

Thanks a (data) bunch!!

Unfortunately, I don’t know of a way to apply the transformations of the training dataframe to the test dataframe on individual columns only without transforming or copying the original dataframe. However, it’s worth considering using the test_dl method on the entire test dataframe and then calling learn.predict (docs) to get predictions. That way, you can make individual predictions instead of all predictions at once.

Another way to transform only a single row of data in the test dataframe would be to call test_dl on a new dataframe that contains only one row (the one you want to predict), and then call get_preds or predict on this new DataLoader. However, I don’t see the advantage of this over the first approach, though it may depend on the context.

To summarize, you could consider using learn.predict instead of learn.get_preds to make individual predictions. If you need to transform only a single row and make a prediction for it, you could implement a workaround where you create a new dataframe containing the desired row and call test_dl on that dataframe.

Thanks MW for your insightful answer!

I would like to follow your method to use test_dl and then learn.predict, but I don’t know how to send test_dl to learn.predict.

Any ideas?

Thanks

To use the DataLoader from the test_dl function with learn.predict, you can select a row from this DataLoader and just pass it as the first argument to the learn.predict function.

Ohhh ok. I get it now. Thank you!

I think it only just clicked for me why my predictions weren’t working. Can anyone confirm this:

What the first 10 columns look like when I recreate the dl on the entire train/val dataset:

and what they look like when I recreate them on only a prediction-sized subset of the same data:

This second method is how I was creating the dl for all my prediction dfs, so I am concerned I was doing it all wrong. The second method will produce bogus results, is that correct?

If so, what I don’t understand is how to add new rows to the dataloader (i.e. new entries on which to make predictions), so that they form part of the entire dl set, without already having the data available?

Update:
Having thought about this a bit more… adding a new entry requires recreating the entire test/train df and dls. Does that mean I would have to retrain the entire model each time I want to predict new entries?
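One way to see why the two encodings can differ, sketched in plain Python rather than fastai: fit_vocab and encode below are hypothetical stand-ins, loosely modelled on what a Categorify-style step does, not fastai’s actual code. The point is only that category codes fitted on the training data are a fixed lookup table, and refitting that table on a small subset gives the same values different codes.

```python
def fit_vocab(values):
    """Assign an integer code to each category seen; 0 is reserved for unseen values."""
    return {v: i + 1 for i, v in enumerate(sorted(set(values)))}

def encode(values, vocab):
    """Look up each value in an already-fitted vocab (unknowns map to 0)."""
    return [vocab.get(v, 0) for v in values]

train = ["red", "green", "blue", "green", "red"]
new_rows = ["red", "green"]

# Fitted once on the training data: {'blue': 1, 'green': 2, 'red': 3}
train_vocab = fit_vocab(train)

# Reusing the training vocab gives codes the model was trained on.
reused = encode(new_rows, train_vocab)

# Refitting on the small subset silently changes the meaning of each code.
refit = encode(new_rows, fit_vocab(new_rows))

print(reused)  # [3, 2]
print(refit)   # [2, 1]
```

As far as I understand, learn.dls.test_dl takes the first route: it applies the procs with the state already fitted on the training data, so new rows should not require rebuilding the dls or retraining the model. Building a fresh TabularPandas on only the new rows would correspond to the second route.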