Re: Rossmann - How do we process and make predictions against the test set?

I was going back through the lesson6-rossmann notebook and noticed that it ends with model training and doesn’t include any reference code for making predictions against the test set.

In particular …

  1. How should we use the data block API to create a DataBunch (or perhaps just a single DataLoader) from test_df with the same pre-processing learned from the training set applied to it (e.g., apply the missing value, categorification, and normalization metadata we get from our training set to the test set)?

  2. Do the same learn.predict() mechanics work for tabular?


Hey WG,
When you use the data block API, you can add a test set before calling .databunch() (more here).

For the Rossmann competition the call would be the following:

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                   .add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars, cont_names=cont_vars))
                   .databunch())

Concerning predictions: I adjusted Radek’s starter code from the Quick, Draw! challenge:

preds, _ = learn.get_preds(ds_type=DatasetType.Test)  # predictions on the test set
key_ids = test_df['Id']
labels = np.exp(preds.numpy())  # undo the log=True target transform
sub = pd.DataFrame({'Id': key_ids, 'Sales': labels[:,0]})
sub.to_csv('submission.csv', index=False)  # write the submission file
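For context, the np.exp call above undoes the log=True transform applied to the target at labeling time. A minimal numpy illustration of that round trip (all numbers are made up):

```python
import numpy as np

# targets are trained in log space (label_cls=FloatList, log=True)
sales = np.array([5263.0, 6064.0, 8314.0])
log_targets = np.log(sales)

# suppose the model returns predictions in log space...
log_preds = log_targets  # a perfect model, for illustration only

# ...so we exponentiate to get back to the original sales scale
preds = np.exp(log_preds)
```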

Thanks @mrandy. Your code works perfectly given a test set during training.

I was thinking more in a production environment where you might get a .csv file every week or so to process.

Check this

Quote: use load_empty method. Note that for now, transforms and normalization (aka pre-processing that you referenced above) aren’t saved inside the export file. This is going to be integrated in a future version of the library. For now, we pass the transforms we applied on the validation set, along with all relevant kwargs, and we normalize with the same statistics as during training.
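To make the quoted point concrete, here is a minimal pandas/numpy sketch (not the fastai API; the column name and statistics are invented) of what "normalize with the same statistics as during training" means when a new csv arrives in production:

```python
import numpy as np
import pandas as pd

# statistics learned once, from the *training* set only
train = pd.DataFrame({'CompetitionDistance': [570.0, 14130.0, np.nan, 620.0]})
fill_value = train['CompetitionDistance'].median()             # FillMissing-style metadata
mean = train['CompetitionDistance'].fillna(fill_value).mean()  # Normalize-style metadata
std = train['CompetitionDistance'].fillna(fill_value).std()

def preprocess(new_df):
    """Apply the training-set preprocessing to any new dataframe."""
    out = new_df.copy()
    out['CompetitionDistance'] = out['CompetitionDistance'].fillna(fill_value)
    out['CompetitionDistance'] = (out['CompetitionDistance'] - mean) / std
    return out

# a "weekly production" dataframe, preprocessed with the training stats
weekly = pd.DataFrame({'CompetitionDistance': [1200.0, np.nan]})
processed = preprocess(weekly)
```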

The Rossmann lesson notebook is missing the final part where we make predictions for the test set and export them to a csv file. That part would be really helpful for a beginner like me.

Did someone complete the Rossmann jupyter notebook and could they share their notebook so we can learn from it?

You will get more by fighting through it 🙂 I believe in you!

But to help you along, here is how I did tabular predictions for another competition.

preds, targs = learn.get_preds(ds_type=DatasetType.Test)

# for a binary classifier, column 1 holds the probability of the positive class
target_preds = preds[:,1]

df['DepVar'] = target_preds.numpy()
df.to_csv('location', columns=['Identifier', 'DepVar'], index=False)

Thank you for your help.

And I know this will sound a bit like I’m trying to cut corners or am complaining, but it would just be nice to see a working sample of what I’m trying to learn if I’m doing it for the first time (and if I’m trying to do a course). I haven’t been able to find a good example of predicting continuous variables.

With that said, thank you again for sharing your notebook. It seems to be predicting a categorical variable, but I hope it’s still similar to what I will have to do.

I’ve also encountered this problem, and all I came up with was to write a function that can apply a trained tabular model to any new dataframe (in batches rather than row by row).
Here is my topic about it: Some useful functions for tabular models
And here is the notebook for the Rossmann case
But there is a catch (yes, there’s always one): info on normalization is not stored in the learn object, so I had to split the databunch creation process into two phases (details in the post I’ve linked).
This way you can score your new (production?) data without adding it as a test set.
Hope this helps!
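The batch-wise idea can be sketched framework-agnostically. In this sketch, predict_fn stands in for whatever wraps the trained model and its preprocessing; it and all the names below are hypothetical:

```python
import numpy as np
import pandas as pd

def predict_in_batches(df, predict_fn, batch_size=1024):
    """Apply predict_fn to df chunk by chunk instead of row by row."""
    preds = []
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        preds.append(predict_fn(chunk))
    return np.concatenate(preds)

# toy stand-in for a trained model: "predictions" are just 2 * feature
demo_df = pd.DataFrame({'x': np.arange(10.0)})
out = predict_in_batches(demo_df, lambda c: 2 * c['x'].to_numpy(), batch_size=4)
```

Batching matters because per-row prediction pays the model-invocation overhead once per row; chunking amortizes it.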

@Pak Thank you very much for that notebook. I’m still trying to wrap my head around most of it, but I think it will be very useful.

I’m having a comparable problem.
As I understand it, the model is built to predict the sales of the next day. Where could I adjust the model to predict a different time horizon, for example one month?
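One common way to change the horizon is not inside the model but in the label construction: shift the target so each row is labeled with the value h days ahead. A hedged pandas sketch (column names and the toy data are invented):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2015-01-01', periods=6, freq='D'),
                   'Sales': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

horizon = 2  # e.g. 30 for roughly one month ahead
df['SalesTarget'] = df['Sales'].shift(-horizon)  # label = sales h days later
df = df.dropna(subset=['SalesTarget'])           # last h rows have no label
```

Training against SalesTarget instead of Sales then gives an h-day-ahead predictor without touching the model architecture.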