Problem predicting on a test data set with tabular data

Berkshire · February 14, 2019, 10:21pm

I have a table with about 10 categorical variables, 60 numerical variables and 3000 rows (categorised but not normalised). I extract 50 rows for a test set and set the validation set to be the first 200 rows, leaving 2750 for the training set. I then create data = TabularDataBunch, passing valid_idx, test_df etc., and get a data bunch that seems to be OK when I review its description by typing data.
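A minimal pandas sketch of the split described above. The column names, the random data and the use of sample() to pick the 50 test rows are assumptions made for illustration; the post does not say how the test rows were chosen.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real table: 3000 rows, one categorical
# and one numerical column (the real table has about 70 columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat_a": rng.choice(["x", "y", "z"], size=3000),
    "num_a": rng.normal(size=3000),
})

# Hold out 50 rows for testing and remove them from the training frame,
# so the test rows can never leak into training.
df_test = df.sample(n=50, random_state=0)
df_train = df.drop(df_test.index).reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# The first 200 remaining rows form the validation set, addressed by position.
valid_idx = list(range(200))

# Rows left over for actual training.
n_train = len(df_train) - len(valid_idx)
print(len(df_test), len(valid_idx), n_train)
```

Resetting the index after the drop keeps valid_idx aligned with row positions in df_train.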

The learn = tabular_learner call seems to work, and when I look at the learner it seems sensible, with the train, valid and test descriptions in order (the test df has blank categories).

When I use learn.predict(df_test) I get KeyError: 'HADP', and if I use learn.predict(df.iloc[0]) I get AttributeError: Can only use .cat accessor with a 'category' dtype
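The AttributeError can be reproduced with pandas alone, which may help narrow down where it comes from. The sketch below only illustrates the pandas behaviour; it is not a diagnosis of what the learner does internally, and the column values are made up.

```python
import pandas as pd

# Hypothetical column as it might look after categorising the training data.
train_col = pd.Series(["a", "b", "a", "c"], dtype="category")

# The same column in a raw row or frame that was never categorised.
row_col = pd.Series(["b", "a"])

# Accessing .cat on a non-category column raises AttributeError.
try:
    row_col.cat.codes
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Converting with the training categories makes .cat available again
# and keeps the integer codes consistent with the training data.
fixed = row_col.astype(pd.CategoricalDtype(categories=train_col.cat.categories))
codes = fixed.cat.codes.tolist()
print(error_message)
print(codes)
```

If the data passed to predict has not had the same categorisation applied as the training data, an error of this kind is one plausible outcome.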

I have read a few topics on this but can’t seem to move it forwards, and wondered whether tabular is fully developed yet? I just want to run the trained model against a dataset (same structure and categorisation as the training dataset).

Thanks in advance for any ideas

peterwalkley · February 15, 2019, 3:07pm

Hi Mike

If I’ve understood you correctly, this looks like you have not removed the test set from the data frame passed to the TabularList.from_df() builder, so your test data is being included in the training data.

I’m also experimenting with tabular. In my case, I just want a prediction of the last row of the pandas frame, so I do this before training:

# Drop last
toPredict = tmpDf.iloc[-1]
tmpDf.drop(tmpDf.tail(1).index,inplace=True)

and then to predict after the training:

(predicted, _, _) = learn.predict(toPredict)

‘toPredict’ is a pandas.core.series.Series. I think if you want to do multiple predictions, you need to use the pred_batch method: https://docs.fast.ai/basic_train.html#Learner.pred_batch.
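For a handful of rows, one simple alternative is to call a single-row predictor once per row. The sketch below assumes only that the predictor accepts a pandas Series, as in the snippet above; predict_rows, the data and the dummy predictor are hypothetical names introduced for illustration.

```python
import pandas as pd

def predict_rows(predict_fn, frame):
    """Call predict_fn once per row, passing each row as a pandas Series."""
    return [predict_fn(row) for _, row in frame.iterrows()]

# Hypothetical held-out rows.
held_out = pd.DataFrame({"cat_a": ["x", "y"], "num_a": [1.5, -0.5]})

# Dummy predictor standing in for a trained model.
def dummy_predict(row):
    return "pos" if row["num_a"] > 0 else "neg"

results = predict_rows(dummy_predict, held_out)
print(results)
```

With a trained learner, predict_fn could be something like lambda row: learn.predict(row)[0]. This is slow for large frames, where a batch method is the better fit.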

Prediction was a bit of an exercise left for the student in the Rossmann notebook.

Berkshire · February 16, 2019, 12:19pm

Peter - Thanks for the link to the tabular descriptor. I had missed it, and I was able to use some of the information there to progress. I managed to get some form of output using learn.get_preds(ds_type=DatasetType.Test): a tensor that I could then interpret, even though the classes part of the tensor was set to zero.
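One way to interpret such an output, assuming a classification problem where each row of the predictions holds one probability per class, is to take the index of the largest value in each row and look it up in the list of class names. The sketch uses a NumPy array as a stand-in for the tensor; the probabilities and class labels are made up.

```python
import numpy as np

# Hypothetical probabilities for three test rows over two classes.
probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
])

# Made-up class labels, in the same order as the probability columns.
classes = ["no", "yes"]

# Index of the most probable class for each row.
pred_idx = probs.argmax(axis=1)

# Map the indices back to readable labels.
pred_labels = [classes[i] for i in pred_idx]
print(pred_labels)
```

The zeros in the classes part are to be expected for a test set, since it carries no labels.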

I had split the master df into a df_test dataframe and a training dataframe (with a valid_idx range of 0,92) and passed both into the model.

I noted that learn.lr_find() started to work but was interrupted on the second cycle for some reason. I was still able to plot the learner.

The actual prediction performance was not dissimilar to the RandomForest predictor I had previously built.

So I guess there is still a lot to learn, but progress has been made and I have what I need.