Tabular - Issue splitting to validation results in random #na# in data

Clive · February 5, 2019, 8:01pm

I have been trying to get a simple Tabular example working (Titanic Kaggle), using the structure from Lecture 4 to build the databunch:

data = (TabularList.from_df(traindf, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .random_split_by_pct(valid_pct=0.2, seed=43)
        .label_from_df(cols=dep_var)
        .databunch())

This results in a training set that is OK, but the validation and test sets have some values replaced with #na# where I know there should be real data.

I have not been able to find any similar issue in the forum, so I assume I am doing something wrong in the call?

sgugger · February 6, 2019, 1:54am

If it’s #na#, it’s because the corresponding category wasn’t in the training set. You should be more careful in your splits if you want to avoid it.
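One way to check this is to compare the categorical values on each side of the split. The following is a minimal pandas sketch, not part of fastai: the helper name `unseen_categories` and the toy frames are hypothetical, and it assumes you already have the training and validation rows as two separate DataFrames.

```python
import pandas as pd

def unseen_categories(train_df, valid_df, cat_names):
    """Return {column: values that occur in valid_df but never in train_df}."""
    unseen = {}
    for col in cat_names:
        train_values = set(train_df[col].dropna().unique())
        valid_values = set(valid_df[col].dropna().unique())
        missing = valid_values - train_values
        if missing:
            # key=str keeps the sort stable even if a column mixes types
            unseen[col] = sorted(missing, key=str)
    return unseen

# Toy data standing in for a Titanic-like split (hypothetical values).
train_df = pd.DataFrame({"Sex": ["male", "female", "male"],
                         "Cabin": ["C85", "E46", "C85"]})
valid_df = pd.DataFrame({"Sex": ["female", "male"],
                         "Cabin": ["C85", "G6"]})

result = unseen_categories(train_df, valid_df, ["Sex", "Cabin"])
print(result)  # {'Cabin': ['G6']}
```

Any column that shows up in the result holds validation values the training set never saw, which are the ones that would be expected to appear as #na#.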

Clive · February 6, 2019, 8:22pm

Thanks for your feedback. I am not sure how this can be the case, as I am randomly splitting a single dataframe, so both parts should have identical categories. I have tried various combinations of categorical/continuous variables and stepped through the code, but I cannot see the error. It is not important, as it is just a toy example, but I would have liked to understand it for the sake of knowing.
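A random split of one dataframe does not by itself guarantee identical categories on both sides: any value that occurs in only a few rows can land entirely in the validation part. The sketch below illustrates this with a plain NumPy shuffle as a stand-in for the split (it is not fastai's implementation), and with a made-up `Ticket` column in which every value is unique. Which columns were actually passed as `cat_names` in the post is not known, so the column names here are assumptions.

```python
import numpy as np
import pandas as pd

n_rows = 100
df = pd.DataFrame({
    # Low cardinality: two values, 50 rows each.
    "Sex": ["male", "female"] * (n_rows // 2),
    # High cardinality: every row has its own value (hypothetical column).
    "Ticket": [f"T{i:03d}" for i in range(n_rows)],
})

# Shuffle row positions and hold out 20% as validation.
rng = np.random.RandomState(43)
shuffled = rng.permutation(n_rows)
n_valid = n_rows // 5
valid_df = df.iloc[shuffled[:n_valid]]
train_df = df.iloc[shuffled[n_valid:]]

# Count validation rows whose value never appears in the training rows.
unseen_counts = {
    col: int((~valid_df[col].isin(train_df[col])).sum())
    for col in ["Sex", "Ticket"]
}
print(unseen_counts)  # {'Sex': 0, 'Ticket': 20}
```

The two-valued column is fully covered by the training rows, while every validation row of the unique-valued column is unseen. Columns with many rare values behave somewhere in between, so the amount of #na# depends on which columns are treated as categorical.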