Weird Prediction results

Hi, I am experiencing something weird. I'm not sure if I have done everything correctly and in the right order.
Using the Titanic data from Kaggle. Using fastai v1+.

procs = [FillMissing, Categorify, Normalize]
cat_names = ['Pclass','Sex', 'Title', 'SibSp', 'Parch','Embarked','Cabin']
cont_names = ['Age', 'Fare']
dep_var = 'Survived'

data = (TabularList.from_df(train_df_new, procs=procs, cont_names=cont_names, cat_names=cat_names)
        .split_by_idx(valid_idx=range(len(train_df_new)-89,len(train_df_new)))
        .label_from_df(cols=dep_var)
        .add_test(TabularList.from_df(test_df_new, cat_names=cat_names, cont_names=cont_names, procs=procs))
        .databunch())

learn = tabular_learner(data, layers=[200,100], metrics=accuracy)
learn.fit_one_cycle(5, 1e-3)

preds, y = learn.get_preds(ds_type=DatasetType.Test)

My y results are all zeros (as in all dead) when predicting on the test dataset!

y
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...])

But the predicted probabilities for class 1 are very low:

tensor([[9.0810e-01, 9.1897e-02],
        [6.5247e-01, 3.4753e-01],
        [9.5031e-01, 4.9685e-02],
        [9.0966e-01, 9.0339e-02],
        [5.7985e-01, 4.2015e-01],
        [8.8298e-01, 1.1702e-01],

What could be going on?

I was trying to solve a tabular classification task in the past few days, but it looks like there is a bug in the instantiation of the TabularDataBunch; see the post by Sylvain here.

So I’m surprised that you can even get this far. Which version of fastai did you install?

I have 1.0.42 installed currently.

What I also find weird is that when I use show_batch and specify the test dataset, there is a target column with all zeros. Is this correct, I mean, before I even train anything?

data.show_batch(4)

|Pclass|Sex|Title|SibSp|Parch|Embarked|Cabin|Age|Fare|target|
|---|---|---|---|---|---|---|---|---|---|
|3|female|Miss|5|2|S|N|-0.9832|0.2700|0|
|3|male|Mr|0|0|S|N|-0.7607|-0.4879|0|
|1|female|Miss|0|0|S|B|-0.9832|1.0394|1|
|1|male|Mr|0|0|S|N|2.4287|-0.1254|0|

data.show_batch(4, ds_type=DatasetType.Test)

|Pclass|Sex|Title|SibSp|Parch|Embarked|Cabin|Age|Fare|target|
|---|---|---|---|---|---|---|---|---|---|
|3|male|Mr|0|0|Q|N|0.3889|-0.4892|0|
|3|female|Mrs|1|0|S|N|1.3161|-0.5053|0|
|2|male|Mr|0|0|Q|N|2.4287|-0.4531|0|
|3|male|Mr|0|0|S|N|-0.1673|-0.4730|0|

Also, can I use plain “Test” as ds_type instead of DatasetType.Test, as I saw someplace on the forums?
I noticed that the results are different when I leave it blank, use “Test”, or use DatasetType.Test.

I think that, by default, no labels for the test set are stored. But please make sure that this is also true in your case:

learn.data.test_ds.y

Therefore the y tensor does not contain the actual predicted classes. To get them, you can compute the index of the maximum value in your prediction probabilities:

preds, _ = learn.get_preds(ds_type=DatasetType.Test, ordered=True)
pred_prob, pred_class = preds.max(1)
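To illustrate what `preds.max(1)` does without needing fastai or torch installed, here is a plain-Python stand-in: for each row of class probabilities, the predicted class is the index of the largest value (the sample rows below are copied from the `preds` tensor earlier in the thread).

```python
# Per-row class probabilities, as printed earlier in the thread
# (column 0 = did not survive, column 1 = survived).
rows = [
    [9.0810e-01, 9.1897e-02],
    [6.5247e-01, 3.4753e-01],
    [9.5031e-01, 4.9685e-02],
]

# Index of the maximum value in each row, i.e. the predicted class.
pred_class = [max(range(len(r)), key=r.__getitem__) for r in rows]
```

Every row above puts more mass on class 0, so all three predicted classes come out as 0, which matches the "all dead" pattern in the original post.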

Yes the test set is always unlabeled in fastai (see here for more information and how to validate on a second validation set). The test set is there to quickly get predictions on unlabeled data.

Thanks! The y label is in fact blank as it should be.

Natalie’s code helped me get actual predictions out of my model.
But I got an error with ordered=True; maybe it isn't included in the latest code versions?

ordered is only an argument in text, because the texts are sorted by their lengths (so the predictions come back in a different order). In tabular, your predictions come in the same order as your dataframe, so that argument doesn't exist.
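Since tabular predictions keep the dataframe's row order, the predicted classes can be paired positionally with the test rows, e.g. to build a Kaggle submission. This is only a sketch with stand-in values: `pred_class` would come from the `preds.max(1)` step above, and the `PassengerId` column is an assumption based on the Titanic dataset, not something shown in this thread.

```python
import pandas as pd

# Stand-ins for illustration: in practice pred_class comes from
# preds.max(1) and test_df_new is the real Kaggle test dataframe.
pred_class = [0, 1, 0]
test_df_new = pd.DataFrame({'PassengerId': [892, 893, 894]})

# Predictions are in the same order as the dataframe rows,
# so a simple positional pairing is safe.
submission = pd.DataFrame({
    'PassengerId': test_df_new['PassengerId'],
    'Survived': pred_class,
})
submission.to_csv('submission.csv', index=False)
```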


How do I make sure that only the top-3 or top-2 predictions are considered during the training phase? This is in reference to multi-class or multi-label scenarios.