Why are all of my model's predictions the same?


(Joel Hobson) #1

I’m working on the Kaggle Titanic dataset. The goal is to predict whether a person survived (1) or died (0).

This is my code:

from fastai.tabular import *
from fastai.metrics import error_rate, accuracy
from pathlib import Path
import pandas as pd

path = Path('datasets/titanic')
training_path = path.joinpath('train.csv')
testing_path = path.joinpath('test.csv')
np.random.seed(1)
train = pd.read_csv(training_path)
test_df  = pd.read_csv(testing_path)
valid_idx = range(len(train)-178, len(train))
procs = [FillMissing, Categorify, Normalize]
dependent_var = 'Survived'
categorical = ['Pclass', 'Sex', 'Cabin', 'Embarked'] # Parch?
cont_names = ['Age', 'SibSp', 'Parch']#, 'Fare']

test = TabularList.from_df(test_df, cat_names=categorical, cont_names=cont_names, procs=procs)
data = (TabularList.from_df(train, cat_names=categorical, cont_names=cont_names, procs=procs)
        .split_by_idx(list(valid_idx))
        .label_from_df(cols=dependent_var)
        .add_test(test)
        .databunch())

learn = tabular_learner(data, layers=[20, 40], metrics=[accuracy, error_rate])
learn.fit_one_cycle(3, max_lr=(1e-01))

preds = learn.get_preds(DatasetType.Test)
preds[1]

When I output the predicted classes, they’re all 0. My error rate is fairly high (0.17), so even if the output is no good, I’d expect to see some variance in the predicted classes. What am I missing here?

Bonus question: I left ‘Fare’ out of the continuous variables because the fastai library was complaining about a NaN in that column of the test set when there was none in the dataset. Shouldn’t setting ‘procs’ on the test data set have cleaned that up for me?

Thanks!
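
On the bonus question about 'Fare': as far as I can tell, fastai v1's FillMissing only learns fill values for columns that have gaps in the training set, so a NaN that appears only in the test set is not handled and has to be fixed by hand first. Below is a minimal sketch of one way to do that with plain pandas, filling the gap with the training median. The two small frames are made-up stand-ins for train.csv and test.csv, and the choice of the median is an assumption, not something the library prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train.csv and test.csv.
train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 53.10]})
test_df = pd.DataFrame({"Fare": [7.83, np.nan, 9.69]})

# Compute the fill value on the training set only, then apply it to the test set.
fare_median = train["Fare"].median()
test_df["Fare"] = test_df["Fare"].fillna(fare_median)

# Number of missing values left in the test column.
print(test_df["Fare"].isna().sum())
```

With the gap filled before building the TabularList, 'Fare' could presumably go back into cont_names.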


(Karl) #2

Sounds like your model learned the easiest path to low error was to predict a single class for everything. What does the class distribution in the dataset look like?


(Joel Hobson) #3

I just took a quick look through the csv. Looks like it’s around 39% survived, 61% dead.
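
The split can also be checked in code rather than by eye. Here is a minimal sketch, assuming the labels sit in a pandas column such as train['Survived']; the short Series below is made up for illustration. It also computes the error rate of a model that always predicts the most common class.

```python
import pandas as pd

# Hypothetical labels standing in for train['Survived'] (1 = survived, 0 = died).
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Fraction of rows in each class.
class_share = survived.value_counts(normalize=True)

# Error rate of a model that always predicts the most common class.
baseline_error = 1.0 - class_share.max()

print(class_share)
print(baseline_error)
```

With roughly 61% dead and 39% survived, always predicting 0 would give an error rate of about 0.39. An error rate of 0.17 on the validation set therefore suggests the model is not predicting a single class there.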


(Joel Hobson) #4

Interesting… It works if I calculate my predictions one row at a time.

final_df = pd.DataFrame(columns=['PassengerId', 'Survived'])
# Predict one test row at a time and collect the results.
for i, datum in enumerate(test):
    final_df.at[i, 'PassengerId'] = datum['PassengerId']
    final_df.at[i, 'Survived'] = learn.predict(datum)[0]

I found this thread about my exact problem which suggested that since the labels column for the test set was blank, it wouldn’t calculate the actual prediction. I’m not sure I understand this. Since the test set was part of the training databunch, shouldn’t it have generated a value for the label?
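
One possible explanation, offered as a guess: in fastai v1, get_preds returns a pair of (predictions, targets), so preds[1] in the first post would be the targets rather than the model output. A test set attached with add_test has no labels, and as far as I know the placeholder targets are all zeros, which would explain why every value looks the same. The predicted classes would then come from taking the argmax over preds[0]. The sketch below shows that step on a made-up NumPy array standing in for preds[0]; the numbers are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for preds[0]: one row per passenger,
# columns are the probabilities of class 0 (died) and class 1 (survived).
probs = np.array([
    [0.80, 0.20],
    [0.35, 0.65],
    [0.55, 0.45],
])

# The predicted class is the column with the highest probability.
pred_classes = probs.argmax(axis=1)

print(pred_classes)
```

On the real output, which is a torch tensor, the equivalent call would be preds[0].argmax(dim=1).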