I’m working on the Kaggle Titanic dataset. The goal is to predict whether a person survived (1) or died (0).
This is my code:
from fastai.tabular import *
from fastai.metrics import error_rate, accuracy
from pathlib import Path
import pandas as pd
path = Path('datasets/titanic')
training_path = path.joinpath('train.csv')
testing_path = path.joinpath('test.csv')
np.random.seed(1)
train = pd.read_csv(training_path)
test_df = pd.read_csv(testing_path)
valid_idx = range(len(train)-178, len(train))
procs = [FillMissing, Categorify, Normalize]
dependent_var = 'Survived'
categorical = ['Pclass', 'Sex', 'Cabin', 'Embarked']  # Parch?
cont_names = ['Age', 'SibSp', 'Parch']  # , 'Fare']
test = TabularList.from_df(test_df, cat_names=categorical, cont_names=cont_names, procs=procs)
data = (TabularList.from_df(train, cat_names=categorical, cont_names=cont_names, procs=procs)
        .split_by_idx(list(valid_idx))
        .label_from_df(cols=dependent_var)
        .add_test(test)
        .databunch())
learn = tabular_learner(data, layers=[20, 40], metrics=[accuracy, error_rate])
learn.fit_one_cycle(3, max_lr=1e-1)
preds = learn.get_preds(DatasetType.Test)
preds[1]
When I output the predicted classes, they’re all 0. My error rate is fairly high (0.17), so even if the output is no good, I’d expect to see some variance in the predicted classes. What am I missing here?
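For what it's worth, here is a small sketch of how I understand the return value. I'm assuming `get_preds` returns a pair of (per-class probabilities, targets), in which case `preds[1]` would hold the targets rather than the predictions, and an unlabeled test set would only have placeholder targets. The arrays below are made-up NumPy stand-ins, not real model output.

```python
import numpy as np

# Made-up stand-ins for what get_preds is assumed to return:
# element 0 = per-class probabilities, element 1 = targets.
probs = np.array([[0.8, 0.2],
                  [0.3, 0.7],
                  [0.6, 0.4]])

# An unlabeled test set has no real targets, so placeholder zeros are assumed here.
targets = np.zeros(len(probs), dtype=int)
preds = (probs, targets)

# Predicted class = index of the highest probability in each row.
predicted_classes = preds[0].argmax(axis=1)

print(preds[1])           # the placeholder targets
print(predicted_classes)  # classes derived from the probabilities
```

If that assumption holds, the equivalent on the real tensors would presumably be `preds[0].argmax(dim=1)`.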
Bonus question: I left ‘Fare’ out of the continuous variables because the fastai library complained about a NaN in that column of the test set when there was none in the dataset. Shouldn’t setting ‘procs’ on the test set have cleaned that up for me?
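To narrow this down, a check along the following lines might help. It is only a sketch: the two small frames are hypothetical stand-ins for `train.csv` and `test.csv`, and the median fill is a workaround I'm assuming is acceptable, not something fastai does for me. My guess is that `FillMissing` only learns fill values for columns that have gaps in the training data, so a column that is complete in train but not in test would slip through.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames standing in for train.csv and test.csv.
train = pd.DataFrame({'Age': [22.0, np.nan, 35.0],
                      'Fare': [7.25, 71.28, 8.05]})
test_df = pd.DataFrame({'Age': [34.5, 47.0],
                        'Fare': [7.83, np.nan]})

cont_names = ['Age', 'Fare']

# Count missing values per continuous column in each frame.
train_missing = train[cont_names].isna().sum()
test_missing = test_df[cont_names].isna().sum()

# Columns that are complete in train but have gaps in test.
only_in_test = [c for c in cont_names
                if train_missing[c] == 0 and test_missing[c] > 0]
print(only_in_test)

# Workaround: fill those gaps with the training median before building the test list.
for col in only_in_test:
    test_df[col] = test_df[col].fillna(train[col].median())
```

Using the training median rather than the test median keeps the test set from influencing its own preprocessing.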
Thanks!