I’m working on the Kaggle Titanic dataset. The goal is to predict whether a person survived (1) or died (0).
This is my code:
from fastai.tabular import *
from fastai.metrics import error_rate, accuracy
from pathlib import Path
import pandas as pd

path = Path('datasets/titanic')
training_path = path.joinpath('train.csv')
testing_path = path.joinpath('test.csv')

np.random.seed(1)

train = pd.read_csv(training_path)
test_df = pd.read_csv(testing_path)
valid_idx = range(len(train)-178, len(train))

procs = [FillMissing, Categorify, Normalize]
dependent_var = 'Survived'
categorical = ['Pclass', 'Sex', 'Cabin', 'Embarked']  # Parch?
cont_names = ['Age', 'SibSp', 'Parch']  #, 'Fare']

test = TabularList.from_df(test_df, cat_names=categorical, cont_names=cont_names, procs=procs)
data = (TabularList.from_df(train, cat_names=categorical, cont_names=cont_names, procs=procs)
        .split_by_idx(list(valid_idx))
        .label_from_df(cols=dependent_var)
        .add_test(test)
        .databunch())

learn = tabular_learner(data, layers=[20, 40], metrics=[accuracy, error_rate])
learn.fit_one_cycle(3, max_lr=(1e-01))

preds = learn.get_preds(DatasetType.Test)
preds
When I output the predicted classes, they’re all 0. My error rate is fairly high (0.17), so even if the output is no good, I’d expect to see some variance in the predicted classes. What am I missing here?
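To make "predicted classes" concrete: as far as I understand it, fastai v1's get_preds returns a pair of (per-class probabilities, targets), and for an unlabeled test set the targets are only placeholders. The sketch below is a minimal illustration under that assumption. The probabilities are made up and to_classes is a hypothetical helper, not part of fastai. It derives a class per row from the probabilities instead of reading the second element of the pair.

```python
# Hypothetical stand-in for the first element returned by get_preds:
# one row per passenger, columns = [P(died), P(survived)].
probs = [
    [0.91, 0.09],
    [0.35, 0.65],
    [0.52, 0.48],
]


def to_classes(rows):
    """Return the index of the largest probability in each row (argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


predicted = to_classes(probs)
print(predicted)
```

With real PyTorch tensors the equivalent step would presumably be an argmax over the class dimension of the first element, e.g. preds[0].argmax(dim=1).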
Bonus question: I left ‘Fare’ out of the continuous variables because the fastai library was complaining about a NaN in that column of the test set when there was none in the dataset. Shouldn’t setting ‘procs’ on the test data set have cleaned that up for me?
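One way to double-check the NaN claim is to count missing values per column in both files and compare them side by side. The sketch below uses made-up rows in place of train.csv and test.csv, so the counts are illustrative only. It rests on an assumption worth verifying against the fastai source: that FillMissing takes its fill values from the training set, so a column with gaps only in the test set may be left unfilled.

```python
import io

import pandas as pd

# Made-up rows standing in for train.csv and test.csv (not the real data).
train_csv = "Age,Fare\n22,7.25\n,71.28\n26,7.92\n"
test_csv = "Age,Fare\n34,7.83\n47,\n,9.69\n"

train_df = pd.read_csv(io.StringIO(train_csv))
test_df = pd.read_csv(io.StringIO(test_csv))

# Count missing values per column in each split, side by side.
nan_counts = pd.DataFrame({
    "train": train_df.isna().sum(),
    "test": test_df.isna().sum(),
})
print(nan_counts)

# Columns that are complete in train but have gaps in test.
only_in_test = (nan_counts["train"] == 0) & (nan_counts["test"] > 0)
test_only = nan_counts[only_in_test].index.tolist()
print(test_only)
```

Running the same comparison on the real files would show whether 'Fare' is complete in the training set but has a gap in the test set.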