Titanic Kaggle Competition

I’m trying to use the FastAI library in the Titanic Kaggle competition (https://www.kaggle.com/c/titanic).

Here is what the training data looks like:

And here is how I set it up for the Learner:

procs = [FillMissing, Categorify, Normalize]

dep_var = 'Survived'
cat_names = ['Pclass', 'Sex', 'Cabin', 'Embarked']
cont_names = ['Age', 'SibSp', 'Parch', 'Fare']

data_test = TabularList.from_df(test, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)

data = (TabularList.from_df(train, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                       .split_by_rand_pct(0.2)
                       .label_from_df(cols=dep_var)
                       .add_test(data_test, label=0)
                       .databunch())

By doing so, I’m getting this error: “There are nan values in field ‘Fare’ but there were none in the training set. Please fix those manually.” It makes sense, but I thought using processors such as FillMissing would take care of this problem.

Can anyone clarify what processors do exactly, and how they transform the training, validation and test sets?


Hey Thomas,

Sorry I can’t help, but I have a similar question to yours. I’m wondering how we set up our test set to get the same normalization and cat / cont variables. I tried using TabularList and I tried using .add_test(), and neither worked.

When we run learn.predict(test_data), what are we putting in as the test_data parameter?


Thanks for your reply !

From what I understand, learn.predict takes a single item as a parameter, so test_data is a single fastai ItemBase that can be one of the three provided basic types or your own custom type (https://docs.fast.ai/tutorial.itemlist.html#Creating-a-custom-ItemBase-subclass).
I think you can create this item manually by calling something like:
test_data = Image(<Tensor>)
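For tabular models specifically, I believe learn.predict can take a raw DataFrame row (a pandas Series) rather than a hand-built ItemBase. A minimal sketch of extracting such a row, with hypothetical column names and toy values (the learn.predict call itself is omitted here):

```python
import pandas as pd

# Hypothetical stand-in for the processed test DataFrame.
test_df = pd.DataFrame({"Pclass": [3, 1], "Age": [22.0, 38.0]})

# Hold one row aside as a Series; learn.predict(predict_row) would then
# run the trained tabular model on that single passenger.
predict_row = test_df.iloc[0]
```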

Still wondering how to add the test set so that it is processed just like the training set. I tried using add_test with a pandas DataFrame, but then I got an error when using learn.get_preds.

Yeah - I guess that’s what I meant: how do I process the test DataFrame so that it goes through the same processing as the training DataFrame?

I tried to use the Tabular transforms (https://docs.fast.ai/tabular.transform.html) to no avail.

Let me know how you go - I’ll try to work on it again after work.

The error message is actually very clear, but to make it a bit more precise, maybe read it as:
There are nan values [in your test set] in field ‘Fare’ but there were none in the training set. Please fix those manually.

Since the training set does not contain any NaN values in ‘Fare’, the processor does nothing for that column. Then, when it finds NaN values in the test set, it can’t process them because it didn’t see any during training. It just doesn’t know how to deal with them :sunglasses:

Two solutions, then:

  • Manually replace every NaN value in the test set with a chosen FillStrategy, as the message suggests
  • Add a NaN value to the training set so that the processor learns how to deal with them

Then the processor will be applied the same way on both sets.
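The first option can be sketched in plain pandas. This is just the concept behind FillMissing with FillStrategy.MEDIAN, on toy values rather than the real Kaggle data: the fill value is computed on the training set only, then applied to the test set.

```python
import pandas as pd

# Toy stand-ins for the Kaggle train/test frames (hypothetical values).
train_df = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 53.10]})
test_df = pd.DataFrame({"Fare": [8.66, None, 26.55]})

# The fill value comes from the training set only...
fare_median = train_df["Fare"].median()

# ...and is then used to replace the NaNs in the test set.
test_df["Fare"] = test_df["Fare"].fillna(fare_median)
```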

It also works using a TabularDataBunch with the test_df parameter (https://docs.fast.ai/tabular.data.html#TabularDataBunch.from_df):

data = TabularDataBunch.from_df(path, train, dep_var, valid_idx=valid_idx, test_df=test, procs=procs, cat_names=cat_names, cont_names=cont_names)

Hope it helps :slight_smile:


Hi Thomas,
Please avoid putting procs=procs in data_test and try this:
data_test = TabularList.from_df(test, path=path, cat_names=cat_names, cont_names=cont_names)

I’m getting an error now: ‘PassengerId’ TypeError: an integer is required

My cont_names list is ['PassengerId', 'Age', 'Fare']
and my cat_names list is:
['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked', 'Age_na']
I ran TabularList.from_df() as suggested by @sanjabh, but to no avail.

Any ideas? It seems to be a problem with the continuous variables in the test DataFrame.

PassengerId should be a cat_name, I believe, since it’s a label for one individual and not a quantity on a continuous scale.

I think you can remove PassengerId from cont_names, since it doesn’t have an impact on accuracy. Otherwise, when calling pd.read_csv, you can specify dtype={"PassengerId": np.int32}, because if you don’t, I think pandas auto-converts it to a float.
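A quick sketch of that read_csv idea, using a tiny inline CSV standing in for test.csv (the rows are hypothetical):

```python
import io

import numpy as np
import pandas as pd

# A minimal CSV standing in for test.csv (hypothetical rows).
csv_text = "PassengerId,Age\n892,34.5\n893,47.0\n"

# Force PassengerId to a fixed integer dtype instead of letting pandas infer it.
df = pd.read_csv(io.StringIO(csv_text), dtype={"PassengerId": np.int32})
```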


When I remove PassengerId from cont_names, it then throws the same error with Age. It’s causing an error on anything that’s a float.

Try with the following:

dep_var = 'Survived'
cat_names = ['Pclass', 'Sex', 'Cabin', 'Embarked']
cont_names = ['Age', 'SibSp', 'Parch', 'Fare']
procs = [FillMissing, Categorify, Normalize]

Check your fastai version if the error still persists:
import fastai
print(fastai.__version__)

I tried with fastai version 1.0.51 and it works fine.


Would you mind sharing a gist? I have checked the version and played with the cat and cont names without success.

Here is a notebook for the Titanic Kaggle Competition, maybe it helps:


Hey, thanks Bernd - that was a great example of putting the test set into the DataBunch; that really helped.

I guess my question is:

If we somehow dug up data on a new Titanic passenger and wanted to run a prediction on them, what would we need to do to get learn.predict() to work?

I ran the model with the test set, and I kept one row of the test set saved as prediction_df. Once the model is trained, I want to run learn.predict against prediction_df, but it isn’t letting me. What do I have to do so that I can predict against it?

Thanks for everyone’s help.

@mcclomitz this works in my notebook:

test = TabularList.from_df(df_test, cat_names=cat_names, cont_names=cont_names, procs=procs)
predictions, *_ = learn.get_preds(test)

If you have added the test set to the data, you can then call learn.get_preds(DatasetType.Test).

So I have gotten the test set to work fine and that is all peachy.

My problem is that in real life we would then take this model and run it against a completely new row of data to produce a prediction. Imagine we found new data on Titanic passengers and wanted to predict whether they survived.

What I want to work out is how to run learn.predict() on a new row and produce a result.

I have taken one row of the test set and held it aside in predict_df, and after running the full training I want to run the model against predict_df. This is where I am stuck.

Hi everyone,

I am working on the Titanic project as well, and my accuracy is stuck around 0.82. I used a similar method to the notebook shared: https://github.com/berndheidemann/notebooks_fastai/blob/master/titanic.ipynb

I tried varying the layers in the learner model and the lr, but it doesn’t really help. I just wonder if you have a better method to handle this, since the leaderboard has scores of 1. Do you know if they use stacked ML models, as suggested in one of the kernels (https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python)?

Are there any ways we can do better with a fastai model?


Hi! I also want to improve my score. With a basic fast.ai tabular model I get a public score of about 0.77***.

Changing the number of training epochs and the layer sizes doesn’t really improve it.

I added three boolean features based on the price category of the fare.

I think the validation curve is really bad, isn’t it? The validation error only oscillates and doesn’t really go down over 200 epochs (similar curve with 100 or 20 epochs).