Structured Learner For Kaggle Titanic

I noticed the same thing and tried (unsuccessfully) to fix it by using the na_dict from test_df in my training df. I worry that the real problem is that I don’t actually understand what the NA values are, and I’ve been unable to google it. Could you explain?
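For context, here’s the pattern I was trying to follow (a sketch, assuming the fastai 0.7 proc_df API; train_df/test_df stand for the raw Kaggle frames):

```python
# Sketch, assuming fastai 0.7's proc_df: it fills each numeric NA with the
# column median, adds a boolean <col>_na indicator column, and records the
# fill values in na_dict so the same treatment can be reapplied elsewhere.
from fastai.structured import proc_df

# Process the training frame; nas maps column name -> median used to fill.
df_trn, y_trn, nas = proc_df(train_df, 'Survived')

# Reuse the training medians (and _na columns) on the test frame so both
# frames end up with identical columns and identical imputation.
df_test, _, _ = proc_df(test_df, na_dict=nas)
```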

Glad I could help.

The best I’ve achieved to date is to get my NN’s performance equal to my tree-based performance.


Yeah! And add to that the cost of a GPU-based ecosystem and of migrating to a different production environment, and it sometimes makes you wonder how much value it will have in current business scenarios (non-vision and non-NLP).

@Jan and @mfosker I’ll give that a roll when I get home later. Thanks for the tip!

Here’s my attempt using a modification of Jeremy’s lesson3-Rossman https://github.com/dtylor/dtylor.github.io/blob/master/kaggle/titanic/titanic_nn.ipynb. My submission score was pretty average at 77%. I also tried using the generated embeddings in a random forest regressor and achieved the same score.


@dtylor - Thank you for posting this. Beginner question, if you’re willing:

You got

Accuracy of 87% using the embeddings calculated by the nn in the Random Forest Regressor.

But I couldn’t see in your gist how the nn-calculated embeddings got into df. df is passed to ColumnarModelData.from_data_frame() and is therefore available in md. Then m is created by md.get_learner(), and m.fit() is called. You then use df directly (well, converted to numpy) as input to the random forest.

The embeddings must be added to df and their values set during training … does all of that happen in place in the df dataframe?

Thanks!

@shub.chat I think there’s potential in trained embeddings that you can’t (?) get from trees. Am I missing something?

Thanks for reading and for your feedback. I am a beginner at this as well. The 87% accuracy of the random forest was based on the validation set, but the test-prediction submission to Kaggle produced exactly the same score as the neural-net submission: 77.033%. The validation set wasn’t randomly selected but was the last 90 rows of the training set (a carryover from the time-based split in the Rossmann example), which may explain why it wasn’t representative of the test set.

You are correct; the code wasn’t properly using the embeddings from the nn for the random forest (which I still would like to try if possible). I’ll correct the comments.
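If I do get around to it, something like this is what I have in mind (a sketch, assuming the fastai 0.7 MixedInputModel, where m.model.embs is a ModuleList of nn.Embedding layers, one per column in cat_vars):

```python
import numpy as np

# One matrix per categorical column: look up each row's learned vector
# by its (proc_df-assigned) integer code.
emb_cols = []
for col, emb in zip(cat_vars, m.model.embs):
    w = emb.weight.data.cpu().numpy()        # (n_categories, emb_dim)
    emb_cols.append(w[df[col].values])       # (n_rows, emb_dim)

# Concatenate the per-column vectors into one feature matrix for the RF.
X_emb = np.concatenate(emb_cols, axis=1)
```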
Thanks again!

There is certainly potential, but from what I have observed so far it is limited. The overall incremental benefit I saw on tabular-data classification problems was almost negligible. This could actually be a great research area: pick up the old classification problems on Kaggle and check whether an ANN using embeddings provides a benefit, and if so, how much. I still feel deep neural nets are not a panacea for every tabular-data problem.

How did you do the cleaning?

Right now I’m working on this contest; my score using a random forest is 0.74.

What hyperparameters did you use in the random forest?

I’ve been playing around with different methods for the Home Credit Default Risk Kaggle competition. With everything I’ve tried, boosted tree models have about 3-4% improved performance over fastai neural net models. I’ve tried playing around with different levels of dropout, adding layers, changing embedding matrix sizes, processing data in different ways and different training strategies. Optimizing these factors gets around 0.2-0.5% improvements, which isn’t going to close the performance gap much. To your point about unbalanced classes, this competition has a severe imbalance in training data, which may hurt neural net performance.
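One standard mitigation for that imbalance (not from the competition kernels; just a plain PyTorch sketch with illustrative counts for a roughly 92/8 split) is weighting the loss inversely to class frequency:

```python
import torch
import torch.nn as nn

# Illustrative class counts for a ~92/8 negative/positive split.
n_neg, n_pos = 282_686, 24_825

# Up-weight the rare positive class so it isn't drowned out in training.
weights = torch.tensor([1.0, n_neg / n_pos])
loss_fn = nn.CrossEntropyLoss(weight=weights)
```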

That said, my fastai structured data model outperforms other posted neural net solutions implemented in Keras/Tensorflow.

That’s interesting. Is there a minimum number of records below which an NN shouldn’t be expected to work? I was trying it on the Santander value prediction competition, which has ~4,500 training data points, and the results were quite bad.

I don’t really know, but I would guess a lot. For example, the Rossmann challenge, where NNs worked well, had over a million rows in the final processed data set. The Rossmann data also contained nonlinear features, like time/seasonal relations to sales, which a NN should be better at capturing.

I think the Santander competition is particularly poorly suited to deep learning because all you have to go on is a tiny amount of sparse data.

Something I have been interested in trying for the Santander challenge is training an embedding matrix on the data (similar to lesson 5), then using the learned matrix to transform the data before passing it to a random forest/GBM. My hope is that the embedding matrix will learn latent features in the data, then pass them on to models better suited for small data sets, but there’s still the problem of having only a tiny amount of data to go on.
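A toy version of what I mean (plain PyTorch + sklearn with synthetic stand-in data; the real pipeline would use the actual Santander columns):

```python
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestRegressor

n_cats, emb_dim, n_rows = 50, 4, 4500
cats = torch.randint(0, n_cats, (n_rows,))   # stand-in categorical codes
y = torch.randn(n_rows)                      # stand-in continuous target

emb = nn.Embedding(n_cats, emb_dim)          # the matrix we want to learn
head = nn.Linear(emb_dim, 1)                 # throwaway head used only for training
opt = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=1e-2)

for epoch in range(20):                      # brief training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(head(emb(cats)).squeeze(1), y)
    loss.backward()
    opt.step()

# Replace each category code with its learned vector ...
X_emb = emb.weight.detach().numpy()[cats.numpy()]   # (n_rows, emb_dim)

# ... and hand the transformed features to a model suited to small data.
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_emb, y.numpy())
```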


I attempted it with gradient boosting, a random forest, and logistic regression.
I got the best score with the RF: 0.78.
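Roughly what I ran (a sketch with sklearn defaults and a synthetic stand-in for the processed Titanic frame; the real run used my cleaned features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the processed Titanic features/labels.
X, y = make_classification(n_samples=891, n_features=10, random_state=0)

for model in (GradientBoostingClassifier(),
              RandomForestClassifier(n_estimators=100),
              LogisticRegression(max_iter=1000)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```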


Sounds really cool! I’d really like to know how your experiment on this goes.


Hi Karl,

Would you mind sharing how you modified the Rossmann code to work with a classification problem?
Thanks a ton!

So this is what I have going right now

I wouldn’t call the model working. It doesn’t really train, and I think it’s just converging to zero, given that 92% of the test set is a single value. To use the structured data model for classification, I just used what was done in this notebook:
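In case it helps, the core change looks roughly like this (a sketch, assuming the fastai 0.7 columnar API; the argument values are placeholders, not the notebook’s):

```python
from fastai.column_data import ColumnarModelData

# is_reg=False switches the data object to classification; labels become ints.
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y.astype('int'),
                                       cat_flds=cat_vars, bs=128, is_reg=False)

# out_sz=2 gives two outputs for binary classification; y_range is dropped.
m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars),
                   0.04, 2, [1000, 500], [0.001, 0.01])
```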

Thanks, Karl!

Hi @dtylor,

I’m trying to use the f1 metric in m.fit(lr, 3, metrics=[f1]), but it gives an error.
Did you try it in your notebook?
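For reference, this is the workaround I’m experimenting with (a sketch; it assumes fastai 0.7 calls metrics as f(preds, targs) with preds as (n, n_classes) log-probabilities, which I haven’t fully verified):

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_np(preds, targs):
    # Convert the log-probabilities to hard class labels first; passing the
    # raw scores straight to f1_score is what raises the error.
    return f1_score(targs, np.argmax(preds, axis=1))

# m.fit(lr, 3, metrics=[f1_np])
```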