Here’s my attempt, using a modification of Jeremy’s lesson3-Rossman notebook: https://github.com/dtylor/dtylor.github.io/blob/master/kaggle/titanic/titanic_nn.ipynb. My submission score was pretty average at 77%. I also tried using the generated embeddings in a random forest regressor and achieved the same score.
@dtylor - Thank you for posting this. Beginner question, if you’re willing:
“Accuracy of 87% using the embeddings calculated by the nn in the Random Forest Regressor.”
But I couldn’t see in your gist how the nn-calculated embeddings got into df. df is passed to ColumnarModelData.from_data_frame() and therefore is available in md; m is created by md.get_learner(), and then m.fit() is called. You then use df directly (well, converted to numpy) as input to the random forest.
The embeddings must be added to df and their values set during training … does all that happen in-place in the
@shub.chat I think there’s potential in trained embeddings that you can’t (?) get from trees. Am I missing something?
Thanks for reading and for your feedback. I am a beginner at this as well. The 87% accuracy of the random forest was based on the validation set, but the test prediction submitted to Kaggle produced exactly the same score as the neural net’s submission, 77.033%. The validation set wasn’t randomly selected; it was the last 90 rows of the training set (a carryover from the time-based selection in the Rossmann example), which may explain why it wasn’t representative of the test set.
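To make the difference between the two validation choices concrete, here is a minimal sketch in plain Python. The function names are made up for illustration and are not from the notebook; 891 is the row count of Kaggle's Titanic train.csv, and the actual dataframe in the notebook may differ.

```python
import random

def last_n_val_idx(n_rows, n_val):
    """Time-style split: the last n_val row indices (as in the Rossmann setup)."""
    return list(range(n_rows - n_val, n_rows))

def random_val_idx(n_rows, n_val, seed=42):
    """Random split: n_val distinct row indices, drawn reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), n_val))

n_rows, n_val = 891, 90                      # illustrative sizes
tail_idx = last_n_val_idx(n_rows, n_val)     # rows 801..890 only
rand_idx = random_val_idx(n_rows, n_val)     # rows spread over the whole file
print(tail_idx[:3], rand_idx[:3])
```

Either list could be passed wherever the notebook currently passes its validation indices. For data with no time ordering, the random version is usually the more representative choice.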
You are correct; the code wasn’t properly using the embeddings from the nn for the random forest (which I still would like to try if possible). I’ll correct the comments.
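A minimal sketch of what feeding the learned embeddings to the random forest could look like: each categorical code is replaced by its row in that column's embedding matrix, and the continuous features are appended. Everything here is hypothetical, including the function name, the column examples and the weight values. Reading the trained weights out of the fastai model is not shown.

```python
def embed_rows(cat_codes, cont_feats, emb_tables):
    """Replace each categorical code with its embedding vector and append
    the continuous features, giving one flat feature row per sample."""
    rows = []
    for codes, conts in zip(cat_codes, cont_feats):
        feats = []
        for col, code in enumerate(codes):
            feats.extend(emb_tables[col][code])  # row `code` of table `col`
        feats.extend(conts)
        rows.append(feats)
    return rows

# Hypothetical trained weights: one table per categorical column,
# one row per category level (values are made up for illustration).
emb_tables = [
    [[0.1, -0.3], [0.7, 0.2]],               # e.g. Sex: 2 levels, 2 dims
    [[0.0, 0.5], [0.4, 0.4], [-0.2, 0.9]],   # e.g. Pclass: 3 levels, 2 dims
]
cat_codes = [[0, 2], [1, 0]]    # integer category codes per row
cont_feats = [[22.0], [38.0]]   # e.g. Age

X_rf = embed_rows(cat_codes, cont_feats, emb_tables)
print(X_rf)
```

The resulting X_rf, rather than the raw df, is what would go into the random forest's fit call if the goal is to test whether the embeddings help.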
There is certainly potential, but from what I have observed so far it is limited. The overall incremental benefit I observed, specifically on classification problems with tabular data, was almost negligible. This could actually be a really great research area: pick up all the old classification problems on Kaggle and check whether ANNs using embeddings provide a benefit, and if so, how much. I still feel deep neural nets are not a panacea for all problems, specifically those on tabular data.
How did you do the cleaning?
Right now I’m working on this contest, my score using random forest is 0.74.
What hyperparameters did you use in the random forest?
I’ve been playing around with different methods for the Home Credit Default Risk Kaggle competition. With everything I’ve tried, boosted tree models have about 3-4% improved performance over fastai neural net models. I’ve tried playing around with different levels of dropout, adding layers, changing embedding matrix sizes, processing data in different ways and different training strategies. Optimizing these factors gets around 0.2-0.5% improvements, which isn’t going to close the performance gap much. To your point about unbalanced classes, this competition has a severe imbalance in training data, which may hurt neural net performance.
That said, my fastai structured data model outperforms other posted neural net solutions implemented in Keras/TensorFlow.
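On the class imbalance point, one common way to compensate is to weight each class inversely to its frequency. The sketch below is not something tried in the experiments above. The function name and the 92/8 label split are made up for illustration. The formula, n_samples / (n_classes * class_count), is the one scikit-learn documents for class_weight='balanced'.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

labels = [0] * 92 + [1] * 8      # made-up, heavily imbalanced labels
weights = inverse_frequency_weights(labels)
print(weights)                   # the rare class gets the larger weight
```

These weights could then be used to scale the per-sample loss, so that both classes contribute equally in total. Whether that would narrow the gap to boosted trees here is an open question.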