Here’s my attempt (https://github.com/dtylor/dtylor.github.io/blob/master/kaggle/titanic/titanic_nn.ipynb), which uses a modification of Jeremy’s lesson3-Rossman notebook. My submission score was pretty average at 77%. I also tried using the generated embeddings in a random forest regressor and achieved the same score.
@dtylor - Thank you for posting this. Beginner question, if you’re willing:
“Accuracy of 87% using the embeddings calculated by the nn in the Random Forest Regressor.”
But I couldn’t see in your gist how the nn-calculated embeddings got into df. df is passed to ColumnarModelData.from_data_frame() and is therefore available in md. m is created by md.get_learner(), and then m.fit() is called. You then use df directly (well, converted to numpy) as input to the random forest. The embeddings must be added to df and their values set during training … does all that happen in-place?
@shub.chat I think there’s potential in trained embeddings that you can’t (?) get from trees. Am I missing something?
Thanks for reading and for your feedback. I am a beginner at this as well. The 87% accuracy of the random forest was based on the validation set, but the test prediction submitted to Kaggle produced exactly the same score as the neural-net submission: 77.033%. The validation set wasn’t randomly selected; it was the last 90 rows of the training set (a carryover from the time-based selection in the Rossmann example), which may explain why it wasn’t representative of the test set.
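To make the difference between the two validation strategies concrete, here is a minimal sketch in plain Python. The helper names `last_n_split` and `random_split` are made up for illustration and are not part of fastai; 891 is the row count of the Kaggle Titanic training file, and 90 is the validation size mentioned above.

```python
import random

def last_n_split(n_rows, n_val):
    """Hold out the final n_val rows (the time-ordered split carried over from Rossmann)."""
    trn_idx = list(range(n_rows - n_val))
    val_idx = list(range(n_rows - n_val, n_rows))
    return trn_idx, val_idx

def random_split(n_rows, n_val, seed=42):
    """Hold out n_val rows drawn uniformly at random, without replacement."""
    rng = random.Random(seed)
    val_idx = sorted(rng.sample(range(n_rows), n_val))
    val_set = set(val_idx)
    trn_idx = [i for i in range(n_rows) if i not in val_set]
    return trn_idx, val_idx

n_rows = 891  # rows in the Kaggle Titanic train.csv
trn_a, val_a = last_n_split(n_rows, 90)
trn_b, val_b = random_split(n_rows, 90)
print(val_a[:3], val_b[:3])
```

For data with no time ordering, such as Titanic, the random version is usually the safer default, since the last rows of the file have no particular reason to resemble the test set.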
You are correct; the code wasn’t properly using the embeddings from the nn for the random forest (which I still would like to try if possible). I’ll correct the comments.
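For anyone wanting to try this, the general idea can be sketched as follows: look up each category code in its trained embedding matrix and concatenate the resulting vectors with the continuous columns before handing the rows to the random forest. This is only an illustration in plain Python. The random numbers stand in for weights that would really be copied out of the trained network, and `make_embedding`, `embed_rows`, and the example columns are hypothetical names, not fastai API.

```python
import random

def make_embedding(n_categories, dim, seed=0):
    """Stand-in for a trained embedding matrix: one row of `dim` floats per category.
    Assumption: in practice these values come from the trained network, not from random()."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_categories)]

def embed_rows(cat_codes, cont_values, embeddings):
    """Replace each integer category code with its embedding vector and
    append the continuous features, giving one flat feature row per sample."""
    features = []
    for codes, conts in zip(cat_codes, cont_values):
        row = []
        for col, code in enumerate(codes):
            row.extend(embeddings[col][code])
        row.extend(conts)
        features.append(row)
    return features

# Two hypothetical categorical columns: one with 2 levels (1-d embedding),
# one with 3 levels (2-d embedding), plus two continuous columns.
embeddings = [make_embedding(2, 1, seed=1), make_embedding(3, 2, seed=2)]
cat_codes = [[0, 2], [1, 0], [1, 1]]
cont_values = [[22.0, 7.25], [38.0, 71.28], [26.0, 7.92]]

X = embed_rows(cat_codes, cont_values, embeddings)
print(len(X), len(X[0]))
```

The resulting `X` (3 embedding dimensions plus 2 continuous values per row) is what would replace the raw integer codes as input to the tree model.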
There is certainly potential, but from what I have observed so far it is limited. The incremental benefit I have seen on tabular classification problems was almost negligible. This could actually be a great research area: pick up the old classification problems on Kaggle and check whether ANNs using embeddings provide a benefit and, if so, how much. I still feel deep neural nets are not a panacea, especially for tabular data.
How did you do the cleaning?
Right now I’m working on this contest, my score using random forest is 0.74.
What hyperparameters did you use in the random forest?
I’ve been playing around with different methods for the Home Credit Default Risk Kaggle competition. With everything I’ve tried, boosted tree models have about 3-4% improved performance over fastai neural net models. I’ve tried playing around with different levels of dropout, adding layers, changing embedding matrix sizes, processing data in different ways and different training strategies. Optimizing these factors gets around 0.2-0.5% improvements, which isn’t going to close the performance gap much. To your point about unbalanced classes, this competition has a severe imbalance in training data, which may hurt neural net performance.
That said, my fastai structured data model outperforms other posted neural net solutions implemented in Keras/Tensorflow.
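On the class-imbalance point, a small sketch shows why accuracy alone can hide a model that has collapsed to the majority class. The 92/8 split below is a made-up illustration, not the actual competition data, and the two helper functions are written from scratch rather than taken from any library.

```python
def accuracy(preds, targs):
    """Fraction of predictions that match the targets."""
    return sum(p == t for p, t in zip(preds, targs)) / len(targs)

def f1(preds, targs):
    """F1 score for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for p, t in zip(preds, targs) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, targs) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, targs) if p == 0 and t == 1)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Hypothetical imbalanced data: 92 negatives, 8 positives.
targs = [0] * 92 + [1] * 8
# A degenerate model that always predicts the majority class.
preds = [0] * 100

print(accuracy(preds, targs), f1(preds, targs))
```

The always-negative model looks strong on accuracy but scores zero on F1, which is one reason to track a precision/recall-based metric during training on data like this.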
That’s interesting. Is there a minimum number of records needed for a NN to work? I tried it on the Santander value prediction competition, which has ~4500 training data points, and the results were quite bad.
I don’t really know, but I would guess a lot. For example, the Rossmann challenge, where NNs worked well, had over a million rows in the final processed data set. The Rossmann data also contained nonlinear features, like time/seasonal relations to sales, which a NN should be better at understanding.
I think the Santander competition is particularly poorly suited to deep learning because all you have to go on is a tiny amount of sparse data.
Something I have been interested in trying for the Santander challenge is training an embedding matrix on the data (similar to lesson 5), then using the learned matrix to transform the data before passing it to a random forest/GBM. My hope is that the embedding matrix will learn latent features in the data, then pass them on to models better suited for small data sets, but there’s still the problem of having only a tiny amount of data to go on.
I tried gradient boosting, random forest, and LR.
I got the best score, 0.78, with RF.
Sounds really cool! I would really like to know how your experiment goes.
Would you mind sharing how you modified the Rossmann code to work with a classification problem?
Thanks a ton.
So this is what I have going right now
I wouldn’t call the model working. It doesn’t really train, and I think it’s just converging to zero given how 92% of the test set is a single value. To use the structured data model for classification, I just used what was done in this notebook:
I’m trying to use an f1 metric in m.fit(lr, 3, metrics=[f1]), but it gives an error.
Did you try it in your notebook?
The problem you were getting came from the different shapes of targs (1 dimension) and preds (2 dimensions), plus the preds were log()'ed.
In metrics.py, add:
    def recall_torch(preds, targs, thresh=0.5):
        # torch.max(..., dim=1) returns (values, indices); [1] keeps the predicted class index
        pred_pos = torch.max(preds > thresh, dim=1)[1]
        tpos = torch.mul((targs.byte() == pred_pos.byte()), targs.byte())
        return tpos.sum()/targs.sum()

    def precision_torch(preds, targs, thresh=0.5):
        pred_pos = torch.max(preds > thresh, dim=1)[1]
        tpos = torch.mul((targs.byte() == pred_pos.byte()), targs.byte())
        return tpos.sum()/pred_pos.sum()

    def log_fbeta_torch(log_preds, targs, beta, thresh=0.5):
        assert beta > 0, 'beta needs to be greater than 0'
        beta2 = beta ** 2
        # the model outputs log-probabilities, so convert once here
        preds = torch.exp(log_preds)
        rec = recall_torch(preds, targs, thresh)
        prec = precision_torch(preds, targs, thresh)
        return (1 + beta2) * prec * rec / (beta2 * prec + rec)

    def log_f1_torch(log_preds, targs, thresh=0.5):
        return log_fbeta_torch(log_preds, targs, 1, thresh)
The output looks promising (but I could be wrong - so please validate):
    m.fit(lr, 5, cycle_len=1, metrics=[accuracy, log_f1_torch])

    Epoch 100% 5/5 [00:00<00:00, 16.44it/s]
    epoch      trn_loss   val_loss   accuracy   log_f1_torch
    0          0.530434   0.489619   0.733333   0.555556
    1          0.518038   0.481288   0.766667   0.588235
    2          0.50331    0.462756   0.788889   0.677966
    3          0.491052   0.456119   0.766667   0.655738
    4          0.47819    0.456757   0.788889   0.698413
edit: replaced with a cleaner version - just need to figure out better naming, see: https://github.com/fastai/fastai/issues/658
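For readers who want to check the arithmetic without PyTorch, here is a minimal pure-Python sketch of the same idea for the two-class case: the predictions arrive as rows of log-probabilities (2-D), the targets as a flat list of labels (1-D), so the predictions are exponentiated and reduced to one label per row before being compared. The function names and the toy data are made up for illustration and are not the code proposed for fastai.

```python
import math

def positive_predictions(log_preds, thresh=0.5):
    """Turn rows of [log p(class 0), log p(class 1)] into flat 0/1 predictions."""
    return [1 if math.exp(row[1]) > thresh else 0 for row in log_preds]

def fbeta(log_preds, targs, beta=1.0, thresh=0.5):
    """F-beta for the positive class, computed from log-probabilities."""
    assert beta > 0, 'beta needs to be greater than 0'
    preds = positive_predictions(log_preds, thresh)
    tp = sum(1 for p, t in zip(preds, targs) if p == 1 and t == 1)
    if tp == 0:
        return 0.0  # avoids dividing by zero when nothing is correctly flagged
    prec = tp / sum(preds)   # true positives / predicted positives
    rec = tp / sum(targs)    # true positives / actual positives
    beta2 = beta ** 2
    return (1 + beta2) * prec * rec / (beta2 * prec + rec)

# Toy data: four samples, two classes.
log_preds = [[math.log(0.9), math.log(0.1)],
             [math.log(0.2), math.log(0.8)],
             [math.log(0.4), math.log(0.6)],
             [math.log(0.7), math.log(0.3)]]
targs = [0, 1, 0, 1]

print(positive_predictions(log_preds), fbeta(log_preds, targs))
```

Here one of the two predicted positives is correct and one of the two actual positives is found, so precision and recall are both 0.5 and so is F1.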
Great, @stas. Many thanks!
One remark: your log_f1_torch did work in m.fit(lr, 3, metrics=[log_f1_torch]), but the functions recall_torch and precision_torch did not.
I made the following small changes to your definitions to make them work.
Any chance of implementing them in the fastai library?
    def recall_torch(log_preds, targs, thresh=0.5):
        preds = torch.exp(log_preds)
        # torch.max(..., dim=1) returns (values, indices); [1] keeps the predicted class index
        pred_pos = torch.max(preds > thresh, dim=1)[1]
        tpos = torch.mul((targs.byte() == pred_pos.byte()), targs.byte())
        return tpos.sum()/targs.sum()

    def precision_torch(log_preds, targs, thresh=0.5):
        preds = torch.exp(log_preds)
        pred_pos = torch.max(preds > thresh, dim=1)[1]
        tpos = torch.mul((targs.byte() == pred_pos.byte()), targs.byte())
        return tpos.sum()/pred_pos.sum()

    def fbeta_torch(log_preds, targs, beta, thresh=0.5):
        assert beta > 0, 'beta needs to be greater than 0'
        beta2 = beta ** 2
        #preds = torch.exp(log_preds)
        rec = recall_torch(log_preds, targs, thresh)
        prec = precision_torch(log_preds, targs, thresh)
        return (1 + beta2) * prec * rec / (beta2 * prec + rec)

    def f1_score_torch(log_preds, targs, thresh=0.5):
        return fbeta_torch(log_preds, targs, 1, thresh)
I was trying to avoid doing preds = torch.exp(log_preds) twice. But why did you need that change - did you call precision_torch directly? If so, then yes, it’s probably best to use your version. I thought they were just internal helper functions; you’re suggesting that they are used directly.
And, yes, it’ll be in fastai soon. See https://github.com/fastai/fastai/issues/658. I will update this thread when this is done.
Yes. I need to display them with m.fit, as follows:
    m.fit(lr, 3, metrics=[precision_torch, recall_torch, log_f1_torch])