Some Baselines for other Tabular Datasets with fastai2

Hey Everyone,

As I’ve been wanting to try out various new tabular models, I keep being drawn to the same two datasets we use without much variation (ADULT and Rossmann), so I’ve constructed a baseline (of sorts) against which fastai can be compared. The repository is here, including a notebook showing how I achieved these baselines. The baselines themselves were taken from the TabNet paper, which was published in September of 2019. The goal is to present three things: the model, its accuracy, and the total number of parameters in the model. Also, as a request: if you do work with these, try to post your feature importance as well, as there could be very interesting developments in where everyone’s models lean :slight_smile:

Onto the baselines:

The three proposed datasets are:

  • Poker Hand Induction

    • Challenge: Successfully identify the poker hand from the suit (categorical) and rank (numerical) of the five cards in your hand (10 variables total).

    • Rankings (including fastai2):

      | Model | Test Accuracy (%) |
      | --- | --- |
      | Decision Tree | 50.0 |
      | Multi-layer perceptron | 50.0 |
      | Deep Neural Decision Tree | 65.1 |
      | TabNet | 99.10 |
      | fastai2 | 99.48 |
  • Sarcos Robotics Arm Inverse Dynamics

    • Challenge: Map the 21-dimensional input space to successfully estimate the torque.

      • Rankings: (fastai won this one!)
      | Model | Test MSE | Number of Parameters |
      | --- | --- | --- |
      | Random Forest | 2.39 | 16.7K |
      | Stochastic Decision Tree | 2.11 | 28K |
      | Multi-Layer Perceptron | 2.13 | 0.14M |
      | Adaptive Neural Tree Ensemble | 1.23 | 0.60M |
      | Gradient Boosted Tree | 1.44 | 0.99M |
      | TabNet-S | 1.25 | 6.3K |
      | TabNet-M | 0.28 | 590K |
      | TabNet-L | 0.14 | 1.75M |
      | fastai2 | 0.038 | 530K |
  • Higgs Boson

    • Challenge: Using simulated data with features characterizing events detected by ATLAS, classify events into “tau tau decay of a Higgs boson” versus “background.”
    • Rankings:
    | Model | Test Accuracy (%) | Number of Parameters |
    | --- | --- | --- |
    | Sparse evolutionary trained multi-layer perceptron | 78.47 | 81K |
    | Gradient boosted tree - S | 74.22 | 120K |
    | Gradient boosted tree - M | 75.97 | 690K |
    | Multi-layer perceptron | 78.44 | 2.04M |
    | Gradient boosted tree - L | 76.98 | 6.96M |
    | TabNet - S | 78.25 | 81K |
    | TabNet - M | 78.84 | 660K |
    | fastai2 | 76.94 | 530K |

Hopefully this can help some of you keep track of reasonable goals with tabular models (and hopefully some of you beat these!)


Good afternoon muellerzr, hope you’re having a beautiful day!

I read your results and some of the associated links in this post and fastai seems to perform well against TabNet.

The biggest difference in performance looks like the Poker Hand Induction Dataset.

Have you any theories or thoughts about this?

Cheers mrfabulous1 :smiley: :smiley:

Honestly, I’m unsure as to what it could be. Poker hand is a particularly hard problem, hence why most models got <70%. In terms of performance, I believe the Sarcos Robotics Arm Inverse Dynamics dataset was a bit more eye-opening :wink: What surprises me about that one is that there were no categorical features, something I would have expected the model to need in order to perform well.


That’s a very interesting question.
Before I really tried the Poker dataset, I thought a NN could easily handle this deterministic case. Oh boy, how wrong I was…
My best result so far is 73% (fastai v1), nowhere near the 99-ish result :frowning:
My mind is split in half between ‘it should be easy’ and ‘99% doesn’t look real without heavy feature engineering’.

1 Like

Don’t worry, you’re not going crazy. No one else has been able to match it, hence why I brought it down. (Even the TF implementation.) Could you share your 73% one? That’s fantastic!

I’ll try to reproduce it tomorrow as it’s 1 am here now.
I’ve saved the learner (of 0.73), but I have some doubts about reproducibility, as after an hour of experimentation it just pushed past the 65-ish threshold and moved forward :slight_smile:
Most of the time it still feels like alchemy to me, rather than a directed search :slight_smile:

1 Like

I still think it is, with FE being us adding bits to the pot :wink:

Also @Pak, for all their baselines, zero FE was used :slight_smile:

1 Like

99% is starting to feel less impossible now.

I’ve achieved 91.7% accuracy with fastai v1
Here is the notebook

No feature engineering was used, except that I turned all the columns into categorical features.

I achieved it by switching off regularization completely. The model did overfit heavily, but that didn’t matter to me as long as validation accuracy went up.

To check everything, I tested the model with the test data. I labeled it automatically, as detecting the hand from the cards can be done algorithmically (I just did not label straight and royal flushes in the test data, but they’re a very tiny percentage of all the cases, so it should be insignificant in terms of accuracy).
The test dataset (100k rows) confirmed 91% accuracy.
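Since the labels really can be derived algorithmically, here is a minimal sketch of such a labeler for the UCI encoding (suit 1-4, rank 1-13 with ace = 1, classes 0-9). The helper name `poker_hand_class` is my own, not code from Pak’s notebook:

```python
from collections import Counter

def poker_hand_class(cards):
    """Label a 5-card hand with the UCI Poker Hand class (0-9).
    `cards` is a list of (suit, rank) pairs, suit in 1-4, rank in 1-13 (ace = 1)."""
    suits = [s for s, _ in cards]
    ranks = sorted(r for _, r in cards)
    counts = sorted(Counter(ranks).values(), reverse=True)

    flush = len(set(suits)) == 1
    # A straight is five consecutive ranks; the ace (1) may also play high.
    royal = ranks == [1, 10, 11, 12, 13]
    straight = ranks == list(range(ranks[0], ranks[0] + 5)) or royal

    if flush and royal:        return 9  # royal flush
    if flush and straight:     return 8  # straight flush
    if counts == [4, 1]:       return 7  # four of a kind
    if counts == [3, 2]:       return 6  # full house
    if flush:                  return 5  # flush
    if straight:               return 4  # straight
    if counts == [3, 1, 1]:    return 3  # three of a kind
    if counts == [2, 2, 1]:    return 2  # two pair
    if counts == [2, 1, 1, 1]: return 1  # one pair
    return 0                             # nothing in hand
```

Note this sketch does handle straights and royal flushes, the two classes skipped in the automatic labeling above.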


Awesome job @Pak

Here is the link to the code I mentioned earlier, @muellerzr!

I could get 99.48% with fastai2 on the poker rule induction dataset (go to cell #130 of the notebook). It was kind of a pain (lots of epochs), but I suspect it could get even better because the validation loss was still decreasing. I don’t know if we could speed up the training with some trick.

I managed to get 99.10% with TabNet, and suspect it could also get better with more epochs.



Great work @fmobrj75! I’m going to quickly run it now (with possibly some modifications) and then I’ll give you some thoughts of why that could be working :slight_smile:

Great! Thanks!

@fmobrj75 if I had to guess, we’re able to extract more information by having it as both a numerical feature (which suits this data) and a categorical one (which also suits it). I’m running a quick comparison/reproducibility test and I’ll update with the results, but here’s what I’m doing:

500 epochs total with a LR of 1e-1:

  • Adam + One Cycle
  • Ranger + Flat Cos
  • Mish + Adam
  • Mish + Ranger

Also, each is run 5 times, so their averages will be reported once done (this is important, I think).
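The dual-representation idea (feeding the same column to the model both as a categorical and as a continuous feature) can be sketched in plain Python. The helper `branch_rank_columns` and the column names are made up for illustration; in practice you would apply something like this to the DataFrame before building the fastai `TabularPandas`:

```python
def branch_rank_columns(rows):
    """Duplicate every rank column so it appears twice in the encoded row:
    once as an integer category (for an embedding lookup) and once as a
    raw continuous value. `rows` is a list of dicts, one per example."""
    out = []
    for row in rows:
        enc = dict(row)
        for col in [c for c in row if c.startswith("rank")]:
            enc[col + "_cat"] = int(row[col])    # categorical branch
            enc[col + "_num"] = float(row[col])  # continuous branch
            del enc[col]                         # drop the original column
        out.append(enc)
    return out
```

The design choice is simply that the embedding branch can learn arbitrary per-rank behavior while the continuous branch preserves ordering for free.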

@fmobrj75 I wrote a callback to help keep track of our best accuracies. What I can tell you right now is that with Adam I achieve ~99.46-99.48 around epoch 320-330, so not nearly as bad as the 500-600 you were doing :slight_smile:
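For reference, the core of a best-accuracy tracker can be sketched framework-agnostically. The class and method names below are mine; a real fastai2 version would subclass `Callback` and do this in `after_epoch` using the recorded metrics:

```python
class BestMetricTracker:
    """Remember the best validation metric seen so far and when it occurred."""

    def __init__(self):
        self.best = float("-inf")
        self.best_epoch = None

    def after_epoch(self, epoch, metric):
        # Update the running best whenever the new metric improves on it.
        if metric > self.best:
            self.best, self.best_epoch = metric, epoch
```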

I decided to keep it on just Adam; I had a few issues with the learning rate on Ranger (note I was fitting for 400 epochs straight in total). I did get a very tight grouping though:

Accuracy Mean/Std: 99.44/0.029%
Num Epochs Mean/Std: 299.4/27.68.
Total Computation Time to get there: 1 minute 36 seconds
Great work @fmobrj75 :smiley:

Here’s that notebook:

Quoting Jeremy on what to call it:

“Branched covariant embeddings” AKA my categorical feature is also numerical

“BREMCO: BRanched Embedded COvariates”



Fabio, Zachary, that’s really awesome.

And I’m starting to think about what this tells us about TabNet (and maybe other specialized architectures). Does this mean that plain fully connected networks are still better (not worse) than fancy new architectures?

1 Like

Quite possibly. I’ll be looking into this the next few months or so, but it seems to be that way.

@Pak, to add a bit more to that: these fancy new architectures also aren’t as efficient. TabNet required 2x the training time per epoch, for even more epochs than our architecture took, and had 2-16x the total parameters. So I’m not sold on it yet. There are obviously more to try, like DeepGBM, but I feel these simple fully connected models do the job more than well enough.

TabNet was pitched as different because we could explain our models ‘better’, but isn’t that what FI and dependency plots already give us? What is its weakness right now? (Would love your thoughts on that, @Pak)

More on the FI: if we wanted to see attention, we could simply choose to look at different FI values based on the respective ‘y’s from a labeled dataset, no?

To be honest, I just couldn’t quite work out what the interpretability TabNet provides actually means.
As I understood it, they claim they can determine feature importance. OK, but that can also be achieved with other techniques.
Correct me if I’m wrong, but they also show which features are “fired up” at each step (and I think for each example). In theory this could be a useful tool for understanding how it makes its decisions.
But to be honest, for now I don’t really understand how to ‘read’ their layer-feature-interpretation pictures.

1 Like

From what I saw, the more “lights” a feature shows, the more it was used. Which, as we know, can also be done with feature importance. So no, you’re not wrong; that’s the same conclusion I came to. It’s just that now people have a picture, I think, rather than a computation?