Some Baselines for other Tabular Datasets with fastai2

And of course it’s a great thread!
We needed to show some love for tabular data :stuck_out_tongue:

2 Likes

Yes, there’s a bit of irony here: we humans pack most of our data into tables (if you ask a random person what ‘real’ data is, a financial Excel sheet or a picture of a dog, I think most will choose the former), and yet it seems we simply don’t know how to process tabular data efficiently with NNs, with any method more elaborate than a plain fully connected network :wink:

1 Like

And as for the problem hubert mentioned (dependent features), it seems like a very big deal in the real data I’ve encountered. In fact, it’s sometimes hard to find a single isolated feature (a feature whose value you can change to another value from that column without having to change other columns as well). That’s why I’m starting to think that, more often than not, we should compute feature importance not for a single column but for a pair (or more) of dependent columns as a whole (maybe after some correlation analysis first, or using domain knowledge) :frowning:

1 Like

@muellerzr if I understand you correctly, you would like to check for independence of the variables by inspecting the dependence plot, correct? In some cases, yes. If we observe statistical dependence and we understand the variables well enough to assume there might be a causal link, we could say the assumptions are not fulfilled, but that’s not always the case. We could observe dependence while, in reality, changing one variable (and what matters here is actively changing it: setting, intervening, you name it) would not affect the other, which some people call a spurious correlation, and then the assumptions are fulfilled. On the other hand, we could see no dependence at all in our dataset while the variables are in fact causally linked.

It is about causal relations between the variables rather than about statistical independence.

How to do causal discovery (discover causal links between the variables) is still an open question. In theory it is only possible up to a point, but in practice meta-learning methods look promising in some cases. Usually it is easy to spot potential links between variables in practice, which lets us refute the assumption, whereas what we would really like is to keep it (i.e. to assume independence)…

1 Like

But recording how each variable was used during training is doable, so I like the attention idea. Nevertheless, a fully connected NN should already contain that information; the question is how to extract it.

1 Like

What if we start with a super small random subset (say n=100-1000), and then go from one column at a time up to all columns at a time to find the pair relationships? Or how high should we go? I say very small because this has the potential to be n^2-1 complexity. What’s too much? Would, say, 50% of the columns be fine? Lots of questions :slight_smile:

Timing shouldn’t be much of an issue as these are super fast models (that can be pushed further once Rapids is fully integrated). Let me know @pak or @hubert.misztela :slight_smile:

1 Like

Given a trained model f(x0,...,xn), we might be able to learn feature importances by simulating attention.

For each feature xi, we can associate an uninformative value mi (the mean, or the mean of the embedding). We can then create a dampened feature xi' = ai*xi + (1-ai)*mi, with the sum of all ai equal to 1 (this can be enforced by a softmax).

We now maximize the accuracy of f(x0',...,xn') by optimizing the ai; these are our importances.
(That’s a way of asking the network to make a decision: which features can be dropped and which features bring information to the table?)

There is one hidden parameter here, which is the fact that the ai sum to one: I see no particular reason to use 1 and not another number between 1 and n-1 (given n features).
Let’s call this parameter k; in a way it is what we think the number of important features is (my gut feeling is that sqrt(n) would be a good default).
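
A minimal PyTorch sketch of that idea, assuming continuous inputs and a classifier that takes the raw feature matrix; the function name, the use of cross-entropy as a proxy for accuracy, and the optimizer settings are my assumptions, not part of the proposal:

```python
import torch
import torch.nn.functional as F

def learn_importances(model, X, y, m, k=1.0, steps=200, lr=0.1):
    """Learn per-feature mixing weights a_i for a trained classifier `model`.

    X : (n_samples, n_features) float tensor of (continuous) inputs
    y : (n_samples,) long tensor of targets
    m : (n_features,) tensor of uninformative values (e.g. column means)
    k : the "budget" the a_i must sum to (1.0 in the original proposal)
    """
    for p in model.parameters():              # keep the trained model frozen
        p.requires_grad_(False)
    n_features = X.shape[1]
    logits = torch.zeros(n_features, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        a = F.softmax(logits, dim=0) * k      # a_i >= 0 and sum(a_i) == k
        X_damp = a * X + (1 - a) * m          # dampened features x_i'
        loss = F.cross_entropy(model(X_damp), y)  # maximizing accuracy ~ minimizing loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (F.softmax(logits, dim=0) * k).detach()  # the learned importances a_i
```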

I think we can also get a hint about how dependent a pair of columns is by comparing the sum of the permutation feature importances of each of the two columns with the permutation feature importance of the pair as a whole.

1 Like

So the experiment would be to take, say, 5 features overall, run this pair-wise test, and see how they compare, yes? If the sums are similar, then we can treat the pair as roughly independent. If not, then we try something else? Perhaps something akin to @nestorDemeure’s idea. See the sketch below.
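
A rough sketch of that pair-wise check, assuming a NumPy feature matrix and a `score_fn(X, y)` closure around an already-trained model; the helper names here are just for illustration:

```python
import numpy as np

def permutation_importance(score_fn, X, y, cols, n_repeats=5, rng=None):
    """Drop in score when the given columns are shuffled (jointly)."""
    rng = np.random.default_rng(rng)
    base = score_fn(X, y)
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        perm = rng.permutation(len(X))
        X_perm[:, cols] = X[perm][:, cols]    # permute the selected columns together
        drops.append(base - score_fn(X_perm, y))
    return float(np.mean(drops))

def pair_dependence_hint(score_fn, X, y, i, j):
    """Compare importance of the pair vs the sum of the individual importances."""
    imp_i = permutation_importance(score_fn, X, y, [i])
    imp_j = permutation_importance(score_fn, X, y, [j])
    imp_ij = permutation_importance(score_fn, X, y, [i, j])
    return imp_ij - (imp_i + imp_j)           # far from 0 hints at an interaction
```

Here `score_fn` could be something like `lambda X, y: accuracy_score(y, model.predict(X))`; a value of `pair_dependence_hint` far from zero would suggest the two columns interact and might be better treated as a unit.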

What if I told you that even if you measure all of the combinations and they are statistically independent in your dataset, you still can’t be sure there is no relationship between them? Why? Because it is only your dataset, and there could be a specific distributional phenomenon at play (I think this is related to the domain shift problem recently brought up by @jeremy on Twitter: https://twitter.com/jeremyphoward/status/1223305148182609920?s=20).
On the other hand, if you observe that two variables are not independent, it does not necessarily mean that you cannot manipulate them for a simulation like SHAP.
Neither outcome is certain.

Of course, if we want to learn more about the variables for model-training purposes (being aware that the results can be misleading), that is OK, but I would not use it for model interpretation in a real-life use case.

What could be done without the burden of independence assumptions is to use methods which don’t simulate interventions.

1 Like

Could you provide any examples of how to do so? :slight_smile:

@nestorDemeure this direction of thinking seems interesting.
So as the output you would get the ai values, one for each feature.
Intuitively, what you have proposed is what the attention in TabNet does, but only at the feature level (no sample-level explanation) and after training of the model, right?
The question is how that would summarize different distributions of importance (a feature that is very important but only for a few samples vs. one that is frequently important but just a bit).
Would that be easier to test experimentally or to analyze from the math point of view? :thinking:
Have you tried to implement that?

I’d suggest moving the conversation to Feature Importance in deep learning.

2 Likes

Ran some experiments on the Poker dataset with @fmobrj75 and @muellerzr’s setup, trying Ranger, Mish, and longer training. Achieved a new average valid accuracy high of 0.99576 with ReLU at 600 epochs:

Average Accuracy: 0.99576
Accuracy Std: 0.00046
Average Total Epochs: 447.20
Epochs Std: 85.08 

There appear to be diminishing returns for training longer; 800 epochs resulted in 0.99540.

Using Mish instead of ReLU in tabular_learner resulted in lower average scores but higher variance: 400 epochs had an average accuracy of 0.98461, with a max of 0.995 and a min of 0.951. Increasing epochs consistently resulted in better accuracy; 800 epochs had an average accuracy of 0.99432.

Gist with the Mish & ReLU results.

I didn’t have time to test multiple runs with Ranger, Mish, and fit_flat_cos. Training with Ranger and Mish resulted in significantly worse generalization until I increased dropout, and even then it lagged behind Adam and fit_one_cycle. RangerQH appeared to work better than Ranger.

The best Ranger and RangerQH results were with dropout of ps=[0.02,0.02,0.02] at lr=1e-2 and ps=[0.02,0.01,0.02] at lr=5e-2 respectively, with single-run results of ~0.98 at 400 epochs for RangerQH.
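
For anyone who wants to reproduce something close to this, here’s a minimal fastai2 sketch of the Mish + Ranger setup, assuming a `dls` TabularDataLoaders has already been built from the Poker data (the layer sizes below are placeholders, not necessarily the ones used in the runs above):

```python
from fastai.tabular.all import *

# assuming `dls` is a TabularDataLoaders already built from the Poker dataset
cfg = tabular_config(ps=[0.02, 0.02, 0.02],          # per-layer dropout, as discussed above
                     act_cls=Mish())                 # swap the default ReLU for Mish
learn = tabular_learner(dls,
                        layers=[200, 100, 50],       # placeholder layer sizes
                        config=cfg,
                        opt_func=ranger,             # Ranger optimizer
                        metrics=accuracy)
learn.fit_flat_cos(400, lr=1e-2)                     # flat LR followed by cosine anneal
```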

3 Likes

Perhaps large fancy architectures are like large fancy minds…
as an analogy…
That sometimes a focused, simple mind has less capacity to run in circles and trick itself.
Very smart people have the ability to trick themselves in all manner of ways.

To put it more bluntly:
a simpler architecture cannot afford to waste resources and is forced to be more direct,
so having a network that better fits the actual problem constrains the network to the problem.

just a thought