Some Baselines for other Tabular Datasets with fastai2

Great! Would love more explanation of the notebook if possible too :slight_smile: (if not I’ll get to it later; I’m getting familiar with SHAP as I go)

Big thanks to @JonathanR, his v1 code helped tremendously

(also grabbing cols will be easier in the next version update)

1 Like

I will start it once I finish porting manifold mixup to V2 (so Sunday, I think).

My aim is to put the code in a .py file (that people can just drop into their projects), refactor it to improve the API if possible, add some docs, a readme with explanations/links, and a demo notebook to illustrate it.

(and maybe a V1 equivalent)

1 Like

Look into nbdev (if you haven’t already)! If I can’t port my implementation over into the fastai2 library directly (for whatever reason), I’ll be doing the same so people can just pip install fastaishap.

1 Like

This can already be done with tools such as SHAP, but it is nice to have it baked into the model and easily accessible.

So we should look at using SHAP with fastai and possibly provide examples for people. While it’s not in the model itself, ideally it’s still available in a line of code.
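For instance, here is a minimal sketch of wrapping a fastai tabular model so SHAP’s KernelExplainer can call it. The learner `learn`, the `processed_df` background sample, and the column handling are all assumptions for illustration, not a tested recipe:

```python
import numpy as np
import shap
import torch

model = learn.model.eval().cpu()              # assumes a trained TabularLearner `learn` on CPU
n_cat = len(learn.dls.train_ds.cat_names)     # categorical code columns assumed to come first

def predict_fn(x):
    # x: numpy array of already-processed rows (categorical codes, then continuous values)
    x = np.asarray(x)
    x_cat = torch.tensor(x[:, :n_cat], dtype=torch.long)
    x_cont = torch.tensor(x[:, n_cat:], dtype=torch.float)
    with torch.no_grad():
        return model(x_cat, x_cont).numpy()

background = processed_df.values              # small sample of processed training rows (placeholder)
explainer = shap.KernelExplainer(predict_fn, background)
shap_values = explainer.shap_values(background[:10])   # explain a handful of rows
```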

I would like to point out one relevant aspect: the required assumptions.

SHAP uses interventional substitutions of the feature values and makes predictions based on those modified samples. It requires independence between the features if we do not want to cause domain shift. This was pointed out recently in https://arxiv.org/pdf/1910.13413.pdf and earlier in https://christophm.github.io/interpretable-ml-book/shap.html.

Simply speaking, because of this causality problem some of the Explainers in SHAP might give incorrect results if the independence assumption is not fulfilled. Unfortunately, in tabular data the features are (more often than not) not independent.
The documentation itself mentions this in some places, but it doesn’t state clearly which Explainers are affected and which are not. For instance, TreeExplainer seems to be OK if

feature_perturbation="tree_path_dependent". https://shap.readthedocs.io/en/latest/#shap.TreeExplainer

but that’s not useful for NNs.
SamplingExplainer always requires that assumption. KernelExplainer seems to me to be affected too, because again the variables are ‘set’:

To determine the impact of a feature, that feature is set to “missing” and the change in the model output is observed. https://shap.readthedocs.io/en/latest/#shap.KernelExplainer

DeepExplainer, in the original paper (https://arxiv.org/abs/1705.07874), seems to be under the same assumption.
GradientExplainer might be a good candidate, but I have not read the referenced paper.
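To make the TreeExplainer case concrete, here is a hedged sketch contrasting the two feature_perturbation modes (the RandomForestClassifier, X, and y are placeholders, not anything from this thread):

```python
import shap
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier().fit(X, y)

# Path-dependent mode: follows the trees' own splits, no background samples are mixed in,
# so it avoids the feature-independence assumption discussed above.
expl_pd = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
sv_pd = expl_pd.shap_values(X)

# Interventional mode: feature values are replaced with values drawn from the background
# data, which implicitly assumes the features are independent.
expl_int = shap.TreeExplainer(model, data=X.sample(100),
                              feature_perturbation="interventional")
sv_int = expl_int.shap_values(X)
```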

Forgive me the long post.

2 Likes

Great analysis @hubert.misztela! I wasn’t saying we should simply be “done” there, as those are certainly problems. Could those possibly be addressed through dependence plots as well, since now we can see the “whole board”? IIRC I saw that they are supported.

Setting a feature to “missing” is the same idea as permutation importance, just with no value at all (something we’ve used extensively). It’s through this that we get to see what’s being affected, and from what you say they’re doing something extremely similar. (Or am I completely off the ball here?)
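For reference, a minimal permutation-importance sketch; the `score` function (returning a validation metric for a dataframe) and `df` are assumed placeholders:

```python
import numpy as np

def permutation_importance(score, df, cols, n_rounds=3):
    base = score(df)                                   # baseline metric on unshuffled data
    importances = {}
    for col in cols:
        drops = []
        for _ in range(n_rounds):
            shuffled = df.copy()
            shuffled[col] = np.random.permutation(shuffled[col].values)
            drops.append(base - score(shuffled))       # metric drop when this column is scrambled
        importances[col] = float(np.mean(drops))
    return importances
```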

I’ll also note I just started playing with this tool; I have much to learn :slight_smile: (and fantastic explanation and coverage!!!)

And of course it’s a great thread!
We needed to show some love for tabular data :stuck_out_tongue:

2 Likes

Yes, there’s a bit of irony here: we humans pack most of our data into tables (I mean that if you ask a random person what ‘real’ data is, a financial Excel sheet or a picture of a dog, I think most will choose the first one), and yet it seems we simply don’t know how to process this tabular data efficiently with NNs, with any method more elaborate than a simple fully connected network :wink:

1 Like

And as for the problem hubert mentioned (dependent features), it seems like a very big deal in the real data I’ve encountered. In fact, it’s sometimes just hard to find a single isolated feature (a value in a column that you can change to another value without changing the other columns as well). That’s why I’m starting to think that, more often than not, we should compute feature importance not for a single column but for a pair (or more) of dependent columns as a whole (maybe we should do some correlation analysis first or use domain knowledge) :frowning:

1 Like

@muellerzr if I understand you correctly, you would like to check for independence of the variables by inspecting the dependence plot, correct? Well, in some cases, yes. If we observe statistical dependence and we understand the variables well enough to assume there might be a causal link, we could say that the assumptions are not fulfilled, but that’s not always the case. We could observe dependence while in reality changing one variable (as the point here is about actively changing variables: setting, intervening, you name it) would not affect the other, which some people call spurious correlation, and then the assumptions are fulfilled. On the other hand, we could see no dependence at all in our dataset, while the variables are linked causally.

It is about causal relations between the variables rather than about statistical independence.

How to do causal discovery (discover causal links between the variables) is still an open question. In theory it is possible only up to a point, but in practice meta-learning methods appear promising in some cases. Usually, in practice, it is easy to spot potential links between variables, so we have to refute the assumption even when we would rather keep it (i.e. assume independence)…

1 Like

But recording how each variable was used during training is doable, so I like the attention idea. Nevertheless, a fully connected NN should have that information; the question is how to extract it.

1 Like

What if, say, we start with a super small random subset (say n=100-1000), and then go from one column at a time up to all columns at a time to find the pair relationships? Or how high should we go? I say very small because that has the potential to be n^2-1 complexity. What’s too much? Would, say, 50% of the columns be fine? Lots of questions :slight_smile:

Timing shouldn’t be much of an issue as these are super fast models (and they can be pushed further once RAPIDS is fully integrated). Let me know @pak or @hubert.misztela :slight_smile:

1 Like

Given a trained model f(x0,...,xn), we might be able to learn the feature importance by simulating the attention.

For each feature xi, we can associate an uninformative value mi (the mean, or the mean of the embedding). We can then create a dampened feature xi' = ai*xi + (1-ai)*mi with the sum of all ai equal to 1 (this can be enforced by a softmax).

We now maximize the accuracy of f(x0',...,xn') by optimizing the ai; these are our importances.
(That’s a way of asking the network to make a decision: which features can be dropped and which features bring information to the plate?)

There is one hidden parameter here, which is the fact that the ai sum to one: I see no particular reason to use 1 and not another number between 1 and n-1 (given n features).
Let’s call this parameter k; in a way it is what we think the number of important features is (my gut feeling is that sqrt(n) should be a good default).
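A rough PyTorch sketch of the idea, assuming for simplicity a trained model f that takes a single continuous tensor (so the categorical/embedding handling is glossed over) and using k=1 via a plain softmax:

```python
import torch
import torch.nn.functional as F

def learn_importances(f, x, y, loss_fn, steps=200, lr=0.1):
    """Learn one importance a_i per feature for a trained model f on data (x, y)."""
    for p in f.parameters():                      # freeze the model; only the a_i are learned
        p.requires_grad_(False)
    f.eval()
    m = x.mean(dim=0, keepdim=True)               # uninformative value m_i (column means)
    logits = torch.zeros(x.shape[1], requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        a = F.softmax(logits, dim=0)              # a_i >= 0 and sum(a_i) = 1
        x_damped = a * x + (1 - a) * m            # x_i' = a_i*x_i + (1 - a_i)*m_i
        loss = loss_fn(f(x_damped), y)            # keep f accurate while damping features
        opt.zero_grad(); loss.backward(); opt.step()
    return F.softmax(logits, dim=0).detach()      # the learned importances a_i
```

Multiplying the softmax output by k instead of 1 would give the generalization with sum(ai) = k mentioned above.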

I think we can also get a hint of how dependent a pair of columns is by comparing the sum of the permutation feature importances of each of the two columns with the importance of the pair permuted as a whole.

1 Like

So the experiment would be to take, say, 5 features overall, run this pair-wise test, and see how they compare, yes? If the sums are similar, then we can assume it as such. If not, then we try something else? Perhaps something akin to @nestorDemeure’s idea.
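As a concrete (and entirely hypothetical) version of that pair-wise test, building on a permutation-importance helper like the one sketched earlier in the thread; `score` and `df` are again assumed placeholders:

```python
import numpy as np
from itertools import combinations

def group_importance(score, df, cols, n_rounds=3):
    """Metric drop when the columns in `cols` are permuted jointly (one shared row shuffle)."""
    base = score(df)
    drops = []
    for _ in range(n_rounds):
        shuffled = df.copy()
        perm = np.random.permutation(len(df))
        for col in cols:
            shuffled[col] = shuffled[col].values[perm]
        drops.append(base - score(shuffled))
    return float(np.mean(drops))

def pairwise_dependence_hints(score, df, cols):
    singles = {c: group_importance(score, df, [c]) for c in cols}
    hints = {}
    for a, b in combinations(cols, 2):
        pair = group_importance(score, df, [a, b])
        hints[(a, b)] = pair - (singles[a] + singles[b])   # a large gap hints at interaction/dependence
    return hints
```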

What if I told you that even if you measure all of the combinations and they are statistically independent in your dataset, you still can’t be sure that there is no relationship between them? Why? Because it is only your dataset, and there could be a specific distribution phenomenon (I think it’s related to the domain shift problem recently brought up by @jeremy on Twitter: https://twitter.com/jeremyphoward/status/1223305148182609920?s=20).
On the other hand, if you observe that two variables are not independent, it does not necessarily mean that you cannot manipulate them for simulation the way SHAP does.
Neither outcome is certain.

Of course, if we want to learn more about the variables for model-training purposes (being aware that the results can be misleading), that is OK, but I would not use it for model interpretation in a real-life use case.

What could be done without the burden of independence assumptions is to use methods which don’t simulate interventions.

1 Like

Could you provide any examples of how to do so? :slight_smile:

@nestorDemeure this direction of thinking seems interesting.
So you would get ai values as the output, one for each feature.
Intuitively, what you have proposed is what attention in TabNet does, but only at the feature level (no sample-level explanation) and after training of the model, right?
The question is how that would summarize different distributions of importance (a feature very important but only for a few samples vs. frequently important but just a bit).
Would that be easier to test experimentally or to analyze from the math point of view? :thinking:
Have you tried to implement it?

I’d suggest moving the conversation to Feature Importance in deep learning.

2 Likes

Ran some experiments with the Poker dataset using @fmobrj75 and @muellerzr’s setup, with Ranger, Mish, and longer training. Achieved a new average valid-accuracy high of 0.99576 with ReLU at 600 epochs:

Average Accuracy: 0.99576
Accuracy Std: 0.00046
Average Total Epochs: 447.20
Epochs Std: 85.08 

There appear to be diminishing returns from training longer: 800 epochs resulted in 0.99540.

Using Mish instead of ReLU in tabular_learner resulted in lower average scores but higher variance. 400 epochs had an average accuracy of 0.98461, but a max of 0.995 and a min of 0.951. Increasing epochs consistently resulted in better accuracy; 800 epochs had an average accuracy of 0.99432.

Gist with the Mish & ReLU results.

I didn’t have time to test multiple runs with Ranger, Mish, and fit_flat_cos. Training with Ranger and Mish resulted in significantly worse generalization until I increased dropout, and even then it lagged behind Adam and fit_one_cycle. RangerQH appeared to work better than Ranger.

The best Ranger and RangerQH results were with dropout of ps=[0.02,0.02,0.02] with lr=1e-2 and ps=[0.02,0.01,0.02] with lr=5e-2, with solo-run results of ~0.98 at 400 epochs for RangerQH.
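For anyone wanting to reproduce something along these lines, here is a rough sketch of the kind of configuration described, assuming the fastai v2 tabular API; the dataframe, column lists, layer sizes, and batch size are placeholders rather than the exact setup used above:

```python
from fastai.tabular.all import *

dls = TabularDataLoaders.from_df(
    df, y_names='CLASS', y_block=CategoryBlock(),
    cat_names=cat_names, cont_names=cont_names,
    procs=[Categorify, Normalize], bs=1024)

# Mish activation and low dropout, trained with Ranger + fit_flat_cos
config = tabular_config(ps=[0.02, 0.02, 0.02], act_cls=Mish())
learn = tabular_learner(dls, layers=[200, 100, 100], config=config,
                        opt_func=ranger, metrics=accuracy)
learn.fit_flat_cos(400, lr=1e-2)
```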

3 Likes

Perhaps large fancy architectures are like large fancy minds…
as an analogy…
That sometimes a focused, simple mind has less capacity to run in circles and trick itself.
Very smart people have the ability to trick themselves in all manner of ways.

to put it more bluntly:
a simpler architecture cannot afford to waste resources and is forced to be more direct,
so having a network that better fits the actual problem constrains the network to the problem

just a thought