How to determine which variables are most closely associated with the target? Tabular Lesson 4

mike00 · July 10, 2020, 1:41pm

Lesson 4 Tabular predicts a binary answer (<= or > x) based on 10 other pieces of information. My problem differs a bit and I need some help.

I have access to a large tabular data file on Alzheimer’s. Each record includes the age of onset of the disease. Ages of onset range from 15 to 105, so there are about 90 possible values. I had no trouble predicting a non-binary target; the tabular model seems to have that capability built in. But that doesn’t really answer the question I have.

I need to find the variables that have the greatest correlation with the age of onset. In addition to age of decline, there are 964 other variables in each record (there are about 37k records).

Can anyone suggest how to get this to work? Please don’t be shy about using baby words and simple concepts. I’m a newbie and really not skilled.

Obviously ignoring my plea for simple concepts, @muellerzr suggested these resources:
Fastinference: https://muellerzr.github.io/fastinference/
FI discussion: Feature importance in deep learning
Pavel’s notebooks: https://github.com/Pak911/fastai2-tabular-interpretation

Just kidding, Zachary! Thank you for the help.

Anybody else got anything?

Ezno · July 10, 2020, 4:22pm

Here are a few simple ideas for interpretation that I got from Jeremy in his Intro to Machine Learning course. He was talking about random forests, but I don’t see any reason the same thing wouldn’t work for a neural network or other models. Of course, if the goal is interpretation on tabular data, a random forest may be ideal to work with for at least part of the project.

Train your model, then predict on the validation set and record the score. Then randomly shuffle one feature and predict on the validation set with the shuffled column (same model, no need to retrain). Record how much lower the score is. Repeat for every feature, and you can see which features hurt predictions the most when shuffled (i.e. when the column is no longer predictive of anything, but its mean and standard deviation are unchanged). This is a no-assumptions way of getting feature importance.
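The shuffling procedure can be sketched in plain Python as below. This is only an illustration under stated assumptions: `toy_predict` is a made-up stand-in for a trained model, the data is synthetic, and the "score" is RMSE, so a more important feature shows up as a larger increase in error rather than a lower score. The helper names (`rmse`, `permutation_importance`) are hypothetical and not part of fastai or fastinference.

```python
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def permutation_importance(predict, X_valid, y_valid, n_repeats=5, seed=0):
    """Return {column index: mean increase in RMSE when that column is shuffled}.

    `predict` is any callable mapping a list of rows to a list of predictions;
    the model is never retrained.
    """
    rng = random.Random(seed)
    baseline = rmse(y_valid, predict(X_valid))
    importances = {}
    for col in range(len(X_valid[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column only; every other column stays untouched.
            shuffled = [row[col] for row in X_valid]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X_valid, shuffled)]
            increases.append(rmse(y_valid, predict(X_perm)) - baseline)
        importances[col] = sum(increases) / n_repeats
    return importances

# Hypothetical stand-in for a trained model: uses columns 0 and 1, ignores column 2.
def toy_predict(X):
    return [60 + 3.0 * row[0] + 0.5 * row[1] for row in X]

# Synthetic validation data (assumption, not the Alzheimer's data set).
data_rng = random.Random(42)
X_valid = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
y_valid = [60 + 3.0 * r[0] + 0.5 * r[1] + data_rng.gauss(0, 0.1) for r in X_valid]

importances = permutation_importance(toy_predict, X_valid, y_valid)
# Columns ordered from most to least important.
ranking = sorted(importances, key=importances.get, reverse=True)
print(ranking)
```

With a real model, `predict` would wrap the trained learner's prediction call, and the ranking would cover all of the input columns.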
Train your model. Take a variable (e.g. Age) in your validation set and set it to a constant (e.g. 60) for all rows. Predict and record the results for age 60. Then set Age to 61 for all rows and predict. Then set Age to 62 for all rows and predict. Keep going, then graph them all to get a partial dependence plot, which shows how specific values of a given feature relate to the dependent variable, all other things being equal.
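A minimal sketch of the set-to-a-constant procedure follows, again in plain Python and under stated assumptions: `toy_predict` is a made-up stand-in for a trained model, the rows are synthetic, and `partial_dependence` is a hypothetical helper name, not a library function. It returns the (value, average prediction) pairs you would then graph.

```python
import random

def partial_dependence(predict, X_valid, col, grid):
    """For each value in `grid`, overwrite column `col` with that value in
    every row, predict, and average the predictions."""
    curve = []
    for value in grid:
        X_fixed = [row[:col] + [value] + row[col + 1:] for row in X_valid]
        preds = predict(X_fixed)
        curve.append((value, sum(preds) / len(preds)))
    return curve

# Hypothetical stand-in for a trained model; column 0 plays the role of Age.
def toy_predict(X):
    return [0.5 * row[0] + 2.0 * row[1] for row in X]

# Synthetic validation rows (assumption, for illustration only).
rng = random.Random(0)
X_valid = [[rng.uniform(40, 90), rng.gauss(0, 1)] for _ in range(200)]

# Set Age to 60, 61, ..., 65 for all rows and record the mean prediction each time.
curve = partial_dependence(toy_predict, X_valid, col=0, grid=range(60, 66))
for value, mean_pred in curve:
    print(value, round(mean_pred, 2))
```

Plotting the first element of each pair against the second gives the partial dependence plot for that feature.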

Feel free to watch this video for more information. http://course18.fast.ai/lessonsml1/lesson4.html

muellerzr · July 10, 2020, 4:25pm

What @Ezno says. Both permutation importance and partial dependence plots are supported natively in fastinference. (PDPs are still in the works: I need to fix a few bugs, but they’re available in the dev version.) Both are also in Pak’s work, though.