Deep Learning for Tabular Data: An Exploratory Study - By Jan Andre Marais

I wanted to bring to light a fantastic thesis by Mr. Marais here. It’s 144 pages long and explores fastai’s tabular model in a way I’ve never seen done before. Here is a link to it. I wanted to open this thread as a discussion for it because, in my opinion, it’s well deserved. I’ll highlight the important parts/chapters worth noting and exploring. Hopefully those of you interested in the tabular field can join in on the conversation! :slight_smile:


Section 4.2:

4.2.3 - Combining features

Section 4.3:

4.3.1 - Attention
4.3.2 - Self-Normalizing Neural Networks

Section 4.4: (Literally all of it)

4.4.1 - Data Augmentation
4.4.2 - Unsupervised pretraining
4.4.3 - Regularization

Section 4.5: Interpretation

Section 4.6: Hyperparameters (such as number of layers and dropout)

Section 5:

5.4: Embedding size
5.5: Attention, SeLU and Skip Connections
5.6: Data Augmentation and Pretraining

Honestly, the code itself is pretty hard to walk through, and my attempts to reach out for parts of it went unanswered, but we could certainly try to re-implement it from his descriptions :slight_smile: (such as MixUp and SwapNoise)
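To make "re-coding from the description" concrete, here is a minimal NumPy sketch of the two augmentations mentioned, written from their usual definitions rather than from the thesis code (function names and defaults like `p=0.15` and `alpha=0.4` are my assumptions, not his):

```python
import numpy as np

def swap_noise(X, p=0.15, rng=None):
    """SwapNoise: with probability p, replace each cell with the value
    from the same column of a randomly chosen other row.
    (p=0.15 is an assumed default, not taken from the thesis.)"""
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    n_rows, n_cols = X.shape
    mask = rng.random(X.shape) < p                     # cells to corrupt
    donor_rows = rng.integers(0, n_rows, size=X.shape) # where swapped values come from
    cols = np.broadcast_to(np.arange(n_cols), X.shape)
    X_out = X.copy()
    X_out[mask] = X[donor_rows[mask], cols[mask]]
    return X_out

def mixup(X, y, alpha=0.4, rng=None):
    """MixUp: blend each row (and its target) with a random other row
    using a Beta(alpha, alpha)-distributed mixing coefficient."""
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(X))
    return lam * X + (1 - lam) * X[perm], lam * y + (1 - lam) * y[perm]
```

SwapNoise keeps every corrupted value within its own column's marginal distribution, which is what makes it a reasonable corruption for the denoising-autoencoder-style pretraining the thesis discusses.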


This is great. Thanks.

Can’t believe I never heard of this work … and it’s from last year!!! Nice find and share :slight_smile:

I’m especially interested in reading/discussing the time-series/forecasting-related topics in this work.


Very nice find. And for a master’s, this is a long thesis.

Very nice read. The unfortunate conclusion is that none of the tricks appears to have worked in his own experiments.

I think this may be a dataset-choice issue, which is why I’ll still be trying them out. The Salary dataset is extremely rough, and there are tons of other datasets to choose from. I’ll be trying the tricks on all of the datasets from the Tabular Baselines I did earlier, as that provides much broader coverage.

(For context, SOTA on Salary is only 88/89%, barely above what fastai gets.)

When you do, please add Gauss rank standardisation to the mix. Intuitively it makes a ton of sense to ensure the input features are evenly spread. And I’m betting swap noise will provide additional benefits as a form of augmentation.
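For anyone who wants to try it, the Gauss rank transform is simple to sketch: rank the values, map the ranks into (0, 1), and push them through the inverse normal CDF so the feature becomes approximately standard normal. A minimal version (my own implementation, not from the thesis; tie handling is left naive):

```python
import numpy as np
from statistics import NormalDist

def gauss_rank(x):
    """Gauss rank standardisation of a single feature column."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x))   # 0 .. n-1, ties broken arbitrarily
    p = (ranks + 0.5) / len(x)          # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf      # inverse standard normal CDF
    return np.array([inv_cdf(v) for v in p])
```

Because only the ranks matter, this wipes out skew and outliers in one step, which is exactly why it pairs well with networks that assume roughly normal inputs (e.g. SELU-based self-normalizing nets from Section 4.3.2).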


Zachary, do you recall, roughly, in which section he investigates 3 as the optimum number of layers? I’ve gone through the parts where I expected him to choose it, but I only saw the result (he was already using 3 as the optimum number).

@Pak check out Appendix A.1, page 131:

The 3-layer network seemed to consistently outperform the other networks in terms of accuracy. We observe that increasing layer width reaches a point of diminishing returns. For the sake of simplicity we would recommend using a 3-layer network with either 128 or 512 units.
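For reference, in fastai that recommendation amounts to something like `tabular_learner(dls, layers=[512, 512, 512])`. As a framework-free sketch of what such a constant-width 3-layer ReLU network looks like (my own illustration, with He initialisation assumed):

```python
import numpy as np

def init_mlp(n_in, n_out, width=512, depth=3, rng=None):
    """He-initialised weights for a `depth`-hidden-layer ReLU MLP
    of constant width, plus a linear output layer."""
    rng = np.random.default_rng(rng)
    sizes = [n_in] + [width] * depth + [n_out]
    return [(rng.normal(0.0, np.sqrt(2.0 / a), (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: ReLU on hidden layers, linear output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x
```

Note this omits the embeddings, batch norm, and dropout that the fastai tabular model wraps around the linear layers; it only illustrates the depth/width choice being discussed.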