Why aren't there many pre-trained models for tabular data?

In the first lesson, when discussing the tabular problem of predicting high-income earners, the instructors write:

There is no pretrained model available for this task (in general, pretrained models are not widely available for any tabular modeling tasks, although some organizations have created them for internal use)…

However, no reason is given for this lack of pre-trained models for tabular problems. I’m curious what it is about tabular problems that makes them harder to offer pre-trained models for, compared to (for example) image classification or language models.

I guess it’s possible, but the type of data differs from images. When analyzing images, the body network finds relations between pixels, generating a latent feature space. This space is often independent of image size, type of objects, etc. (since the network has seen many different patterns), and the latent features are sent to the “head” for classification. This approach generalizes to almost any image.

The current implementation of the network for tabular data is only two layers and takes both continuous and categorical data as input. This means you would somehow need to pre-train a network on many columns and then match your current data to that input. Let’s say you pre-train on the “bulldozers” dataset and then want to use that as a body for, e.g., fraud detection for bank accounts, or predicting electrical consumption from the number of users, time of day, etc. The relations between the columns are totally different.
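To make the mismatch concrete, here is a minimal NumPy sketch (the column counts, layer size, and dataset names are invented for illustration, not taken from the lesson): the first layer’s weight matrix is tied to the number of input columns, so weights pre-trained on one schema simply do not fit a dataset with a different schema.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pretrained" first layer for a bulldozers-style dataset:
# 12 input columns -> 64 hidden units (numbers are illustrative).
W_pretrained = rng.normal(size=(12, 64))

def first_layer(x, W):
    # A single linear layer with ReLU; the weight shape is tied to the
    # column count of the dataset it was trained on.
    return np.maximum(x @ W, 0.0)

bulldozer_row = rng.normal(size=(1, 12))
print(first_layer(bulldozer_row, W_pretrained).shape)  # works: (1, 64)

fraud_row = rng.normal(size=(1, 7))  # a fraud dataset with only 7 columns
try:
    first_layer(fraud_row, W_pretrained)
except ValueError as e:
    print("cannot reuse the pretrained layer:", e)
```

Even before asking whether the learnt relations transfer, the shapes themselves rule out naive weight reuse.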
Text analysis also uses embeddings, like the categorical columns in tabular data. When using transfer learning for text analysis, you first match all similar words to keep their embeddings, then create random weights for the new words that don’t exist in the pre-trained model.
If we used the same approach for tabular data, we would somehow need to find similar categories between the different datasets.
I guess it’s possible, but it’s not straightforward with the current implementation and the wide variation between tabular datasets.
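As a rough sketch of what that matching could look like (the category names and embedding size below are invented for illustration), you could reuse pretrained embedding vectors for categories that appear in both datasets and randomly initialise the rest, just as is done for out-of-vocabulary words in text transfer learning:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 4

# Hypothetical pretrained embeddings for one categorical column,
# keyed by category value (names are illustrative).
pretrained = {
    "excavator": rng.normal(size=emb_dim),
    "loader": rng.normal(size=emb_dim),
    "dozer": rng.normal(size=emb_dim),
}

# Categories appearing in the new dataset's corresponding column.
new_categories = ["loader", "crane", "dozer"]

new_embeddings = {}
for cat in new_categories:
    if cat in pretrained:
        new_embeddings[cat] = pretrained[cat]            # reuse learnt vector
    else:
        new_embeddings[cat] = rng.normal(size=emb_dim)   # random init

print(sorted(new_embeddings))
```

The hard part, of course, is deciding when two categories in different datasets really are “the same” — there is no shared vocabulary the way there is for words.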


Why do we use pretrained models? We want to utilise the patterns that the model has learnt from a lot of data. Why mainly images and text? Because the patterns the model learns there are common and can easily be transferred: a cat is always going to look like a cat. However, the semantic meaning of the columns and the relations between them (linear/non-linear) are patterns that can hardly be reused outside the context of the problem. If the context and the column relations are the same (same data, but a different task), then we can fine-tune a pretrained tabular model.
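A minimal sketch of that last case (all shapes and weights below are illustrative): when the new task uses the same columns, the body’s weights can be reused as-is and only a new head needs to be trained for the new target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Body pretrained on task A (say, a regression target) over 10 columns;
# task B uses the *same* 10 columns but a 2-class target (illustrative).
W_body = rng.normal(size=(10, 32))    # pretrained body: reused as-is
W_head_A = rng.normal(size=(32, 1))   # old head: discarded
W_head_B = rng.normal(size=(32, 2))   # new head: trained from scratch

def forward(x, W_body, W_head):
    # Body (linear + ReLU) followed by a task-specific linear head.
    return np.maximum(x @ W_body, 0.0) @ W_head

x = rng.normal(size=(5, 10))  # 5 rows from the new dataset
print(forward(x, W_body, W_head_B).shape)  # (5, 2)
```

Because the input schema is unchanged, the body slots straight in; everything task-specific lives in the head.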


Thanks @dhruv.metha and @dangraf, it makes much more sense now.

Hi Richie

Perhaps if the data were standardised it would be possible to use pre-trained models. Consider company financial reporting (balance sheet, income statement, and statement of cash flows), which is common across all companies. It might be possible to use a pre-trained model to produce an output which says a company is a good investment.

Regards Conwyn