Interesting paper using convolutional networks and transfer learning on tabular data.
This is absolutely fascinating (and really well written)! I’m going to have to read it through a few more times, but this sounds like a fun project to develop. I did a (very) quick Google search and came up blank - do you know if there’s a code example floating around out there?
Correct me if I’m wrong, but did they not just take the important variables they wanted to use, chuck them into a picture, and classify the picture? I’ve been slowly reading it for the past few weeks, trying to make sense of it in my head, and I want to be sure I’m reading that right.
I think I may be able to do an implementation of this. I’ll try it on the Adult dataset and I’ll post my results when I have them.
That’s how I’m reading it.
Awesome. I’m starting it now. I’ll post my results here and in a separate thread, and I’ll say whether I manage to beat Jeremy’s score.
The use of images for tabular classification is discussed extensively in the time series analysis group here:
There is a whole library here dedicated to time series to image transformations:
I’m pretty skeptical of the authors’ approach because they don’t provide enough details to replicate the image creation process and they don’t provide code. They allude to using feature importance to govern the font size used in the image creation, but I don’t see any way to reproduce their results from their paper.
Considering the transformation of tabular data to images is the key innovation in the paper, this is the only discussion of how this is done:
Algorithm 2 SuperTML EF: SuperTML method with Equal Fontsize for embedding.
Input: Tabular data training set
Parameter: Imagesize of the generated SuperTML images
Output: Finetuned CNN model
1: for each sample in the tabular data do
2: for each feature of the sample do
3: Draw the feature in the same fontsize without overlapping, such that the total features of the sample will occupy the imagesize as much as possible.
4: end for
5: end for
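For anyone who wants to experiment, here is a minimal sketch of one way the drawing loop in Algorithm 2 could be read. It is not the authors' code, which has not been released. It assumes Pillow is installed; the helper names (`load_font`, `grid_layout`, `row_to_image`), the square grid layout, the `DejaVuSans.ttf` font, and the glyph-width heuristic are all my own assumptions, since the paper does not specify them. The sample values are only illustrative, in the style of an Adult dataset row.

```python
import math
from PIL import Image, ImageDraw, ImageFont


def load_font(size):
    """Try a scalable font; fall back to Pillow's fixed-size built-in font."""
    try:
        # Assumption: DejaVuSans.ttf is available on this machine.
        return ImageFont.truetype("DejaVuSans.ttf", size)
    except (OSError, ImportError):
        return ImageFont.load_default()


def grid_layout(n_features, image_size):
    """Split a square image into a grid with one cell per feature.

    Assumes n_features >= 1. Returns the top-left corner of each cell
    and the (width, height) shared by all cells.
    """
    cols = math.ceil(math.sqrt(n_features))
    rows = math.ceil(n_features / cols)
    cell_w, cell_h = image_size // cols, image_size // rows
    origins = [((i % cols) * cell_w, (i // cols) * cell_h)
               for i in range(n_features)]
    return origins, (cell_w, cell_h)


def row_to_image(values, image_size=224):
    """Draw every feature of one sample as text, all in the same font size."""
    texts = [str(v) for v in values]
    origins, (cell_w, cell_h) = grid_layout(len(texts), image_size)
    longest = max(1, max(len(t) for t in texts))
    # Rough heuristic (an assumption): a glyph is about 0.6 x font size wide,
    # so the longest string should roughly fit inside one cell.
    font_size = max(1, min(int(cell_h * 0.8), int(cell_w / (0.6 * longest))))
    font = load_font(font_size)

    img = Image.new("RGB", (image_size, image_size), "black")
    draw = ImageDraw.Draw(img)
    for text, (x, y) in zip(texts, origins):
        draw.text((x, y), text, fill="white", font=font)
    return img


# Illustrative values only.
sample = [39, "State-gov", 77516, "Bachelors", 13]
image = row_to_image(sample)
```

The resulting 224x224 images could then be fed to a pretrained CNN for fine-tuning, which is the part the paper describes as the output of the algorithm. Whether this layout matches what the authors actually did is exactly the detail that is missing from the paper.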
They allude to using feature importance to govern the font size used in the image creation, but I don’t see any way to reproduce their results from their paper.
In fairness, they do mention that the version using feature importance to govern font size wasn’t any more predictive than the version without. I’m having a hard time conceptualizing why this would work better than the traditional approach to tabular data, but I’m hopeful that I (or more likely a better programmer like @muellerzr) will be able to prove or disprove it.
I found this confusing as well. I’m trying my best to recreate what they describe as closely as possible. It looks like the bottom two rows essentially turn into 4x4 boxes of text, if you will. However, I ran into issues with their feature selection and choices. Because of this, I may deviate slightly from the paper in that regard, along with a few others.
I share @whamp’s skepticism about this paper - at the beginning it even seemed to me like a kind of joke. Don’t get me wrong - the idea of converting tabular data to images and using pre-trained models to classify them is very interesting and promising. However, the conversion to images that they propose does not make any sense to me. Why convert nice numeric data, which can be (relatively) easily used by a model, to a bunch of noisy letters and digits, and force the model first to understand these arbitrary characters and then predict an answer?
What if the letters were converted to Hebrew or Russian characters? The model should still work, since it doesn’t understand English any better than these languages. So if it works, it means that the model first has to understand the representation of a language - a hard task indeed - and then solve the original task. In a similar way, using a different font, or color, or whatever should also work, and that means that the model must acquire very high-level knowledge about the world.
A simpler conversion idea, in my view, would be to map each feature value to a differently colored pixel in the image (and, if temporal data is involved, use time as one of the image dimensions).
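To make the pixel idea concrete, here is a minimal sketch assuming NumPy and purely numeric features (categorical columns would need to be encoded as numbers first). The function name `features_to_pixels`, the min-max scaling, the blue-to-red color ramp, and the square grid layout are all assumptions chosen for illustration, not something taken from the paper.

```python
import numpy as np


def features_to_pixels(X):
    """Turn each row of a numeric table into a small RGB image.

    X has shape (n_samples, n_features). Each feature becomes one pixel,
    colored from blue (column minimum) to red (column maximum). Returns
    a uint8 array of shape (n_samples, side, side, 3).
    """
    X = np.asarray(X, dtype=np.float64)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Avoid division by zero for constant columns.
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = (X - lo) / span  # every value now lies in [0, 1]

    # Simple two-color ramp: red channel rises, blue channel falls.
    rgb = np.stack([scaled, np.zeros_like(scaled), 1.0 - scaled], axis=-1)
    rgb = (rgb * 255).round().astype(np.uint8)

    # Lay the features out row by row on the smallest square grid that
    # holds them; unused pixels stay black.
    n, d = X.shape
    side = int(np.ceil(np.sqrt(d)))
    canvas = np.zeros((n, side * side, 3), dtype=np.uint8)
    canvas[:, :d] = rgb
    return canvas.reshape(n, side, side, 3)


# Three samples with two features each (made-up numbers).
X = np.array([[0.0, 10.0],
              [1.0, 30.0],
              [2.0, 20.0]])
images = features_to_pixels(X)
```

For temporal data, the same scaling could be applied with time steps as image rows and features as columns instead of the square grid. The images would be tiny, so they would probably need upscaling before going into a pretrained CNN.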
Also, the paper’s general writing level is pretty poor, with many typos (not that I write so well myself, but I expect a certain standard from a published paper). They also cite outdated information from 2015-2016 in a 2019 paper, for example the claim that XGBoost is the winning model for every structured-data competition on Kaggle, which was correct in 2016 but I don’t think is true in 2019.
And to end on a positive note - thanks (Will) for the links to the time series discussions on the forum - they are very enriching!