Success with Deep Learning on Structured Data?

I work at a company with lots of structured data (i.e. rows and columns). I’ve been trying for the last year plus to build state-of-the-art deep learning models on a few of these datasets. By state-of-the-art, I generally mean a model that outperforms a gradient-boosting machine implemented via XGBoost or LightGBM. I’ve tried implementations in both Keras+TF and now fastai+PyTorch, but so far I have not found a model that beats gradient-boosted machines.

I’m familiar with a couple of Kaggle competitions where deep learning has been effective, such as Rossmann and Porto Seguro. I’ve tried to replicate these solutions on my datasets, but no luck so far. So outside of those examples, can anyone share some success stories of deep learning on structured data? I’m trying to see whether these anecdotal Kaggle success stories are harbingers of a deep learning takeover for structured data problems, or whether they’re outliers relative to the vast number of models being built on structured data all over industry.


Hi Patrick,

My industry (AEC) almost exclusively generates structured data. I have run a lot of studies on what is essentially dummy data (parametric building model studies that I run myself) with quite good success. The main difference I see is that I can get results about as good as with GBMs, but in a fraction of the time.

Of course, our datasets are not what you’d call ‘big data’, since we are limited by our cost functions (an energy simulation for a 4x4 room takes 30 seconds, which makes it hard to run 1M alternatives). But still, on a dataset with 40k rows (models) I trained a model that was nearly as good as an averaged GBM (meaning an average of 8 GBM models) in about 2 minutes of training time. I haven’t yet tried it on anything close to big data, but I would guess scaling up would favor the deep learning approach? Maybe you have some experience with that.

I should say that our datasets are kind of perfect for this approach. Most, if not all, of our variables are categorical to begin with, and the landscape itself is quite smooth in most cases. Of course, that’s an advantage for both methods, I’d guess. I should also note that I got the best results with the TrainingPhase API, which allows for much smarter learning rate schedules.

P.S.: Have you tried combining the two, i.e. using the embeddings from the fastai approach as inputs to GBMs? It might seem counter-productive, but I guess it’s a logical approach for high-cardinality categories?
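For anyone who wants to try the combination, here is a minimal sketch of the lookup step only. It assumes the network has already been trained and that each categorical column has a table mapping a level to its embedding vector (in practice, the rows of that column's embedding weight matrix). The helper names, the toy rows, and the random placeholder vectors are all hypothetical; they stand in for real learned weights.

```python
import random


def make_placeholder_table(levels, dim, seed=0):
    """Stand-in for a trained embedding matrix: one vector per category level.

    ASSUMPTION: real tables would be read from the trained network, not sampled.
    """
    rng = random.Random(seed)
    return {level: [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for level in sorted(set(levels))}


def embed_rows(rows, tables):
    """Expand categorical columns into embedding vectors; pass other columns through."""
    features = []
    for row in rows:
        vector = []
        for column, value in row.items():
            if column in tables:
                vector.extend(tables[column][value])  # categorical -> dense vector
            else:
                vector.append(float(value))           # continuous column kept as-is
        features.append(vector)
    return features


# Toy data: two categorical columns and one numeric column (all hypothetical).
rows = [
    {"store": "A", "weekday": "Mon", "promo": 1},
    {"store": "B", "weekday": "Tue", "promo": 0},
    {"store": "A", "weekday": "Tue", "promo": 1},
]
embedding_dims = {"store": 3, "weekday": 2}
tables = {
    column: make_placeholder_table([r[column] for r in rows], dim, seed=i)
    for i, (column, dim) in enumerate(embedding_dims.items())
}
features = embed_rows(rows, tables)
```

The resulting `features` matrix (here 3 rows by 6 columns) is what would be handed to the GBM in place of one-hot or label-encoded categories.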

Kind regards,


Thanks for sharing. I’m happy to hear you’ve found some success. A couple reactions to your story:

  • My work tends to be on datasets with a noisy response variable. By noisy I mean there’s just a lot of process variance, which makes it inherently difficult to predict the target. It appears that your modeling problems have a high signal-to-noise ratio, so perhaps that’s driving the differences in our successes.

  • I’m pleasantly surprised to hear that you can build accurate models on such small datasets. Again, I wonder if that’s because the data has a lot of signal and, like you say, a smooth loss surface. You’re right in assuming that the datasets I work with are much larger. They can be as large as tens of millions of records, but I typically downsample to reduce the computational burden.

  • I haven’t actually tried the TrainingPhase API yet. I will need to go back to the MOOC to find where Jeremy introduced it and then take a look at the code that Sylvain wrote. Thanks for the suggestion.

  • I have tried a few different deep learning feature engineering experiments. I was excited by the Rossmann paper, which pointed out that the embeddings also improved performance in non-deep-learning algorithms. I’ve seen some slight evidence that this technique works on the types of data in my industry, but the results are not nearly as convincing as for Rossmann. I’ve also tried using the intermediate activations from a deep network as features for a GBM, under the assumption that intermediate activations serve as large, convoluted interaction terms. Finally, I’ve tried a denoising autoencoder for automated feature detection, similar to what Michael Jahrer did in his winning Porto Seguro Kaggle solution. These last two have yet to bear fruit, but that could be my own fault and not the fault of the technique.
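On the denoising autoencoder point, the corruption step is often the part that differs most from image-style DAEs. As far as I know, the noise usually associated with that Porto Seguro solution is "swap noise", where a corrupted cell is replaced by the value of the same column taken from another random row. Below is a rough sketch of that idea only. The function name, the 0.15 rate, and the toy table are assumptions for illustration, not the original code.

```python
import random


def swap_noise(rows, rate=0.15, seed=0):
    """Return a corrupted copy of a table (list of equal-length rows).

    Each cell is replaced, with probability `rate`, by the value found in the
    same column of a randomly chosen row. The input table is not modified.
    ASSUMPTION: `rate` is illustrative and would need tuning per dataset.
    """
    rng = random.Random(seed)
    n_rows = len(rows)
    noisy = [list(row) for row in rows]
    for i in range(n_rows):
        for j in range(len(rows[i])):
            if rng.random() < rate:
                donor = rng.randrange(n_rows)  # the donor may be the same row
                noisy[i][j] = rows[donor][j]
    return noisy


# Toy table with three columns (hypothetical data).
rows = [[i, 10 * i, i % 3] for i in range(20)]
noisy = swap_noise(rows, rate=0.15, seed=42)
```

The autoencoder would then be trained to reconstruct `rows` from `noisy`, and its hidden activations used as features. Because every corrupted value is drawn from the column's own empirical distribution, this works for mixed categorical and numeric columns without choosing a noise scale.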


I have used both LSTMs and embeddings for order input forecasting in multiple-channel distribution systems for technology products (e.g., routers), and also in market and price forecasting. I followed Jeremy’s model for embeddings but implemented it in Python 2.7 and Theano so that my clients would be more likely to be familiar with the methods. I recently posted the results of one study. The code is available on my GitHub (dgraham999) under Forecast. You are right that noise is the most difficult problem to overcome, while sparse data is second. I use a matrix forecasting method, on my GitHub under Sparse_Data, to replace missing data. I am also doing a structured data study for a research hospital using MEG data to diagnose neurological afflictions. That’s in the repository PD. I am just now starting with fastai so I can make these concepts work in faster production models.


Just as a note: in my experience, trees of any kind, boosted or not, generalize poorly in my business models. I use them for feature importance to guide management responses, but never to forecast a continuous variable such as order input. The upper and lower limits on a tree’s forecast of a continuous variable are the prior periods’ max and min. One of my purposes is to locate customers, dealers, offices, etc. that are expected to exceed those boundaries. Trees have good classification performance but generalize poorly into an unknown future for continuous targets. Good in the lab, not good in real forecasting.
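To illustrate the boundary point, here is a small self-contained sketch: a tiny one-feature regression tree fit to a steadily growing series. Everything in it (the function names, the made-up order numbers, the depth and leaf settings) is an assumption for illustration. It is a single tree rather than a boosted ensemble, though an ensemble of such trees is likewise flat beyond the range of its training inputs.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean (the split criterion)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(points, depth=3, min_leaf=2):
    """Fit a 1-D regression tree on (x, y) pairs sorted by x.

    Returns either a float (leaf = mean target) or (threshold, left, right).
    """
    ys = [y for _, y in points]
    if depth == 0 or len(points) < 2 * min_leaf:
        return mean(ys)
    best_cost, best_i = None, None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best_cost is None or cost < best_cost:
            best_cost, best_i = cost, i
    if best_i is None:
        return mean(ys)
    threshold = (points[best_i - 1][0] + points[best_i][0]) / 2
    return (threshold,
            fit_tree(points[:best_i], depth - 1, min_leaf),
            fit_tree(points[best_i:], depth - 1, min_leaf))


def predict(tree, x):
    """Walk down the tree until a leaf (a plain float) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


# Hypothetical history: order input grows by 10 units per period.
periods = list(range(20))
orders = [100 + 10 * t for t in periods]
tree = fit_tree(list(zip(periods, orders)))

forecast_25 = predict(tree, 25)  # a future period never seen in training
forecast_40 = predict(tree, 40)
```

Both future forecasts land in the same rightmost leaf, so they are identical and cannot exceed `max(orders)`, even though the underlying trend would put period 25 at 350 and period 40 at 500.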


Hi Patrick,
Same feeling about the lack of performance of DL models compared with the GBM approach.
I’ve experienced that in the current Home Credit Challenge on Kaggle. You can find some kernels with fastai.