Should DL always outperform "classic" methods?

I know the rule of thumb is that no single model universally outperforms every other model on every problem. What I find interesting is that neural networks can, in principle, approximate any function, no matter how complex or simple.

Intuitively it seems like neural networks may have a strong tendency to overfit in scenarios where the relationship between the independent and dependent variables is relatively simple. Classic methods like linear or logistic regression are probably easier to tune in these scenarios, and may perform well with less work. That being said – it seems like a deep learning model should be able to get great performance with a proper architecture / tuning, even if it takes more effort.

Should DL models always be preferred in situations where the complexity of the relationship between dependent and independent variables is unknown? How do DL models fit into the ML toolkit alongside simpler models? Is it bad practice to jump straight to a neural net when first tackling a problem, as opposed to trying something simple first?

In my experience, traditional ML requires the data scientist to do the feature engineering, so a large amount of time is spent cleaning and transforming data and comparing features statistically. DL, by contrast, shifts the effort toward gradient descent and finding a network architecture for which that optimization is feasible.

I think a good rule of thumb is that if you can confidently tell the machine what features to look at, then by all means do so (e.g. LR, Decision Trees). Sometimes, I have no idea where to start and have found that simple NNs can perform surprisingly well.

Interesting @twairball – that rationale makes sense to me. One of the things that inspired this question was a problem I was facing at work, in which I wanted to predict the revenue generated by a customer over the next twelve months. It’s a regression problem on structured data (customer tenure, country, historical engagement, etc.). Interestingly, I was able to get pretty good performance out of a Random Forest, but couldn’t seem to match that performance with a neural network.

I seemed to be under-fitting, but adding more features slowed things down too much to run the network on my MacBook (I don’t have access to a GPU, and since I’m dealing with proprietary data, I can’t spin up a GPU machine on AWS / GCP / Azure). I did try letting things run for a while in the background with larger network architectures, but still couldn’t outperform my RF with a neural network. I’m wondering if a NN is just not well-suited to my problem (which seems unlikely to me, given that NNs can theoretically fit any function) or if there’s just something I’m missing (which seems more likely).
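To make that concrete, here’s a rough sketch of the kind of comparison I mean, using scikit-learn on synthetic data – the real data is proprietary, so the dataset shape and sizes here are entirely made up:

```python
# Hypothetical comparison: Random Forest vs. a small MLP on synthetic
# tabular regression data (a stand-in for the proprietary dataset).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# MLPs are sensitive to feature scaling, unlike tree ensembles.
scaler = StandardScaler().fit(X_train)
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)

print("RF  R^2:", r2_score(y_test, rf.predict(X_test)))
print("MLP R^2:", r2_score(y_test, mlp.predict(scaler.transform(X_test))))
```

On real structured data the gap can go either way, which is exactly the question.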

I think the key problem with NNs is training them.

It has been shown that deep neural nets are easier to train than shallow ones (on the same task), even if the shallow network is perfectly capable of representing the exact same function the deep network has learned (since the deep network contains a lot of redundancy).

So while an NN can approximate any function (to any desired precision), it needs to have enough neurons to be able to represent that function – but it also needs even more neurons to be able to learn that function.

So maybe you just need (a lot) more neurons. :wink:

I’m not an expert by any means, but I like to start out using a classical method. It gives you a good baseline for comparisons with more advanced methods, and you might just find that the classical method is already good enough.
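For example, in scikit-learn a classical baseline is only a few lines – the dataset here is just a stand-in for whatever you’re working on:

```python
# Sketch: establish a cheap classical baseline before reaching for a deep net.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5)
print("Baseline accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Anything fancier now has a number to beat.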

Great points by people here.
I would like to add that one reason neural networks may be poor for structured data is that NNs typically require a lot of data to train on.
Most deep learning architectures have many layers, which means thousands of learnable parameters. This is likely why it’s harder to train neural networks on small datasets: they tend to just overfit.
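To see how quickly the parameters add up, here’s a rough count for a fully connected net – the layer sizes are just an example:

```python
# Each dense layer with n_in inputs and n_out outputs has
# n_in * n_out weights plus n_out biases.
def mlp_param_count(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# e.g. 20 input features, two hidden layers of 128, one output:
print(mlp_param_count([20, 128, 128, 1]))  # -> 19329
```

Nearly 20k parameters for a modest net, which is a lot to fit from a few thousand rows.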

A wild guess: for your categorical data, maybe one-hot encoding (or the lack of it) makes a difference? Basically I’m thinking the way you are feeding the inputs to the network might matter.
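For example, with pandas – the column names here are made up, just to show the shape of the transformation:

```python
# One-hot encode a categorical column before feeding it to a network.
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE", "US", "FR"],
                   "tenure_months": [12, 3, 40, 7]})
encoded = pd.get_dummies(df, columns=["country"])
print(encoded.columns.tolist())
# ['tenure_months', 'country_DE', 'country_FR', 'country_US']
```

Trees can often split on an integer-coded category just fine, but a net treats that integer as an ordered quantity, which may be part of the gap you’re seeing.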

Thanks for all the replies! Really interesting points here. I wonder if, with a bit more computing power, I could make the network deeper and get it to outperform the RF. It’s a bummer that I can’t share the data since it’s proprietary, but maybe I can simulate some similar data and we can see what it takes to get the NN to outperform the RF.
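Something like this might do as a starting point for the simulation – the feature names and the revenue formula are completely made up, just loosely modeled on my real problem:

```python
# Simulated tabular data with an interaction term, to test where an NN
# catches up to a Random Forest. All names and sizes are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
tenure = rng.uniform(0, 60, n)        # months as a customer
engagement = rng.uniform(0, 1, n)     # historical engagement score
country = rng.integers(0, 5, n)       # 5 fake countries, integer-coded

# Revenue with an interaction term, per-country offsets, and noise.
revenue = (50 + 2.0 * tenure * engagement
           + 10 * country + rng.normal(0, 5, n))
X = np.column_stack([tenure, engagement, country])
print(X.shape, revenue.shape)
```

With a shared synthetic dataset like this, anyone could try to tune an NN past the RF.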

Also, @machinethink – could you pass along a text or paper that explains the deep-vs-shallow bit that you mentioned? It’s something I’ve been thinking about – whether to make the network architecture deeper or wider to fit the function adequately. Would be excited to learn more!

Here’s one: Do Deep Nets Really Need to be Deep? by Lei Jimmy Ba and Rich Caruana (2013).