Most of what I see in the medical machine learning literature uses deep learning, such as Google’s retinal fundus image classification paper published in 2018. When should I use random forests for machine learning tasks instead?
I also notice that these papers boast large datasets of tens of thousands of examples, which I probably won’t have access to for my purposes. Right now, I have a medical machine learning task (surgery) with 300 examples and 40-50 patient descriptive variables/lab tests, before even including the vast amounts of time-indexed data on patient biometric variables. Should I use an LSTM for this task, a random forest, or a mix of both?
Hi James - thanks for posting!
For medical image classification, deep nets are clearly the way to go. But that’s because they can discover better image features than we can usually create. For a small tabular dataset (300 examples x 50 lab tests), if the features are already informative, many things should work well.
I’d start with something simple like logistic regression or naive Bayes with feature selection, and get a cross-validated performance baseline. A random forest should be able to beat that. I had good luck learning Bayes nets on the cardiovascular data - but I had many more cases.
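To make that concrete, here is a minimal sketch of the comparison in scikit-learn. The data here is random stand-in data with the shapes from your post (300 rows, ~50 features); swap in your real lab-test matrix and outcome.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the real dataset: 300 patients x 50 lab tests, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)

# Simple baseline: scaled logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Challenger: random forest.
forest = RandomForestClassifier(n_estimators=500, random_state=0)

for name, model in [("logreg", baseline), ("forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only 300 examples the cross-validated spread matters as much as the mean, so report both.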
You say you also have time-indexed data for the features. An LSTM is a natural choice since it learns the decay window. But if only the last few time slices are relevant, a random forest or logistic regression may be able to match it for cheap. I don’t have as much experience there.
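The cheap version of that idea is just to flatten the last few time slices into extra columns and feed them to the tabular model. A sketch, with made-up shapes (300 patients, 200 time steps, 8 biometric channels):

```python
import numpy as np

# Hypothetical time-indexed data: (patients, time steps, channels).
series = np.random.default_rng(1).normal(size=(300, 200, 8))

k = 5  # keep only the last k time slices
flat = series[:, -k:, :].reshape(len(series), -1)

print(flat.shape)  # (300, 40): k * 8 new columns per patient
```

You can then concatenate `flat` with the static lab-test features and reuse the same cross-validation comparison. If the LSTM can’t beat that, the extra history probably isn’t carrying much signal.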
Thanks Charles! At what number of examples would deep learning be more useful than the simple techniques? Or is this mostly a trial & error process?
I don’t know. Below I say how I think about it, but I haven’t tried FastAI enough to say.
With images, it took millions to train the base classifiers, but now you can start with those, and quickly teach it to discriminate x from y. I think language is playing out similarly?
Deep nets are good because they learn features. Neural nets are good because they can make complicated decision boundaries. The trick is to avoid overfitting.
FastAI seems to do that pretty well. So by all means try it out. Me, I’d compare it to a baseline. Maybe someone here already has.
I’ve been doing some research with neural networks on tabular data, and in some cases the results are quite impressive. I haven’t worked with time-series data yet, but from following the Time-Series subgroup we have on here, it looks promising, and I’m hoping to mess around with a dataset soon just to see how powerful it can be. What makes neural networks stand out against RF is the entity embeddings we can provide to the model, which can help a great deal.
My advice would be to do both if possible, for instance on a small subset of the data, and see which works better for your particular problem. Then move on from there. I love NNs and they can be strong, but limiting yourself to one option can be a pitfall.
Thanks @muellerzr ! How would embeddings that you’ve mentioned help us (i.e. entity embeddings mentioned in the Rossmann competition)?
They can take your model much further in some cases. For instance, say I have a categorical variable with n possible values. The embedding is an n-by-600 matrix, so every one of those values gets its own learned 600-dimensional vector. In my experience, having these embeddings helps a lot with the accuracy of the models.
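The mechanics are just a lookup table indexed by category code, which is what `nn.Embedding` does under the hood in PyTorch/fastai. A toy NumPy sketch (the sizes here are illustrative, not the 600 from the post):

```python
import numpy as np

# Toy entity embedding: n category values, each mapped to a dim-vector.
n, dim = 10, 6
rng = np.random.default_rng(0)
embedding = rng.normal(size=(n, dim))  # one learnable row per value

# A column of raw category codes, e.g. store IDs in Rossmann.
categories = np.array([3, 3, 7, 0])
vectors = embedding[categories]  # lookup: one row per example

print(vectors.shape)  # (4, 6)
```

During training the rows get updated by gradient descent, so values that behave similarly end up with similar vectors - that is what a random forest’s one-hot or ordinal encoding can’t give you.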