Deep Learning and Data Science?!

vishak · September 3, 2019, 6:13am

Hey Guys,

I’m up to Lesson 4 in Part 1 and I’ve been enjoying it a lot. But I have a question. The data scientists who work at my place barely know any DL. They generally work with tabular data and use traditional ML methods.

So my question is: what, other than the stuff I’m learning in fastai, would make me suitable for a data science role? And in what way is that role different from DL roles?

There are a lot of articles on the internet, but I wasn’t convinced by any of them. I’d like to hear about it from the fastai community.

jianshen92 · September 4, 2019, 4:50pm

Using deep learning on tabular data suffers from an interpretability problem, i.e. you can’t really explain why it works, or why it doesn’t.

Deep learning is more applicable to data with very high-dimensional features, such as images and text, since we can draw our own conclusions from looking at a picture or a string of text and “try” to explain why a certain strategy works or doesn’t.

muellerzr · September 4, 2019, 4:54pm

I beg to differ there. We have options like feature importance to help explain some aspects of it, such as why the model is looking at particular fields. In my research this has had a logical answer most of the time, and if not, after some creative thinking we can discern the answer too.

jianshen92 · September 4, 2019, 5:08pm

@muellerzr I’m not a formal data scientist yet; the interpretability issue is a sentiment I heard from a couple of experienced data scientists that I met. All of their arguments revolve around neural networks being unpredictable, which makes them hard to justify to a business user. When something goes wrong, you don’t have an exact idea of how to fix it.

It might be conservative thinking, considering that the region I am in is not at the forefront of data science. What do you think?

muellerzr · September 4, 2019, 5:15pm

In terms of “what went wrong,” I’ve found most of it comes down to the data. Was the data collected okay? Are you preprocessing it properly? Are there underlying biases? If none of those is the issue, I compare what I am doing with how other similar problems have been approached. NNs perform roughly the same as other methods, if not a little bit better in a few cases, so that comparison helps narrow down where I may have gone wrong.

In terms of unpredictability, in what context? Do you mean what the expected outcome should be? If so, it falls on the researcher/data scientist to set realistic expectations about the methods and what their results look like right now.

(I am nowhere near an expert; I have just been doing research with fastai for six months, with tabular specifically.)

vishak · September 4, 2019, 5:26pm

I think what he’s talking about is something like: why is feature x so important to the outcome prediction?
With simpler models we can come up with quicker/easier-to-explain correlations. With NNs you don’t know why feature x is so important.
I think that’s what he’s trying to convey.

muellerzr · September 4, 2019, 5:27pm

And that’s what feature importance explains. See permutation importance; another user on the forum and I had a very long discussion on how best to go about it, and there are quite a few methods. With tabular data, our “features” are each of our variables, so if we modify them we can see which ones the NN is fitting to/relying on the most.

jianshen92 · September 4, 2019, 5:40pm

I believe it is mostly about how the expected outcome varies as the input changes. With statistical methods this can be interpreted mathematically, while with a NN you kinda only have the weights and activations, which don’t really mean anything.

And I also think that one of the reasons statistical methods are still popular among data scientists is that NNs have yet to outperform them significantly on tabular data.

muellerzr · September 4, 2019, 5:49pm

You bring up a very valid point. Initially I struggled with explaining this to people who use RF and other techniques. I believe it can be interpreted mathematically, though? See Jeremy’s matrix calculus for deep learning, as they all fit to a line. If that’s not quite answering your question, what kind of explanations are we talking about?

On the performance point, I’ve outperformed random forests on a few research projects of mine, which will be published later this year. Along with this, see this, which won a Kaggle competition.

Is it the most popular? No, certainly not. Will that change? It very easily could; just look at the Time-Series megathread for tons of examples where people have been applying it.

That is my $.02 though.

jianshen92 · September 4, 2019, 6:13pm

Just sharing what I’ve gathered from people in the industry: they think even random forests might be too hard to interpret. But I believe Jeremy did a very good job in his ML course of making RF as interpretable as possible.

It would be nice to hear from fastai peeps who are working in the industry.

I am not in a position to take a clear stand on whether NNs will take over from statistical methods, but I certainly hope they will!

muellerzr · September 4, 2019, 6:14pm

Oh wow! I find that a scary surprise! But I agree, the intro to ML helps a ton with that

jianshen92 · September 12, 2019, 2:54pm

@muellerzr do you have any resources to read about the feature significance in NNs that you were talking about?

muellerzr · September 12, 2019, 3:03pm

There is one ^ but the basic example for a tabular model is like so:

Say we have 34 variables that together give my model an accuracy of 93%. I want to look at which variables it is using and how. I can do so through a technique called permutation importance. In it, we run our fully trained model against some unseen data N+1 times, where N is the number of variables: once unshuffled for a ground-truth baseline, and then N more times, each time shuffling (permuting) one column and seeing how much of an impact it made. Through this we can quantify how valuable a variable was by the change in accuracy. Does this make sense @jianshen92?
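
For anyone who wants to see the idea in code, here is a minimal sketch of the procedure described above. It is plain Python, not fastai's API: the helper names (`accuracy`, `permutation_importance`) and the toy "model" that only looks at column 0 are assumptions made up for illustration, standing in for a real trained tabular model and a held-out set.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(rows)

def permutation_importance(predict, rows, labels, seed=0):
    """Return (baseline accuracy, list of accuracy drops, one per column).

    Runs the model N+1 times: once on the untouched data, then once per
    column with only that column shuffled.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)  # the unshuffled run
    drops = []
    for col in range(len(rows[0])):
        # Shuffle a copy of this one column, leaving all others untouched.
        shuffled = [row[col] for row in rows]
        rng.shuffle(shuffled)
        permuted = [row[:col] + [value] + row[col + 1:]
                    for row, value in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return baseline, drops

# Hypothetical stand-in for a trained model: it relies on column 0 only.
def toy_predict(row):
    return int(row[0] > 0.5)

data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(500)]
labels = [int(row[0] > 0.5) for row in rows]

baseline, drops = permutation_importance(toy_predict, rows, labels)
for col, drop in enumerate(drops):
    print(f"column {col}: accuracy drop {drop:.3f}")
```

In this toy setup only column 0 should show a noticeable drop, since the stand-in model ignores the other two columns. With a real model the drops are noisier, so it can help to repeat the shuffle a few times per column and average.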

So now we have four sets: our actual test set (10%), a feature importance set (10%), a validation set (20%), and our training set (60%).
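
One way to carve out those four sets is to shuffle the row indices once and slice them. This is only a sketch under assumptions: `four_way_split` is a hypothetical helper, it does a purely random split, and it is not how fastai builds its splits. Time-ordered or grouped data would need a different strategy.

```python
import random

def four_way_split(n_rows, seed=0):
    """Shuffle row indices and slice them 60/20/10/10.

    Returns (train, valid, importance, test) index lists.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_rows * 60 // 100
    n_valid = n_rows * 20 // 100
    n_importance = n_rows * 10 // 100

    valid_end = n_train + n_valid
    importance_end = valid_end + n_importance

    train = indices[:n_train]
    valid = indices[n_train:valid_end]
    importance = indices[valid_end:importance_end]
    test = indices[importance_end:]  # whatever remains, about 10%
    return train, valid, importance, test

train_idx, valid_idx, importance_idx, test_idx = four_way_split(1000)
print(len(train_idx), len(valid_idx), len(importance_idx), len(test_idx))
```

The feature importance set is kept separate from the test set so that the shuffling experiments never touch the data used for the final accuracy number.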