Feature selection in deep learning models on structured data

In lecture 5 of part 1, while building the collaborative filtering model for movies, Jeremy did not use features like timestamp and movie genre. My question is: when building deep learning models on a structured dataset, how do we decide which features to keep and which to reject?

This is a very good question :slight_smile: Might you be asking because you are taking part in a Kaggle competition? There is some really good information on the forums there. But I feel like a crucial component is missing and I cannot quite put my finger on it.

How can removing a feature help? It can help if the relationship the model learns between it and the target doesn’t generalize well to the test set. This can happen if the signal contained in the feature is so low that it effectively resembles noise. But how can we tell whether that is the case? In tree-based models, maybe there is no correlation between the feature and the target on its own, but once we first split the data on some other feature, this feature comes into play and suddenly becomes useful?

Another scenario where I can imagine a feature hurting performance is when there is something about the feature, some granularity, that allows fitting the train set very well, picking up relationships that do not carry over to the test set, so that we end up overfitting to the train set via that feature. But can such a scenario really happen? And if yes, does this mean we can try to salvage the value of the feature by transforming it, potentially adding noise, or do we need to throw it out?

I don’t know much about this, but just thinking about it seems to be helpful. I would love any piece of literature on this that goes beyond ‘throw out uncorrelated features’, etc. The things I have been thinking about in this context lately are generating features via genetic programming, denoising autoencoders, and feature importance via permutations of a feature. Reading about any of these helps to build intuition, but unfortunately I am very far from being able to give a more decisive answer to your question.
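For what it’s worth, permutation importance is the simplest of those ideas to try: shuffle one column at a time and see how much a validation metric drops. Here is a minimal sketch with placeholder data and a placeholder model (synthetic data and a random forest, nothing from the lecture):

```python
# Minimal permutation-importance sketch: shuffle one feature at a time on the
# validation set and measure how much accuracy drops. Data/model are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3, random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_trn, y_trn)
baseline = accuracy_score(y_val, model.predict(X_val))

rng = np.random.default_rng(0)
for i in range(X_val.shape[1]):
    X_perm = X_val.copy()
    rng.shuffle(X_perm[:, i])  # break the feature/target relationship for this column
    drop = baseline - accuracy_score(y_val, model.predict(X_perm))
    print(f"feature {i}: importance (accuracy drop) = {drop:.4f}")
```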

I may have gone overboard with my answer. If the question is simply why we include certain features in a deep learning model, then the best general answer is to keep whatever would be useful to a human being making the same distinction. If we feel that having a timestamp could help us, we should include it. But if we feel the timestamp is not relevant, and that how the person rated other movies is, we should go for that instead. This is just one way to look at it, but a very helpful one.

Sorry if this does not answer your question, but maybe some of it is still useful.


Old post, but it’s a good question worthy of an answer, and it’s something that a lot of people ask. At work we are often presented with datasets containing thousands of variables, but even for medium-sized datasets I wouldn’t recommend just throwing all the variables into the model, for a few reasons:

  • too many features can cause overfitting
  • can make it harder for your model to converge
  • monitoring hundreds of variables once your model is in production is impractical
  • performance - we often need inference on a single record to take fractions of a second, and if we’re getting predictions for millions of records it soon adds up, particularly when exporting the model somewhere with no GPU available

Use univariate importance metrics as a starting point: for a binary outcome use information value, or for a continuous outcome, R-squared; there are plenty of other options. This relies on some kind of classing of your variables, so split continuous variables into quantiles, and for categoricals create dummy variables for any level with a decent amount of volume (say >5%) and throw everything else into an ‘Other’ bucket. This might sound like a ball ache, but it’s just a case of writing a utility to do it, and you’ll be able to re-use it for all of your projects.
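A rough sketch of the kind of utility described above, for the binary-outcome case: quantile-bin continuous variables, bucket rare categorical levels into ‘Other’, and compute information value against a binary target. The column names, the 10 quantiles and the 5% threshold below are just illustrative assumptions:

```python
import numpy as np
import pandas as pd

def bin_feature(s: pd.Series, n_bins: int = 10, min_share: float = 0.05) -> pd.Series:
    """Quantile-bin numerics; keep only categorical levels with share >= min_share."""
    if pd.api.types.is_numeric_dtype(s):
        return pd.qcut(s, q=n_bins, duplicates="drop").astype(str)
    s = s.astype(str)
    share = s.value_counts(normalize=True)
    keep = share[share >= min_share].index
    return s.where(s.isin(keep), "Other")

def information_value(feature: pd.Series, target: pd.Series) -> float:
    """Information value of one classed feature against a 0/1 target."""
    df = pd.DataFrame({"bin": bin_feature(feature), "y": target})
    grp = df.groupby("bin")["y"].agg(events="sum", total="count")
    grp["non_events"] = grp["total"] - grp["events"]
    # replace zero counts with 0.5 to avoid division by zero / log of zero
    pct_event = grp["events"].clip(lower=0.5) / grp["events"].sum()
    pct_non = grp["non_events"].clip(lower=0.5) / grp["non_events"].sum()
    woe = np.log(pct_event / pct_non)
    return float(((pct_event - pct_non) * woe).sum())

# Example first pass over a hypothetical dataframe `df` with binary target column 'y':
# ivs = {c: information_value(df[c], df["y"]) for c in df.columns if c != "y"}
```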

Once you’ve done a first pass, throw away anything really weak; also, where variables are highly correlated with each other, keep the stronger one (at the risk of losing something that has poor univariate importance but a strong interaction with another variable).

Then build a model with everything in. Once you have a trained model you can use a technique to figure out the marginal importance of each variable. Basically, pass your training set through your trained model and generate predictions. Then freeze the entire dataset and change just one column at a time, setting it all to the mean (for continuous) or the mode (for categorical). Measuring the difference between the predictions before and after lets you figure out how much each variable is contributing, i.e. if the predictions don’t change much then it’s not adding much. This allows you to do a second pass and throw away another batch of weak variables.
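A minimal sketch of that second pass, assuming the trained model exposes a scikit-learn style predict_proba and accepts a pandas DataFrame (names are illustrative):

```python
import numpy as np
import pandas as pd

def marginal_importance(model, X: pd.DataFrame) -> pd.Series:
    """For each column, overwrite it with its mean (numeric) or mode (categorical)
    and measure the mean absolute shift in the model's predictions."""
    baseline = model.predict_proba(X)[:, 1]
    scores = {}
    for col in X.columns:
        X_mod = X.copy()
        if pd.api.types.is_numeric_dtype(X_mod[col]):
            X_mod[col] = X_mod[col].mean()
        else:
            X_mod[col] = X_mod[col].mode().iloc[0]
        preds = model.predict_proba(X_mod)[:, 1]
        scores[col] = float(np.abs(preds - baseline).mean())
    return pd.Series(scores).sort_values(ascending=False)

# importance = marginal_importance(trained_model, X_train)
# weak = importance[importance < some_threshold].index  # candidates to drop
```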

I generally find that throwing away weak variables barely hurts the overall performance of the model. So if practicality is an issue, as it often is in production environments, then this is the routine I would follow.