I’m a bit confused about this concept. In ML lesson 1, Jeremy explains that the curse of dimensionality is a stupid concept. But after watching lesson 3, where @jeremy shows that removing columns actually improves our models by removing some cardinality, I got lost.
I thought about the curse of dimensionality as follows (which is now obviously flawed):
The more columns/features you add to your dataset, the better, as this new “meta-data” won’t have any negative impact on the model’s predictive performance. The only price to pay is needing more compute power to process the data.
But now I can clearly see from lesson 3 that having too many features that don’t matter much relative to the dependent variable actually reduces the model’s (or at least an RF’s) predictive power. So what should I think about it? Was I completely wrong about what the curse of dimensionality means? Should I care about this only for RF?
I might be completely wrong on this but maybe:
Curse of dimensionality - past some point, adding another dimension is problematic regardless of what information it adds: the data becomes sparse, everywhere we look there is empty space, etc.
Removing data that doesn’t add much value (or is effectively noise) - our algorithm has limited capacity to process data. We can either throw some good stuff at it along with some rubbish and have it waste cycles trying to fit noise along with meaningful signal, or we can zoom in on the good part, throw away the rubbish, and have it focus its capacity on processing the signal.
I have no clue if that is what is happening here, though. Very interesting question, @ekami! Quite looking forward to the real answer now myself.
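To make the sparsity point above a bit more concrete, here is a minimal sketch using only the Python standard library. It assumes points drawn uniformly at random from the unit hypercube (an illustrative assumption, not data from the course), and the helper name `nearest_to_farthest_ratio` is made up for this example. It compares the distance from a query point to its nearest and farthest neighbours as the number of dimensions grows.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=500, seed=0):
    """Return (nearest distance) / (farthest distance) from one random query
    point to n_points random points, all drawn uniformly from [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)


# A ratio close to 1 means the "nearest" point is barely nearer than the
# farthest one, i.e. raw Euclidean distance carries little information.
ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(f"dim={dim:4d}  nearest/farthest = {ratio:.3f}")
```

The ratio should climb toward 1 as `dim` increases. Note that this only describes unstructured, uniformly random coordinates measured with plain Euclidean distance; it says nothing by itself about how a tree ensemble or a neural net behaves on real data.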
(Just to clarify: this discussion is about the machine learning course. In the future, we should try to keep these as replies in the machine learning discussion thread.)
The curse of dimensionality isn’t meaningful in practice because our space isn’t just a bunch of meaningless Cartesian coordinates. We create structure, using trees, neural nets, etc. We regularize using bagging, weight decay, dropout, etc. As a result, we find that we can actually add lots of columns without seeing problems in practice.
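One rough way to check this for yourself is to fit a random forest with and without a pile of pure-noise columns and compare validation scores. The sketch below assumes scikit-learn and NumPy are installed and uses a synthetic Friedman-style target; the data, column counts, and the helper `holdout_r2` are illustrative assumptions, not anything taken from the lessons.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows, n_signal, n_noise = 2000, 5, 50

# Synthetic data (assumption): 5 informative columns drive the target.
X_signal = rng.uniform(size=(n_rows, n_signal))
y = (
    10 * np.sin(np.pi * X_signal[:, 0] * X_signal[:, 1])
    + 20 * (X_signal[:, 2] - 0.5) ** 2
    + 10 * X_signal[:, 3]
    + 5 * X_signal[:, 4]
    + rng.normal(scale=1.0, size=n_rows)
)

# 50 extra columns that are unrelated to the target.
X_noise = rng.uniform(size=(n_rows, n_noise))
X_all = np.hstack([X_signal, X_noise])


def holdout_r2(X, y):
    """Fit a random forest and return R^2 on a held-out 25% split."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_valid, y_valid)


r2_signal = holdout_r2(X_signal, y)  # informative columns only
r2_all = holdout_r2(X_all, y)        # informative + noise columns
print(f"R^2, 5 signal columns:          {r2_signal:.3f}")
print(f"R^2, 5 signal + 50 noise cols:  {r2_all:.3f}")
```

On data like this the two scores would be expected to land fairly close together, with the noisy version somewhat lower. That fits both observations in this thread: the forest copes with many extra columns, yet dropping unimportant ones can still give a small improvement. How large the gap is will depend on the dataset and the forest's settings.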
Moved the discussion over to the ML thread.