We’ll be officially releasing a new course (tentatively) called Machine Learning For Coders soon (it’s not up on the website yet). It’s being recorded with the masters students at MSAN. I’ve decided to share the videos with you all as well, since for those of you that haven’t done any ML before, you might find this a helpful additional resource. It uses the same fastai library as our deep learning course, so your existing repo and AWS instance will work fine.
Here’s the videos - I’ll update this thread as they’re available:
One treat each day, you’re spoiling us!
Thanks a lot, it will be useful to see some of the basics again! I have a colleague who would be VERY interested in watching this: can we maybe watch these together?
Two doubts I have after watching the first video are:
1) When we impute the missing values, is it worthwhile to divide the rows into groups (like holidays and working days) and impute with the group median rather than the column median? Or does assuming that such a grouping is important introduce unnecessary bias?
2) When we impute every missing categorical variable with zero, aren’t we skewing the data away from its original distribution? Why aren’t we imputing with the most common value?
You could always go deeper into the data, figure out “local groups” and assign group medians to missing values. There’s nothing wrong with this strategy. It’s just your way of doing it and you can always use cross-validation to see what works.
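To make the group-median idea concrete, here is a minimal pandas sketch. The column names (`is_holiday`, `sales`) and the values are made up for illustration and are not from the course data.

```python
import numpy as np
import pandas as pd

# Toy data: the column names and values are illustrative assumptions.
df = pd.DataFrame({
    "is_holiday": [True, True, True, False, False, False],
    "sales": [10.0, np.nan, 14.0, 100.0, 120.0, np.nan],
})

# Column median: every missing row gets the same value (57.0 here).
column_filled = df["sales"].fillna(df["sales"].median())

# Group median: each missing row gets the median of its own group
# (12.0 for holidays, 110.0 for working days).
group_median = df.groupby("is_holiday")["sales"].transform("median")
group_filled = df["sales"].fillna(group_median)

print(column_filled.tolist())
print(group_filled.tolist())
```

Which of the two works better is an empirical question, so it is worth scoring both variants with cross-validation on your own data.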
The premise that you are skewing the data depends on the kind of model you use and the way your model treats the imputed value.
For example, if you use tree-based models, imputing with -1 is a commonly used strategy. One intuition for why it works is that “your model treats all these missing values as a separate level by itself”. It’s like one faulty machine in a large production line failing to record this variable: the values are missing in your dataset, and your model treats all of them as originating from one unit.
However, if you use linear-equation-based models (linear models / SVMs / neural networks), a mean / median / 0 based approach is preferred, since the optimization is strongly affected by the imputed values. Filling with the mean (or median) leaves the feature’s mean (or median) unchanged, so it distorts the feature far less than an out-of-range value like -1 would.
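A small sketch of the two strategies side by side, on a made-up feature (the values are assumptions for illustration only). It shows how much a -1 sentinel moves the feature's mean compared with a median fill.

```python
import numpy as np
import pandas as pd

# Toy feature with two missing entries (values are illustrative).
feature = pd.Series([3.0, np.nan, 5.0, 4.0, np.nan, 8.0])

# Sentinel fill, often used with tree-based models: -1 lies outside the
# observed range, so a single split can isolate the missing rows.
sentinel_filled = feature.fillna(-1)

# Median fill, often used with linear models: the filled value sits in
# the middle of the observed values (the median is 4.5 here).
median_filled = feature.fillna(feature.median())

print(feature.mean())          # mean of the observed values only
print(sentinel_filled.mean())  # pulled down by the -1 entries
print(median_filled.mean())    # stays close to the observed mean
```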
Hey @jeremy I just noticed at 17:00 you explain how to retrieve datasets from Kaggle and upload them to your deep learning instance. I find the process a bit overwhelming for something really simple to do.
If you’re interested I’ve created a library to automatically download Kaggle datasets from code.
The library needs a Kaggle login/password pair. If you don’t have one (for example, because you registered through Google OAuth), you can create one by logging out and clicking on “password recovery”.
The univariate distribution of the feature does indeed change by introducing a zero. You are right. However, when you use approaches like matrix factorization, the absence of a feature in a row is similar to having a zero. The libsvm and libfm data formats treat both cases the same way. These algorithms factorize the data matrix into smaller matrices, from which the latent factors are obtained.
Hence, if you use the factorization approach, you are good to go with zero-based imputation. To re-iterate, the downstream algorithm has a say in your imputation strategy.
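As a rough illustration of why a zero and a missing entry look the same in a libsvm-style sparse format, here is a tiny hand-rolled formatter. `to_libsvm_line` is a hypothetical helper written for this post, not part of libsvm or libfm, and it assumes the usual sparse convention of writing only non-zero `index:value` pairs.

```python
def to_libsvm_line(label, features):
    """Format one row as 'label index:value ...' with 1-based indices.

    Hypothetical helper: entries that are zero or missing (None) are
    skipped, which is the sparse convention assumed in this sketch.
    """
    pairs = [
        f"{index}:{value:g}"
        for index, value in enumerate(features, start=1)
        if value is not None and value != 0
    ]
    return " ".join([str(label)] + pairs)


row_with_zero = [0.5, 0.0, 2.0]      # second feature imputed with zero
row_with_missing = [0.5, None, 2.0]  # second feature simply absent

print(to_libsvm_line(1, row_with_zero))
print(to_libsvm_line(1, row_with_missing))
```

Both rows serialize to the same line, so a model reading this format cannot tell an imputed zero from an absent feature.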