Another treat! Early access to Intro To Machine Learning videos

Two doubts I have after watching the first video are:
1) When we impute the missing values, is it worthwhile to think about dividing the data into groups (like holidays and working days) and imputing with the group median rather than the column median, or does the fact that we are assuming something is important introduce unnecessary bias?

2) When we impute every missing categorical variable with zero, aren’t we skewing the data away from its original distribution? Why are we not imputing with the most common value?

2 Likes

Thank you so much!! :smiley:

Thank you! Going to go through it this afternoon!

You could always go deeper into the data, figure out “local groups”, and assign group medians to the missing values. There’s nothing wrong with this strategy; it’s one valid way of doing it, and you can always use cross-validation to see what works.
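As a rough sketch of the group-median idea (toy data, made-up column names, assuming pandas):

```python
import numpy as np
import pandas as pd

# Made-up toy data: sales with missing values, grouped by day type.
df = pd.DataFrame({
    'day_type': ['holiday', 'holiday', 'working', 'working', 'working'],
    'sales':    [100.0, np.nan, 250.0, 240.0, np.nan],
})

# Broadcast each group's median back to its rows, then fill the gaps.
group_median = df.groupby('day_type')['sales'].transform('median')
df['sales'] = df['sales'].fillna(group_median)
print(df)
```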

Whether you are actually skewing the data depends on the kind of model you use and the way that model treats the imputed value.

For example, if you use tree-based models, imputing with -1 is a commonly used strategy, and one intuition for why it works is that the model treats all these missing values as a separate level of the feature. It’s like one faulty machine in a large production line not recording this variable: the values are missing from your dataset, and your model treats all of them as originating from that one unit.
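A minimal sketch of that -1 strategy (toy data, assuming scikit-learn):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data: fill missing values with -1 so the trees can isolate
# them with a single split (e.g. sensor <= -0.5).
X = pd.DataFrame({'sensor': [3.2, None, 4.1, None, 5.0]})
y = [10, 7, 12, 6, 14]

X_filled = X.fillna(-1)  # every missing row now shares one sentinel value
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_filled, y)
```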

However, if you use models based on linear equations (linear models / SVMs / neural networks), a mean-, median-, or zero-based approach is preferred, since the optimization is strongly affected by the imputed values. Because the mean or median doesn’t shift the center of the feature’s distribution, you could take this approach.
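And a corresponding sketch for a linear model using median imputation (again toy data, assuming scikit-learn’s SimpleImputer):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# The median stays inside the feature's existing range, so the fitted
# line isn't dragged toward an out-of-range sentinel like -1.
model = make_pipeline(SimpleImputer(strategy='median'), LinearRegression())
model.fit(X, y)
print(model.predict([[3.0]]))
```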

13 Likes

Thanks for the clarifications. Yes, I will try various approaches on the data and use cross-validation to see what works best.

Can you clarify how the zero-based approach doesn’t change the distribution of the feature?

Hey @jeremy, I just noticed that at 17:00 you explain how to retrieve datasets from Kaggle and upload them to your deep learning instance. I find that process a bit overwhelming for something that should be really simple to do.
If you’re interested, I’ve created a library to automatically download Kaggle datasets from code.

For the library: if you don’t have a Kaggle login/password pair (for instance if you registered via Google OAuth), you can create one by logging out and clicking “password recovery” from there.

Hope it helps somehow :slight_smile:

15 Likes

Apologies for not being very clear.

The univariate distribution of the feature does indeed change when you introduce a zero; you are right. However, when you use approaches like matrix factorization, the absence of a feature in a row is equivalent to it being zero. The data formats of libsvm and libFM treat both cases identically. These algorithms factorize the sparse matrix into sub-matrices, from which latent factors are obtained.

Hence, if you use the factorization approach, you are good to go with zero-based imputation. To reiterate: the downstream algorithm has a say in your imputation strategy.
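To make the “absence equals zero” point concrete, here’s a tiny sketch (assuming scipy): a sparse matrix simply doesn’t store zeros, exactly like an omitted feature in a libsvm-format line.

```python
import numpy as np
from scipy import sparse

# Two dense rows; the zeros below are not stored in the sparse form,
# just as libsvm format would omit them, e.g. row 0 is "1 1:1.0 3:3.0".
dense = np.array([[1.0, 0.0, 3.0],
                  [0.0, 0.0, 2.0]])
csr = sparse.csr_matrix(dense)
print(csr.nnz)  # 3 stored values; a written 0 and an absent feature look the same
```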

2 Likes

Matrix factorization?

Awesome, thank you! I’m particularly looking forward to your Lesson 2 on Random Forest interpretation.

1 Like

@jeremy, you keep taking chances sharing things with us early. Thanks for believing in us!

1 Like

That’s covered in our existing Computational Linear Algebra course.

2 Likes

@jeremy
Do you suggest studying that course as well for a better understanding?

Great questions!

That could be useful, yes, if you have that information. Better still would be to create a categorical column with that grouping information.

Distribution doesn’t really matter for a random forest; it doesn’t make distributional assumptions. Since we’re adding a “{name}_na” column, the decision trees can split on that as required.
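For anyone following along, a sketch of that idea in plain pandas (roughly what fastai’s proc_df does under the hood; the column name here is made up):

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, None, 30.0, None]})

# Record where the value was missing, then fill with the median;
# the trees can split on price_na if missingness itself is informative.
df['price_na'] = df['price'].isnull()
df['price'] = df['price'].fillna(df['price'].median())
print(df)
```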

7 Likes

For linear-based models, handling missing values is much harder and requires a lot of domain-specific detail. My advice in this class was specific to tree-based models like RFs.

7 Likes

If you have the time, it certainly covers some useful foundations for deep learning, but they only come up when you’re doing fairly advanced stuff.

2 Likes

No, that’s only for USF’s masters students.

Ah, right. You did mention at the start of the class that RF doesn’t make distribution assumptions. I didn’t connect the dots; it makes sense now.

Thank you @jeremy, this is awesome indeed. I appreciate it all.

That looks nicely done! I was planning to show kg download but your tool looks better :slight_smile:

2 Likes

Can we automate it to always download the most recently launched competition’s dataset (provided it’s within a predefined size)?

1 Like