Another treat! Early access to Intro To Machine Learning videos

jeremy · October 27, 2017, 3:48am

We’ll be officially releasing a new course (tentatively) called Machine Learning For Coders soon (it’s not up on the website yet). It’s being recorded with the masters students at MSAN. I’ve decided to share the videos with you all as well, since for those of you that haven’t done any ML before, you might find this a helpful additional resource. It uses the same fastai library as our deep learning course, so your existing repo and AWS instance will work fine.

Here are the videos. I’ll update this thread as they become available:

Please use this thread for any questions about the Machine Learning videos, or related ML topics.

vikram · October 27, 2017, 3:51am

Thank you so much!

A_TF57 · October 27, 2017, 3:51am

The day just keeps getting better! Thanks @jeremy!

anandsaha · October 27, 2017, 4:11am

That’s bonus after bonus, @jeremy! Thank you for all this

-Anand

nafizh · October 27, 2017, 4:37am

Thank you Jeremy. This is great. Are the notebooks also available?

nafizh · October 27, 2017, 4:46am

Aaah, just found it.

github.com

fastai/fastai/blob/master/courses/ml1/lesson1-rf.ipynb

abdel · October 27, 2017, 5:50am

Ha, I was actually about to ask about this after seeing the ml1 directory and the notebooks in the repository. It’s great that you’re sharing this, will be very helpful, thanks!

yayan · October 27, 2017, 6:36am

Thanks @jeremy, what topics do you cover in this course? Having an outline would be nice.

DavideBoschetto · October 27, 2017, 6:39am

One treat each day, you’re spoiling us!
Thanks a lot, it will be useful to see some of the basics again! I have a colleague who would be VERY interested in watching this: can we maybe watch these together?

ecdrid · October 27, 2017, 6:59am

### Awesome Blogs Explaining Decision Trees

Hope it’s Useful…

ar_ai · October 27, 2017, 7:33am

Two doubts I have after watching the first video are:
1) When we impute missing values, is it worthwhile to divide the data into groups (like holidays and working days) and impute with the group median rather than the column median? Or does assuming that such a grouping is important introduce unnecessary bias?

2) When we impute every missing categorical variable with zero, aren’t we skewing the data away from its original distribution? Why aren’t we imputing with the most common value?

Ekami · October 27, 2017, 7:54am

Thank you so much!!

arjunrajkumar · October 27, 2017, 8:36am

Thank you! Going to go thru this today afternoon!

binga · October 27, 2017, 9:31am

You could always go deeper into the data, figure out “local groups” and assign group medians to missing values. There’s nothing wrong with this strategy. It’s just your way of doing it and you can always use cross-validation to see what works.
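As a rough illustration of the difference between the two strategies, here is a minimal pandas sketch. The `is_holiday` and `sales` columns and their values are made up for the example and do not come from the course dataset.

```python
import numpy as np
import pandas as pd

# Toy data: the "is_holiday" grouping and "sales" values are hypothetical.
df = pd.DataFrame({
    "is_holiday": [True, True, True, False, False, False],
    "sales": [10.0, np.nan, 14.0, 100.0, 120.0, np.nan],
})

# Option 1: fill every gap with the median of the whole column.
col_filled = df["sales"].fillna(df["sales"].median())

# Option 2: fill each gap with the median of the row's own group.
# transform("median") returns one value per row, aligned with df's index.
group_median = df.groupby("is_holiday")["sales"].transform("median")
grp_filled = df["sales"].fillna(group_median)

print(pd.DataFrame({"column_median": col_filled, "group_median": grp_filled}))
```

Which of the two works better is an empirical question, so you would still compare them with cross-validation on the real data.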

The premise that you are skewing the data depends on the kind of model you use and the way your model treats the imputed value.

For example, if you use tree-based models, imputing with -1 is a commonly used strategy and one intuition why it works is “your model treats all these missing values as a separate level by itself”. It’s like one faulty machine in a large production line not recording this variable and hence they are missing in your dataset and your model is treating all these missing values as originating from one unit.

However, if you use linear-equation-based models (linear models, SVMs, neural networks), a mean / median / 0 based approach is preferred, since the linear optimization is severely affected by the imputation process. Hence, as the mean / median largely preserves the distribution of the feature, you could take this approach.
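To see what each fill value does to a feature's summary statistics, here is a small sketch using only the standard library. The numbers are invented, and `impute` is a hypothetical helper written for this example.

```python
from statistics import mean, median, pstdev

# Toy feature with two missing readings (None); the values are made up.
raw = [4.0, 5.0, None, 6.0, None, 9.0]
observed = [x for x in raw if x is not None]

def impute(values, fill):
    """Replace every None with the given fill value."""
    return [fill if v is None else v for v in values]

strategies = {
    "sentinel -1": impute(raw, -1.0),
    "median": impute(raw, median(observed)),
    "zero": impute(raw, 0.0),
}

# Compare each imputed column against the observed values alone.
print(f"{'observed only':>13}: mean={mean(observed):.2f} sd={pstdev(observed):.2f}")
for name, filled in strategies.items():
    print(f"{name:>13}: mean={mean(filled):.2f} sd={pstdev(filled):.2f}")
```

On this toy column the median fill keeps the median where it was but shrinks the spread, while the -1 and 0 fills pull the mean down. How much that matters depends on the downstream model.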

ar_ai · October 27, 2017, 9:40am

Thanks for the clarifications. Yes, I will try to test various approaches on the data and use cross-validation to see what works best.

Can you clarify how a 0-based approach doesn’t change the distribution of the feature?

Ekami · October 27, 2017, 9:40am

Hey @jeremy I just noticed at 17:00 you explain how to retrieve datasets from Kaggle and upload them to your deep learning instance. I find the process a bit overwhelming for something really simple to do.
If you’re interested I’ve created a library to automatically download Kaggle datasets from code.

To use the library, if you don’t have a Kaggle login/password pair (for example, because you registered through Google OAuth), you can create one by logging out and clicking on “password recovery”.

Hope it helps somehow

binga · October 27, 2017, 10:07am

Apologies for not being very clear.

The univariate distribution of the feature does indeed change by introducing a zero. You are right. However, when you use approaches like matrix factorization, the absence of a feature in a row is similar to having a zero. The libsvm and libfm data formats treat both cases the same way. These algorithms factorize the dense matrix into sub-matrices and latent factors are obtained.

Hence, if you use the factorization approach, you are good to go with zero-based imputation. To re-iterate, the downstream algorithm has a say in your imputation strategy.
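The point about sparse formats can be sketched in a few lines. The libsvm text format writes a row as `label index:value ...` and lists only the non-zero features. `to_libsvm_line` below is a hypothetical helper written for this example, not a function from libsvm or libfm.

```python
def to_libsvm_line(label, features):
    """Format one row as 'label idx:val ...' with 1-based indices.

    Hypothetical helper for illustration; not part of libsvm or libfm.
    """
    parts = [str(label)]
    for idx, val in enumerate(features, start=1):
        if val is None or val == 0:
            continue  # sparse formats store only non-zero entries
        parts.append(f"{idx}:{val}")
    return " ".join(parts)

row_with_gap = [3.5, None, 1.0]   # second feature was never recorded
row_with_zero = [3.5, 0.0, 1.0]   # second feature imputed with zero

line_gap = to_libsvm_line(1, row_with_gap)
line_zero = to_libsvm_line(1, row_with_zero)
print(line_gap)
print(line_zero)
```

Both rows serialize to the same line, so a model reading this format cannot tell a missing value from an imputed zero.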

ecdrid · October 27, 2017, 11:53am

Matrix factorization?

rrherr · October 27, 2017, 12:32pm

Awesome, thank you! I’m particularly looking forward to your Lesson 2 on Random Forest interpretation.

bhollan · October 27, 2017, 1:25pm

@jeremy, you keep taking chances sharing things with us early. Thanks for believing in us!