Another treat! Early access to Intro To Machine Learning videos


(Jeremy Howard) #1

We’ll be releasing a new course (tentatively) called Machine Learning For Coders towards the end of the year. It’s being recorded with the masters students at MSAN. I’ve decided to share the videos with you all as well, since those of you who haven’t done any ML before might find them a helpful additional resource. It uses the same fastai library as our deep learning course, so your existing repo and AWS instance will work fine.

Here are the videos - I’ll update this thread as they become available:

Please use this thread for any questions about the Machine Learning videos, or related ML topics. And of course do not share these videos with anyone outside the course. My ability to share things like this depends on everyone being careful about the privacy of these pre-release materials.


(Vikram Kalabi) #2

Thank you so much!


(Ankit Goila) #3

The day just keeps getting better! Thanks @jeremy! :smiley:


(Anand Saha) #4

That’s bonus after bonus, @jeremy! Thank you for all this :slight_smile:

-Anand


(Nafiz Hamid) #5

Thank you Jeremy. This is great. Are the notebooks also available?


(Nafiz Hamid) #6

Aaah, just found it.


(Abdelrahman Ahmed) #7

Ha, I was actually about to ask about this after seeing the ml1 directory and the notebooks in the repository. It’s great that you’re sharing this, will be very helpful, thanks!


(Brayan Impata) #8

Thanks @jeremy, what topics do you cover in this course? Having an outline would be nice :wink:


(Davide Boschetto) #9

One treat each day, you’re spoiling us!
Thanks a lot, it will be useful to see some of the basics again! I have a colleague who would be VERY interested in watching this: can we maybe watch these together?


(Aditya) #10

### Awesome Blogs Explaining Decision Trees :slight_smile:

  1. https://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
  2. http://scikit-learn.org/stable/modules/tree.html
  3. http://dataaspirant.com/2017/04/21/visualize-decision-tree-python-graphviz/
  4. https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
  5. http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Hope it’s useful…


#11

Two doubts I have after watching the first video:
1) When we impute the missing values, is it worthwhile to divide the data into groups (like holidays and working days) and impute with the group median rather than the column median, or does assuming that a grouping is important introduce unnecessary bias?

2) When we impute every missing categorical variable with zero, aren’t we skewing the data away from its original distribution? Why are we not imputing with the most common value?


(Tuatini GODARD) #12

Thank you so much!! :smiley:


(Arjun Rajkumar) #13

Thank you! Going to go through it this afternoon!


(Phani Srikanth) #14

You could always go deeper into the data, figure out “local groups”, and assign group medians to the missing values. There’s nothing wrong with this strategy; it’s just one way of doing it, and you can always use cross-validation to see what works.
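A minimal pandas sketch of that idea (the `sales` values and the `is_holiday` grouping column are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: 'sales' has gaps, 'is_holiday' defines the local groups.
df = pd.DataFrame({
    'is_holiday': [0, 0, 0, 1, 1, 1],
    'sales':      [10.0, 12.0, np.nan, 50.0, np.nan, 54.0],
})

# Plain column-median imputation.
df['sales_col_med'] = df['sales'].fillna(df['sales'].median())

# Group-median imputation: each gap is filled with its own group's median.
df['sales_grp_med'] = df['sales'].fillna(
    df.groupby('is_holiday')['sales'].transform('median')
)

print(df)
```

Cross-validating a model trained on each variant would then tell you which imputation actually helps.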

Whether you are skewing the data depends on the kind of model you use and on how your model treats the imputed value.

For example, if you use tree-based models, imputing with -1 is a commonly used strategy, and one intuition for why it works is that the model treats all these missing values as a separate level in their own right. It’s as if one faulty machine in a large production line never recorded this variable, so the values are missing in your dataset, and your model treats all of them as originating from that one unit.

However, if you use models built on linear combinations of the inputs (linear models / SVMs / neural networks), a mean / median / 0 based approach is preferred, since the optimization is strongly affected by the imputation process. Hence, as the mean / median doesn’t change the distribution of the feature much, you could take this approach.
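To make the contrast concrete, here is a small sketch (the column values are made up) of -1 imputation for a tree-based model next to median imputation for a linear-style model:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan])

# Tree-friendly: missing values become a 'level' of their own at -1,
# which a single split such as x <= -0.5 can isolate cleanly.
x_tree = x.fillna(-1)

# Linear-model-friendly: fill with the median so the imputed rows
# don't pull the fitted coefficients toward an arbitrary extreme.
x_linear = x.fillna(x.median())

print(pd.DataFrame({'raw': x, 'tree': x_tree, 'linear': x_linear}))
```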


#15

Thanks for the clarifications. Yes, I will try to test various approaches on the data and use cross-validation to see what works best.

Can you clarify how a 0-based approach doesn’t change the distribution of the feature?


(Tuatini GODARD) #16

Hey @jeremy, I just noticed that at 17:00 you explain how to retrieve datasets from Kaggle and upload them to your deep learning instance. I find the process a bit cumbersome for something that should be really simple to do.
If you’re interested, I’ve created a library to automatically download Kaggle datasets from code.

To use the library: if you don’t have a Kaggle username/password pair (for instance, if you registered via Google OAuth), you can create one by logging out and clicking “password recovery” from there.

Hope it helps somehow :slight_smile:
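For anyone curious what such a downloader does under the hood, here is a rough sketch using `requests`. The login URL, form-field names, and download URL are assumptions about how the Kaggle site worked at the time, not a documented API, so treat this as an illustration of the approach rather than Tuatini’s actual library:

```python
import requests

# Assumed endpoints and form fields -- not a documented Kaggle API.
LOGIN_URL = 'https://www.kaggle.com/account/login'
DATA_URL = ('https://www.kaggle.com/c/bluebook-for-bulldozers/'
            'download/Train.zip')

def download_kaggle_file(username, password, url, dest):
    """Log in once, keep the session cookies, then stream the file to disk."""
    with requests.Session() as session:
        # Authenticate; the payload keys here are hypothetical.
        session.post(LOGIN_URL, data={'UserName': username,
                                      'Password': password})
        response = session.get(url, stream=True)
        response.raise_for_status()
        with open(dest, 'wb') as f:
            # Stream in 1 MB chunks so large datasets don't sit in memory.
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)

download_kaggle_file('me@example.com', 'hunter2', DATA_URL, 'Train.zip')
```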


(Phani Srikanth) #17

Apologies for not being very clear.

The univariate distribution of the feature does indeed change when you introduce a zero; you are right. However, when you use approaches like matrix factorization, the absence of a feature in a row is treated the same as a zero. The data formats of libsvm and libfm treat both cases identically. These algorithms factorize the data matrix into sub-matrices, from which latent factors are obtained.

Hence, if you use the factorization approach, you are good to go with zero-based imputation. To reiterate: the downstream algorithm has a say in your imputation strategy.
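As a rough illustration of what “factorize into sub-matrices” means, here is a tiny numpy sketch (the data and the rank are invented) that approximates a data matrix X as the product of two low-rank factors, U and V, by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data matrix (rows = samples, cols = features); zeros stand in
# for absent entries, as in the libsvm/libfm convention.
X = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [1., 1., 5.]])

k = 2                                              # number of latent factors
U = rng.normal(scale=0.1, size=(X.shape[0], k))    # row (sample) factors
V = rng.normal(scale=0.1, size=(X.shape[1], k))    # column (feature) factors
lr = 0.01

for _ in range(5000):
    err = X - U @ V.T        # reconstruction error X - U V^T
    dU = err @ V             # gradient step for the row factors
    dV = err.T @ U           # gradient step for the column factors
    U += lr * dU
    V += lr * dV

print(np.round(U @ V.T, 2))  # low-rank reconstruction of X
```

The factors U and V are the “sub-matrices” referred to above; note that the zeros are fitted just like any other observed value.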


(Aditya) #18

Matrix factorization?


(Ryan Herr) #19

Awesome, thank you! I’m particularly looking forward to your Lesson 2 on Random Forest interpretation.


(Brian Holland) #20

@jeremy, you keep taking chances sharing things with us early. Thanks for believing in us!