Another treat! Early access to Intro To Machine Learning videos

Yes, it's the same … I later realized that the feature logic is not working on my system, not sure why.

Can you tell me the shape of df and y?
Also, what shows up after you run proc_df — any error or output that would help figure out what's wrong?

Also, if df and y do have values, can you post the first 4–5 rows here?

Thanks,
Sumit

Thank you. It worked.

Check your memory usage.

This is amazing!!

Can anyone please help me?

Thanks,
Sumit

Hi Jeremy, since these videos are not to be found on courses.fast.ai, only on YouTube, am I right in presuming that they have not yet been formally launched for the ML course?

Hey everyone! First post here.

In the video for lesson 7, about 18 minutes in, @jeremy makes a point about imbalanced datasets. He mentions a recent paper that looked at some approaches to deal with this and concluded that oversampling from the smaller category wins out consistently.

Has anyone tracked down this paper? I’d love to have a look at it.


When should we use one-hot encoding vs. categorical codes? Can anyone help me with this?
For example: for countries, which should we use?

Use the train_cats() function, which takes your data frame and changes the categorical variables to data type (dtype) category. Then, when we call proc_df(), these categorical variables will be converted to numeric variables. It's important that they all have the same spelling etc.!

Check out 7:58 on Machine Learning 1 lesson 2 video.

Hope that helps
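For illustration of the two encodings from the question above, here's a minimal sketch in plain pandas (not the fastai implementation itself) with a hypothetical country column:

```python
import pandas as pd

# Hypothetical toy frame; train_cats/proc_df do something roughly like this.
df = pd.DataFrame({"country": ["US", "FR", "US", "DE"]})

# Categorical codes: a single integer column per feature.
df["country"] = df["country"].astype("category")
codes = df["country"].cat.codes          # DE→0, FR→1, US→2 (alphabetical order)

# One-hot encoding: one binary column per level.
onehot = pd.get_dummies(df["country"], prefix="country")

print(codes.tolist())        # [2, 1, 2, 0]
print(list(onehot.columns))  # ['country_DE', 'country_FR', 'country_US']
```

Tree-based models like random forests generally handle the integer codes fine, while linear models and neural nets usually need one-hot encoding (or embeddings), since the integer ordering is arbitrary.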

My guess is that it's one of these in this Scholar search.
The problem is they tend to be uber-technical and, in general, you have to pay for them.

https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&as_ylo=2016&as_vis=1&q=training+with+imbalanced+data+set+oversampling&btnG=

Let me know if you find anything interesting 🙂
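For reference, the basic technique being discussed — randomly oversampling the minority class until the classes balance — can be sketched in plain pandas (this is just an illustration on hypothetical toy data, not the method from any specific paper):

```python
import pandas as pd

# Toy imbalanced frame: 6 negatives, 2 positives.
df = pd.DataFrame({"x": range(8), "label": [0] * 6 + [1] * 2})

counts = df["label"].value_counts()
majority, minority = counts.idxmax(), counts.idxmin()

# Sample the minority class *with replacement* until it matches the majority.
minority_rows = df[df["label"] == minority]
oversampled = minority_rows.sample(n=counts.max(), replace=True, random_state=0)

balanced = pd.concat([df[df["label"] == majority], oversampled])
print(balanced["label"].value_counts().to_dict())  # {0: 6, 1: 6}
```

Note that oversampling should only be applied to the training set, never the validation set, or the validation metric stops reflecting the real class distribution.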

Hey, I'm also facing the very same issue. My kernel gets restarted when trying to run df, y, nas = proc_df(df_raw, 'SalePrice'). This issue is seen even with a dataframe size of 100. Were you able to resolve this issue?

Cross Posting here for visibility:

I’ll host weekly discussions starting on the 12th of August. (ML MOOC)

Please take a second to vote for timings if you're interested.

Sanyam


In addition to those, I had to install:

  • pip install isoweek
  • pip install pandas-summary

To find out which dependencies you might be lacking, it is useful to start an interactive Python session by typing "python" at the command-line prompt and then running
from fastai.imports import *
at the Python ">>>" prompt until you get no errors. Any errors there should show the missing dependencies.
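The same check can be done programmatically. Here's a small sketch using the standard library's importlib; the dependency list is a hypothetical one for the fastai 0.7 ML notebooks, so adjust it as needed:

```python
import importlib.util

def find_missing(deps):
    """Return the subset of module names that cannot be imported."""
    return [name for name in deps if importlib.util.find_spec(name) is None]

# Hypothetical dependency list; extend with whatever your notebooks import.
deps = ["numpy", "pandas", "sklearn", "isoweek", "pandas_summary"]

missing = find_missing(deps)
if missing:
    print("pip install " + " ".join(missing))
```

This avoids the import-one-error-at-a-time loop: you get the full list of missing packages in a single pass.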

In one of the lectures, Jeremy showed a library for interpreting random forests, but if I remember correctly he said that he didn't know of a library that did the same for neural nets. A few days ago I came across a new library called SHAP, which apparently can interpret not only RFs but any ML model. Has anyone tried it?

Repo: https://github.com/slundberg/shap


Can anyone help me with this?

I am listening to lesson 5 and I am not sure I understand the extrapolation section. So you try to predict your validation-set records (in my case, I have a holdout set), then you take the feature importance of those, try dropping each feature, and run the model like that. At that point I would expect you to keep the columns whose removal makes the score worse and drop anything whose removal makes it better, but Age, which doesn't hurt when dropped, is still kept in. Why is this not also dropped?

I have tried implementing this in a real-world scenario and I am not finding any columns that improve the model when they shouldn't. When I predict a previous month and remove all the data after that point, I get fairly decent results, but when I try to predict the following month, the results are not as good. I suspect data leakage of some sort, but I haven't tracked it down yet.


I have used it, and from what I can tell it does a pretty good job. I don't think it directly interprets the model, though; I believe it uses a simplified version of the model to make its estimates. So sometimes the feature-importance list from the model will differ from the SHAP library's. Overall, I definitely think it has a lot of potential. I've been using it with XGBoost with pretty decent success.


Yes, the issue is with creating the feather file. Not sure of the internal cause, but I stopped creating the feather file and that solved the issue for me.

I do not have any ML experience. Should I watch these ML lectures, or do DL1 & 2? Is this ML course complete? That is, are all the materials in the Jupyter notebooks, or is it a work in progress? I saw there are only 5 notebooks in the ML course repository.
Also, I have almost completed the first lecture. Should I know all the details of scikit-learn, pandas, etc.?
I have done a bit of matplotlib. My problem with libraries is that I keep forgetting the module/library-specific commands because there are so many of them. Similarly, in the lecture there are a lot of attributes and dot notations; how does one remember all that? The documentation is there, but how will I use features if I don't know/remember them at the right time?
Also, after watching one lecture, how much time should I dedicate to going through the notebook/Kaggle datasets before watching other lectures?

I am thinking of going through this MOOC to gain a better understanding of pandas, scikit-learn, data cleaning, etc. (https://courses.edx.org/courses/course-v1:UCSanDiegoX+DSE200x+1T2018/course/)
Is that needed, or should I just go through the notebooks?