Wiki thread: lesson 1

(Gary Allison) #23

I am still having the problem of the kernel restarting at the proc_df line in “lesson1-rf” on my Paperspace machine. The crash appears to be triggered by the 4th line of code in proc_df:

else: df = df.copy()

Does that ring any bells with anyone?

I’m not having any problems running the notebook on my (v. slow) notebook, so I am slogging ahead with the lesson.

Thanks for any help!

(Gary Allison) #24

Looks like it is the ‘feathered’ version of the dataframe that is causing the crash.

For now, instead of saving in feather format, I’m using a pickled version, and it seems to work.
At the end of “Initial processing”:

os.makedirs('tmp', exist_ok=True)

then, in pre-processing:

df_raw = pd.read_pickle('tmp/bulldozers-pkl')
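
For anyone trying the same workaround, here is a minimal sketch of the full pickle round trip. The toy dataframe and its columns are made up for illustration, and a temporary directory stands in for the 'tmp' folder used above.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real df_raw; the values are illustrative only.
df_raw = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "UsageBand": pd.Categorical(["Low", "High", "Medium"]),
})

# A temporary directory keeps the sketch self-cleaning; the post above
# writes to a local 'tmp' directory instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bulldozers-pkl")
    df_raw.to_pickle(path)            # save instead of writing feather
    df_loaded = pd.read_pickle(path)  # load it back later

# Pickle keeps values and dtypes, including the categorical column.
round_trip_ok = df_loaded.equals(df_raw)
dtype_preserved = df_loaded["UsageBand"].dtype == df_raw["UsageBand"].dtype
```

Pickle files are tied to Python and pandas, so this is probably best treated as a local cache rather than a format for sharing data.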

Also, I reinstalled feather with:

conda install -c conda-forge feather-format

and that seems to have eliminated the problem (though there are new deprecation warnings).

(Avinash Singh Pundhir) #25

Thanks for sharing the workaround. I will try this and see if this works for me.


How does one submit to Kaggle for example in the House Prices competition?

Someone earlier answered by linking to the 3rd lesson of DL1, but it does not help much for the machine learning category.

Once I have my model which predicts ‘SalePrice’, what do I have to do to get the predictions for each house and save them to a CSV file along with the houses’ ids?


You might want to open a new thread.

But maybe this explanation of the expected format already helps:
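
As far as I know, the House Prices competition expects a CSV with two columns, Id and SalePrice. Below is a rough sketch of building such a file with pandas. The test rows and the `preds` array are made-up stand-ins; in practice `preds` would come from something like `m.predict(...)` on the processed test set.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for the competition's test set (only Id matters here).
df_test = pd.DataFrame({"Id": [1461, 1462, 1463],
                        "LotArea": [11622, 14267, 13830]})

# Made-up stand-in for the model's predictions on the test set.
preds = np.array([120000.0, 155000.0, 180000.0])

# Pair each Id with its predicted SalePrice, using brackets to pick the column.
submission = pd.DataFrame({"Id": df_test["Id"], "SalePrice": preds})

# Written to an in-memory buffer here; a real submission would use
# submission.to_csv("submission.csv", index=False).
buf = io.StringIO()
submission.to_csv(buf, index=False)
csv_text = buf.getvalue()
```

Passing `index=False` matters, since otherwise pandas writes the row index as an extra unnamed column.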


If you still look for an answer: you need to access the column with brackets.



While looking at the pandas documentation, I see a method called “get_dummies”:

which can convert categorical values to indicator/dummy variables.

I ran it on the bulldozer dataset and the output is similar to “one hot encoding”.

So, I am wondering - which is a better method out of the two? Using train_cats to extract category codes or using get_dummies?


I have added a link to a Kaggle kernel for lesson 1 to the lesson resources section, so anyone can run lesson 1 by forking the kernel (only a free Kaggle account is needed).

(Xoel López) #31

Couldn’t replacing the NaN values with the median introduce lookahead bias? Wouldn’t it be better to use fillna(method='pad') or to replace them with the rolling median? Or is the effect negligible?
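
To make the options concrete, here is a small sketch on a toy time-ordered series (the numbers are made up). It compares a global median fill, a forward fill, and an expanding median, which is used here as one possible variant of the rolling median idea. In recent pandas versions `ffill()` is the preferred spelling of `fillna(method='pad')`.

```python
import numpy as np
import pandas as pd

# Toy series, assumed to be sorted by time.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 100.0])

# Global median: every gap gets 30.0, which depends on later rows too.
global_median = s.fillna(s.median())

# Forward fill: each gap takes the most recent earlier value.
forward_filled = s.ffill()

# Expanding median: each gap takes the median of the values seen so far.
expanding_median = s.fillna(s.expanding().median())
```

Only the first option lets the value 100.0 influence how the earlier gaps are filled, which is the lookahead effect in question. Whether that matters in practice presumably depends on how much the column drifts over time.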


(chandan) #32

I am getting a MemoryError when trying to fit the model in lesson 1. Any ideas?
I am running the whole thing on an AWS EC2 micro instance.

MemoryError Traceback (most recent call last)
1 m = RandomForestRegressor(n_jobs=-1)
----> 2 m.fit(df, y)
3 m.score(df,y)

Solved it. On the free tier you can’t expect to train on the whole dataset with so little RAM. Try fitting on 20-50k rows, as in the later parts of the notebook where Jeremy prototypes on small sets.
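
A rough sketch of what fitting on a subset could look like. The helper name `take_random_subset` is hypothetical (it is not a fastai function), and the data is random filler standing in for the processed dataframe and target.

```python
import numpy as np
import pandas as pd

# Random filler standing in for the processed features and target.
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.normal(size=1000)})
y = rng.normal(size=1000)

def take_random_subset(df, y, n, seed=42):
    """Hypothetical helper: pick n rows at random, keeping df and y aligned."""
    idxs = np.sort(np.random.default_rng(seed).choice(len(df), size=n, replace=False))
    return df.iloc[idxs].copy(), y[idxs]

# On the real data this would be something like 20,000 to 50,000 rows.
df_small, y_small = take_random_subset(df, y, 200)
```

The model would then be fit on `df_small` and `y_small` instead of the full `df` and `y`.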

(Tabish Shaikh) #33

When we use proc_df it returns x, y, nas. I read the documentation for nas, and it says: “nas: returns a dictionary of which nas it created, and the associated median.” Can someone explain what a future use of nas could be?
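
One plausible use is to fill missing values in a validation or test set with the medians computed on the training set, so both sets are processed the same way. The sketch below shows the idea in plain pandas. `fill_with_medians` is a hypothetical stand-in, not fastai's actual code, and the column name and values are made up. I believe proc_df accepts the dictionary back through an na_dict argument, but please check the signature in your fastai version.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"MachineHours": [100.0, np.nan, 300.0, 500.0]})
test = pd.DataFrame({"MachineHours": [np.nan, 250.0]})

def fill_with_medians(df, na_dict=None):
    """Hypothetical stand-in for the NaN handling step (not fastai's code)."""
    df = df.copy()
    if na_dict is None:
        # First call: record the median of every column that has missing values.
        na_dict = {col: df[col].median()
                   for col in df.columns if df[col].isna().any()}
    for col, median in na_dict.items():
        df[col + "_na"] = df[col].isna()   # remember which rows were missing
        df[col] = df[col].fillna(median)   # fill with the recorded median
    return df, na_dict

train_filled, nas = fill_with_medians(train)           # medians from training data
test_filled, _ = fill_with_medians(test, na_dict=nas)  # reused on the test data
```

Here the missing test value is filled with 300.0, the training median, rather than with a median computed from the test set itself.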

(AnthonyC) #34

I have a quick question: when I try to launch an instance in AWS and search for Ubuntu free tier, I only see three options: “Ubuntu Server 18.04 LTS (HVM) SSD Volume Type”, “.NET Core 2.1 with Ubuntu Server 18.04 - Version 1.0”, and “Ubuntu Server 16.04 LTS (HVM) SSD Volume Type”. Which one should I pick for this course? Thanks!

(chandan) #35

I am using the one below, and it has been working fine so far.
Ubuntu Server 18.04 LTS (HVM), SSD Volume Type - ami-0d773a3b7bb2bb1c1

You need to select a micro instance and 30 GB of SSD storage to stay in the free tier. Also, do not select any paid services such as extra monitoring or a static IP, or you will incur costs.
By the way, keep in mind that on a free tier micro instance you will never be able to train on the whole dataset, so you need to keep the number of samples low every time.

(AnthonyC) #36

Thanks! BTW, has anyone used “fastai-part1v2-p2 - ami-8c4288f4” before? It is free tier as well. What is the difference compared to “Ubuntu Server 18.04 LTS”?

(Gaurav Kolekar) #37

ModuleNotFoundError Traceback (most recent call last)
----> 1 from fastai.imports import *
2 from fastai.structured import *
4 from pandas_summary import DataFrameSummary
5 from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

ModuleNotFoundError: No module named 'fastai'

I get this error when I try to execute the imports. Please advise. I followed the installation steps described by Jeremy.

(Manish Kumar) #38

I think you should try
df[col] in place of df.col

Let me know if that worked.
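
A small sketch of the difference, on a made-up dataframe: when the column name is stored in a variable, brackets use the variable's value, while dot access looks for an attribute literally called "col".

```python
import pandas as pd

# Made-up dataframe for illustration.
df = pd.DataFrame({"SalePrice": [1, 2], "YearMade": [2000, 2005]})
col = "YearMade"

# Brackets use the value of the variable, so this selects 'YearMade'.
by_brackets = df[col]

# Dot access looks for an attribute literally named 'col', which does not exist.
try:
    df.col
    attribute_lookup_failed = False
except AttributeError:
    attribute_lookup_failed = True
```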

(Manish Kumar) #39

As Jeremy said in the lecture, get_dummies will create three columns with 1 and 0 as values, but it will not capture any order among these categories. In the bulldozer example, the category has an order like High > Medium > Low, so we need an ordered numerical representation; train_cats does this for us, marking the values as 2, 1 and 0 respectively.
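
To illustrate the two representations, here is a sketch in plain pandas rather than train_cats itself. The column values are made up, and the category order is set explicitly here as an assumption about how the codes end up ordered.

```python
import pandas as pd

# Made-up values for the UsageBand column.
df = pd.DataFrame({"UsageBand": ["High", "Low", "Medium", "Low"]})

# One indicator column per category; no order is encoded between them.
dummies = pd.get_dummies(df["UsageBand"])

# A single integer column whose codes follow an explicit order:
# Low -> 0, Medium -> 1, High -> 2.
ordered = pd.Categorical(df["UsageBand"],
                         categories=["Low", "Medium", "High"],
                         ordered=True)
codes = ordered.codes
```

With the integer codes, a tree can separate High from the rest with a single split such as `code > 1`, whereas the dummy columns give it three unrelated features to choose from.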