Wiki thread: lesson 1

I am still having the problem of the kernel restarting at the proc_df line in “lesson1-rf” on my Paperspace machine. The crash appears to be triggered by the 4th line of code in proc_df:

else: df = df.copy()

Does that ring any bells with anyone?

I’m not having any problems running the notebook on my (very slow) laptop, so I am slogging ahead with the lesson.

Thanks for any help!

Looks like it is the ‘feathered’ version of the dataframe that is causing the crash.

For now, instead of saving in feather format, I’m using a pickled version and it seems to work.
At the end of “Initial processing”:

os.makedirs('tmp', exist_ok=True)
df_raw.to_pickle('tmp/bulldozers-pkl')

then, in pre-processing:

df_raw = pd.read_pickle('tmp/bulldozers-pkl')

Also, I reinstalled feather with:

conda install -c conda-forge feather-format

and that seems to have eliminated the problem (though there are new deprecation warnings).


Thanks for sharing the workaround. I will try this and see if it works for me.

How does one submit to Kaggle, for example in the House Prices competition?

Someone earlier answered by linking to the 3rd lesson of DL1, but it does not help much for the machine learning category.

Once I have my model which predicts ‘SalePrice’, what do I have to do to get the predictions for each house and save them to a CSV file along with the houses’ ids?


You might want to open a new thread.

But maybe this explanation of the expected format already helps: https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation
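If it helps, here is a rough sketch of generating a submission file. It assumes (as in the lesson) that your model m was trained on log(SalePrice), that nas is the dictionary returned by proc_df on the training set, and that the test file has an Id column; adjust names and paths to your own notebook:

import numpy as np
import pandas as pd

df_test = pd.read_csv(f'{PATH}test.csv')
test_ids = df_test.Id

apply_cats(df_test, df_raw)                          # reuse the training set's category codes
df_test_proc, _, _ = proc_df(df_test, na_dict=nas)   # fill NaNs with the training medians

preds = m.predict(df_test_proc)
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': np.exp(preds)})  # undo the log
submission.to_csv('submission.csv', index=False)

The CSV then has exactly the two columns (Id, SalePrice) that the evaluation page describes.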

If you are still looking for an answer: you need to access the column with brackets.

df_raw[col]

While looking at the pandas documentation, I see a method called “get_dummies”:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

which can convert categorical values to indicator/dummy variables.

I ran it on the bulldozers dataset and the output is similar to one-hot encoding.

So I am wondering: which of the two is the better method? Using train_cats to extract category codes, or using get_dummies?

I have added a link to a Kaggle kernel for lesson 1 in the lesson resources section, so anyone can run lesson 1 by forking that kernel (only a free Kaggle account is needed).

Doesn’t replacing the NaN values with the median incur lookahead bias? Wouldn’t it be better to do fillna(method='pad') or to replace them with a rolling median? Or is the effect negligible?
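To illustrate what I mean, on one bulldozers column (just an example):

# Median fill uses the whole column, i.e. information from future rows too:
df.MachineHoursCurrentMeter.fillna(df.MachineHoursCurrentMeter.median())

# Forward fill ('pad') only propagates past values, assuming rows are sorted by date:
df.MachineHoursCurrentMeter.fillna(method='pad')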

Thanks!!

I am getting a MemoryError when trying to fit the model in lesson 1. Any ideas?
I am running the whole thing on an AWS EC2 micro instance.


MemoryError                               Traceback (most recent call last)
<ipython-input> in <module>
      1 m = RandomForestRegressor(n_jobs=-1)
----> 2 m.fit(df, y)
      3 m.score(df, y)

Solved it: on the free tier you can’t expect to train on the whole dataset with so little RAM. Try fitting on 20-50k samples, and run the later parts of the code where Jeremy prototypes on small subsets.
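For instance (a sketch using proc_df’s subset parameter to draw a random sample, roughly as Jeremy does in the notebook):

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000)  # 30k-row random sample
m = RandomForestRegressor(n_jobs=-1)
m.fit(df_trn, y_trn)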


When we use proc_df it returns X, y and nas. I read the documentation and it says that nas “returns a dictionary of which nas it created, and the associated median”. Can someone explain what a future use of nas could be?
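For example, is it meant to be passed back in when processing the test set, something like this?

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')   # records which columns were filled, and their medians
df_test_proc, _, _ = proc_df(df_test, na_dict=nas)  # fill the test set's NaNs with the training medians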

Morning,
I have a quick question: when I try to launch an instance in AWS and search for Ubuntu free tier, I only see three options: “Ubuntu Server 18.04 LTS (HVM) SSD Volume Type”, “.NET Core 2.1 with Ubuntu Server 18.04 - Version 1.0” and “Ubuntu Server 16.04 LTS (HVM) SSD Volume Type”. Which one should I pick for this course? Thanks!

I am using the one below and it has been working fine so far:
Ubuntu Server 18.04 LTS (HVM), SSD Volume Type - ami-0d773a3b7bb2bb1c1

You need to select a micro instance and 30 GB of SSD storage to stay in the free tier. Also, do not select any paid services (extra monitoring, static IP, etc.) or you will incur costs.
Btw, keep in mind that on a free-tier micro instance you will never be able to train on the whole dataset, so you need to keep the number of samples low every time.

Thanks! BTW, has anyone used “fastai-part1v2-p2 - ami-8c4288f4” before? It is free tier as well; what is the difference compared to “Ubuntu Server 18.04 LTS”?

I think you should try
df[col] in place of df.col

Let me know if that worked.
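The difference, roughly: df.col looks for a literal attribute named col, while the brackets use the string stored in your variable (column name here is just an example from the bulldozers data):

col = 'YearMade'
df_raw[col]   # looks up the column named by the variable col, i.e. df_raw['YearMade']
df_raw.col    # AttributeError: there is no column literally called 'col'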

As Jeremy said in the lecture, get_dummies will create three columns with 1s and 0s as values, but there is no order among those categories. In the bulldozers example, UsageBand has an order, High > Medium > Low, so we need an ordered numerical representation; train_cats does this for us, marking the values as 2, 1 and 0 respectively.
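A toy illustration of the two encodings on UsageBand:

import pandas as pd

df = pd.DataFrame({'UsageBand': ['High', 'Low', 'Medium', 'Low']})

# One-hot encoding: three unordered 0/1 indicator columns
pd.get_dummies(df['UsageBand'])

# Ordered category codes: a single column that preserves Low < Medium < High
df['UsageBand'] = df['UsageBand'].astype('category')
df['UsageBand'] = df['UsageBand'].cat.set_categories(['Low', 'Medium', 'High'], ordered=True)
df['UsageBand'].cat.codes   # Low=0, Medium=1, High=2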

ImportError                               Traceback (most recent call last)
<ipython-input> in <module>
----> 1 from fastai.imports import *
      2 from fastai.structured import *
      3
      4 from pandas_summary import DataFrameSummary
      5 from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

~\Miniconda3\envs\fastai\lib\site-packages\fastai\__init__.py in <module>
----> 1 from .basic_train import *
      2 from .callback import *
      3 from .callbacks import *
      4 from .core import *
      5 from .basic_data import *

~\Miniconda3\envs\fastai\lib\site-packages\fastai\basic_train.py in <module>
      1 "Provides basic training and validation with Learner"
----> 2 from .torch_core import *
      3 from .basic_data import *
      4 from .callback import *
      5

~\Miniconda3\envs\fastai\lib\site-packages\fastai\torch_core.py in <module>
      1 "Utility functions to help deal with tensors"
----> 2 from .imports.torch import *
      3 from .core import *
      4
      5 AffineMatrix = Tensor

~\Miniconda3\envs\fastai\lib\site-packages\fastai\imports\__init__.py in <module>
      1 from .core import *
----> 2 from .torch import *

~\Miniconda3\envs\fastai\lib\site-packages\fastai\imports\torch.py in <module>
----> 1 import torch, torch.nn.functional as F
      2 from torch import ByteTensor, DoubleTensor, FloatTensor, HalfTensor, LongTensor, ShortTensor, Tensor
      3 from torch import nn, optim, as_tensor
      4 from torch.utils.data import BatchSampler, DataLoader, Dataset, Sampler, TensorDataset

~\Miniconda3\envs\fastai\lib\site-packages\torch\__init__.py in <module>
     74     pass
     75
---> 76 from torch._C import *
     77
     78 __all__ += [name for name in dir(_C)

ImportError: DLL load failed: The specified module could not be found.

Is anybody else facing this error? Please let me know.

Hello,

This post is intended for anyone interested in following this course in Google Colab.

I have written up the steps I took to set up a workspace in Google Colab, as well as how to download and interact with the data using Kaggle’s API.

From there, you will be able to follow along with the course and notebook. Here is the link, and I hope this will be helpful.


I am also facing the same issue

In the first lesson of the Intro to Machine Learning Course, the Jupyter Notebook has a variable

PATH = "data/bulldozers/"

But I did not find any directory named data in the “ml1” folder.
Do we have to download a Kaggle dataset for the course, or have I missed something?
Thank you
