Wiki/lesson thread: Lesson 3


Is it normal to get memory errors on a P4000 machine in Paperspace? If so, would upgrading to a P5000 machine fix the problem?



Hi guys, just like @divon, I’m having similar “MemoryError” issues when running proc_df on the Favorita groceries dataset. I’ve tried downgrading pandas to 0.20.3, but to no avail. I’ve been plagued by this problem for a long time and can’t seem to find a solution anywhere :face_with_raised_eyebrow:

The error (truncated) looks like:

MemoryError        Traceback (most recent call last)
<timed exec> in <module>

~/fastai/courses/ml1/fastai/ in proc_df(df, y_fld, skip_flds, ignore_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    448     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    449     df = pd.get_dummies(df, dummy_na=True)
--> 450     df = pd.concat([ignored_flds, df], axis=1)
    451     res = [df, y, na_dict]
    452     if do_scale: res = res + [mapper]

~/.conda/envs/fastai/lib/python3.6/site-packages/pandas/core/reshape/ in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    223                        keys=keys, levels=levels, names=names,
    224                        verify_integrity=verify_integrity,
--> 225                        copy=copy, sort=sort)
    226     return op.get_result()

I saw some suggestions to run the process in “chunks”, but I’m not sure how to go about that with proc_df either. Can somebody help, please? Thanks!

PS: I’m running on a gcloud compute instance with 8 vCPUs (Intel Broadwell), 52 GB of memory, and 1 x NVIDIA Tesla K80. I’ve been able to successfully run other datasets such as bulldogs, as well as some other Kaggle competition datasets.
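
On the “chunks” suggestion: the sketch below does not chunk proc_df itself (splitting it naively could give each chunk different category codes or dummy columns). It only shows one way to lower memory before proc_df runs, by reading a CSV in pieces and downcasting numeric columns. The tiny in-memory CSV, the `load_downcast` helper, and the chunk size are all hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_text = "id,store_nbr,unit_sales\n" + "\n".join(
    f"{i},{i % 5},{i * 0.5}" for i in range(10)
)

def load_downcast(source, chunksize):
    """Read a CSV in chunks and shrink numeric columns chunk by chunk."""
    parts = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # read_csv produces int64/float64 by default; pick the smallest safe type.
        for col in chunk.select_dtypes(include=["int64"]).columns:
            chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
        for col in chunk.select_dtypes(include=["float64"]).columns:
            chunk[col] = pd.to_numeric(chunk[col], downcast="float")
        parts.append(chunk)
    return pd.concat(parts, ignore_index=True)

df = load_downcast(io.StringIO(csv_text), chunksize=4)
baseline = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)
print(baseline.memory_usage().sum(), "->", df.memory_usage().sum())
```

Whether this is enough for Favorita depends on where the memory actually goes; the traceback above fails inside pd.concat, which needs room for a second copy of the frame.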




Could someone explain to me what the variables are in ax2.plot(x, m2*x + b2) when testing the validation set against Kaggle Scores?


(Nayan) #25

Hi @jeremy @terrance @chitreddysairam
I am following the machine learning class. I have a question about lesson 3, which refers to the notebook lesson-2-rf_interpretation.ipynb. I came across this code block where you talk about confidence intervals and feature importance.

x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)
x['pred'] = np.mean(preds, axis=0)

In the above code block, what is raw_valid, and where is it generated? Is it something like

_,raw_valid = split_vals(df_raw,n_trn)

Also, can someone point me to the best practices for formatting posts?
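
A minimal sketch of how the pieces in that snippet could fit together, assuming raw_valid is simply the tail of the unprocessed frame. The `split_vals` body here is a hypothetical re-implementation, and the toy data and random `preds` array stand in for the real dataset and the per-tree predictions.

```python
import numpy as np
import pandas as pd

def split_vals(a, n):
    # Hypothetical re-implementation: first n rows for training, the rest for validation.
    return a[:n].copy(), a[n:].copy()

# Toy stand-in for the raw (unprocessed) dataframe.
df_raw = pd.DataFrame({"SalePrice": np.arange(10.0), "YearMade": np.arange(2000, 2010)})
n_trn = 7
raw_train, raw_valid = split_vals(df_raw, n_trn)

# Stand-in for per-tree predictions: shape (n_trees, n_validation_rows).
rng = np.random.default_rng(0)
preds = rng.normal(loc=5.0, scale=1.0, size=(4, len(raw_valid)))

x = raw_valid.copy()
x["pred_std"] = np.std(preds, axis=0)   # spread across trees, one value per row
x["pred"] = np.mean(preds, axis=0)      # average across trees, one value per row
print(x)
```

The point of using the raw frame rather than the processed one would be that its categorical columns are still human-readable when grouping by them.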


(Vishal Srivastava) #26

Suppose I have two datasets, train and test. In the train set, I have a categorical column, Country, which has 3 distinct categories, whereas in the test set the same column has only two unique categories.

So, if I run proc_df with max_n_cat = 5 on train, the Country column will be converted into three binary columns. Running the same on test will convert the Country column into two binary columns. That means we now have a mismatch in the number of columns between train and test, and this mismatch can cause problems when predicting on the test set.

Do we have any solution for this? If it is already covered, can someone direct me towards it?

As of now, I am merging train and test and then running proc_df to get an equal number of columns, but this approach makes na_dict void.
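
The mismatch described above can be reproduced with plain pandas. The sketch below is not what proc_df does internally beyond the pd.get_dummies call visible in the traceback earlier in the thread; the country values are made up, and aligning with `reindex` is just one possible workaround.

```python
import pandas as pd

# Toy data: three countries in train, only two in test.
train = pd.DataFrame({"Country": ["IN", "US", "UK", "IN"]})
test = pd.DataFrame({"Country": ["US", "IN"]})

train_d = pd.get_dummies(train, dummy_na=True, dtype=int)
test_d = pd.get_dummies(test, dummy_na=True, dtype=int)
print(train_d.columns.tolist())  # one column per train category, plus the NaN indicator
print(test_d.columns.tolist())   # one column fewer

# Force the test frame onto the training columns; missing dummies become all-zero.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```

Because the training columns act as the template, the training set never has to be merged with the test set, so anything computed from train alone stays valid.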


(C Sairam Sandeep) #27

After identifying the groups for which the confidence interval is not so good, what steps do we need to take to correct the model? How do we tweak our model so that only these groups are affected?

I faced a similar problem when I tried the Titanic problem on Kaggle. Once I had trained the model and run it on a validation set, I extracted information on the false positives and false negatives.
I saw some particular combinations of features appearing in both sets, but I do not know how to proceed after this, i.e. how do I tweak the model once I have this knowledge?




You can use apply_cats to do the same. Here is its documentation:

"""Changes any columns of strings in df into categorical variables using trn as
a template for the category codes.

df: A pandas dataframe. Any columns of strings will be changed to
    categorical values. The category codes are determined by trn.

trn: A pandas dataframe. When creating a category for df, it looks up the
    what the category's code were in trn and makes those the category codes
    for df.

(Vishal Srivastava) #29

apply_cats didn’t serve the purpose. :frowning:

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a']})
   col1 col2
0     1    a
1     2    b
2     3    a

df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['b', 'a', 'e']})
   col1 col2
0     1    b
1     2    a
2     3    e

e isn’t present in df

apply_cats(df2, df)

   col1 col2
0     1    b
1     2    a
2     3  NaN

e is replaced by NaN.

If a new value gets changed into NaN when running the model on the test data, doesn’t that impact performance?
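
The NaN behaviour can be reproduced with pandas alone. This is a sketch of the mechanism, assuming apply_cats amounts to building a categorical from the training categories; it is not copied from the library.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "a"]})
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": ["b", "a", "e"]})

# Categories learned from the training frame only.
train_categories = df["col2"].astype("category").cat.categories

# Values outside the known categories cannot be encoded, so they become NaN.
df2["col2"] = pd.Categorical(df2["col2"], categories=train_categories, ordered=True)

codes = df2["col2"].cat.codes
print(df2)
print(codes.tolist())  # the unseen value 'e' gets the code -1
```

So an unseen value ends up in the same bucket as a missing one. The model has never seen 'e' during training, so it has no basis for treating it differently anyway.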



This happens because the training data is not a true representation of the test data, so I am not sure how valid the results are going to be.
As a workaround to get started: combine the categorical column from the training and test datasets, run train_cats on the result, and then use the resulting DataFrame with apply_cats() on both training and test.

So in this example:
df_c = pd.concat([df_train[['col2']], df_test[['col2']]], ignore_index=True)
train_cats(df_c)

and then
apply_cats(df_train, df_c)
apply_cats(df_test, df_c)
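
For anyone without the fastai helpers at hand, the same workaround can be sketched in plain pandas by building one shared categorical dtype from the union of both columns. The frames mirror the example earlier in the thread; the `shared` dtype is a name introduced here for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df_train = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "a"]})
df_test = pd.DataFrame({"col1": [1, 2, 3], "col2": ["b", "a", "e"]})

# Union of the values seen in either frame, in a fixed (sorted) order.
all_values = pd.concat([df_train["col2"], df_test["col2"]], ignore_index=True)
shared = CategoricalDtype(categories=sorted(all_values.dropna().unique()), ordered=True)

# Both frames now use identical category codes.
df_train["col2"] = df_train["col2"].astype(shared)
df_test["col2"] = df_test["col2"].astype(shared)

print(df_train["col2"].cat.codes.tolist())
print(df_test["col2"].cat.codes.tolist())
```

This keeps 'e' from turning into NaN, but the caveat above still applies: the model is never trained on any row with 'e', so its prediction for that category is not informed by data.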



Hi guys, when I try to read the data from the train.csv file with

df_all = pd.read_csv(f'{PATH}train.csv', parse_dates=['date'], dtype=types,

and pass this dictionary of types,
types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'int8',
         'unit_sales': 'float32',
         'onpromotion': 'object'}
I get the following error:

ValueError: Integer column has NA values in column 2
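
This error generally means a column declared as a plain integer type contains at least one empty cell, since NumPy integers cannot represent NaN. The sketch below reproduces it on a made-up three-row CSV and shows two possible ways around it; which column is affected in the real file is not something the sketch can tell. The nullable "Int8" dtype assumes a reasonably recent pandas.

```python
import io
import pandas as pd

# Made-up CSV with one missing store_nbr value.
csv_text = "id,store_nbr,item_nbr\n1,1,100\n2,,200\n3,3,300\n"

# A plain integer dtype cannot hold the missing value, so parsing fails.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"store_nbr": "int8"})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Option 1: use a float dtype, which stores the gap as NaN.
as_float = pd.read_csv(io.StringIO(csv_text), dtype={"store_nbr": "float32"})

# Option 2: use pandas' nullable integer dtype (capital "I").
as_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"store_nbr": "Int8"})

print(as_float["store_nbr"].tolist())
print(as_nullable["store_nbr"].isna().sum())
```

It is also worth checking that the CSV being read is the expected one, since a different column order or a partially downloaded file could produce the same message.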


(abhi) #33

I ran into a problem when calling add_datepart: the kernel quickly takes up 17 GB of RAM and then dies. There are a lot of records. Is that the problem?




I think I may have the same issue as you. I’m not sure which line of code is responsible, but my Google Colab crashed. May I know if you were able to solve this? Thanks.

df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None)) 
add_datepart(df_all, 'date')
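
One possible angle, assuming the memory spike comes from many new columns being added to a very large frame: extract only the date parts that are needed and store them in small integer types. This is a hand-rolled sketch on a six-row toy frame, not add_datepart's implementation, and the column names are chosen here for illustration.

```python
import pandas as pd

# Toy stand-in for the large frame.
df_all = pd.DataFrame({"date": pd.date_range("2017-01-01", periods=6, freq="D")})

# Each part fits comfortably in a small integer type.
df_all["Year"] = df_all["date"].dt.year.astype("int16")
df_all["Month"] = df_all["date"].dt.month.astype("int8")
df_all["Day"] = df_all["date"].dt.day.astype("int8")
df_all["Dayofweek"] = df_all["date"].dt.dayofweek.astype("int8")  # Monday = 0
df_all["Is_month_end"] = df_all["date"].dt.is_month_end           # boolean, 1 byte per row

print(df_all.dtypes)
print(df_all.memory_usage(deep=True))
```

An int8 column costs one byte per row versus eight for int64, which adds up over a hundred million rows or more.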

(abhi) #35

@andrew77 Nope, Andrew, I got busy with other problems and never got around to solving it. However, you can follow the thread here:

I wrote some code to break down the dataset. It is buggy and I didn’t solve the problem, but people have given the right guidance. See if you can fix it by following their instructions.



Hi, @pvardanis,
have you solved the parallel_trees() problem?



Hey all,

I know this course was posted a while ago, but I’m finding it to be impossible to find the functions in the newest version of the library. Where are both proc_df and set_rf_samples? Is downgrading to an older version the only way to find these? Any reason why they were removed?


(Aditya) #39

When I click the above links, I get ERROR 404.


(Aditya) #40

I’m getting the same issue :frowning:
Did you find any solution?



Melissa wrote as follows:

How does one decision tree in a default random forest take a subsample, or does it train on the complete data?

If bootstrapping = False then it takes all samples without replacement, so it will have all the rows. If bootstrapping = True then it will take len(df) rows but with replacement, so there will be duplicates, which make each tree different. The default is True.

So simply, “all the rows” = len(df)?
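
The difference between the two cases can be seen with index sampling alone. The sketch below uses NumPy rather than a real forest; `n_rows` stands in for len(df).

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000  # stands in for len(df)

# With replacement: still n_rows draws, but some rows repeat and others are left out.
boot_idx = rng.integers(0, n_rows, size=n_rows)

# Without replacement: every row appears exactly once.
full_idx = rng.permutation(n_rows)

unique_fraction = len(np.unique(boot_idx)) / n_rows
print(len(boot_idx), len(np.unique(boot_idx)), unique_fraction)
print(len(full_idx), len(np.unique(full_idx)))
```

In both cases the tree sees len(df) rows. With bootstrapping, only roughly 63% of the distinct rows are present in any one tree's sample, and the rest of the sample is duplicates.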


(Jetze Baumfalk) #42

The complete API was overhauled for the 1.0 release. The old codebase is put in the ‘old’ folder on Github. Jeremy posted some instructions on how to still use the old codebase here.


(Bhoumik) #43

Is feature_importance reliable if the feature’s values are skewed (i.e., a large number of rows have the same value for a particular feature)?

Let’s say 85% of rows have the value 1 for a particular feature. After randomly shuffling that column, the value in most of those rows will remain unchanged, so the r^2 value might not change significantly, resulting in a small feature importance.
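
This can be probed on toy data. The sketch below assumes an extreme case: a binary feature that is 1 in about 85% of rows, a target that depends only on that feature, and a "model" that predicts it exactly. None of this involves a real random forest, so it only illustrates the shuffling arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Skewed binary feature: roughly 85% ones.
x = (rng.random(n) < 0.85).astype(float)
y = 2.0 * x  # toy target that depends only on x

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

x_shuffled = rng.permutation(x)
unchanged = np.mean(x_shuffled == x)  # share of rows whose value survived the shuffle

r2_before = r2(y, 2.0 * x)           # the toy "model" is exact
r2_after = r2(y, 2.0 * x_shuffled)   # same model, shuffled feature

print(unchanged, r2_before, r2_after)
```

In this setup about three quarters of the rows keep their value, yet r^2 still falls from 1 to roughly -1. The reason is that r^2 is measured relative to the variance of the target, which is itself small when the feature is skewed. With a real model and correlated features the drop may well be smaller.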