Wiki/lesson thread: Lesson 3

pvardanis · December 16, 2018, 9:23pm

Is it normal that I get memory errors on a P4000 machine in Paperspace? If so, would upgrading to a P5000 machine fix the problem?

bengsoon · December 21, 2018, 9:24am

Hi guys, just like @divon, I’m having similar “MemoryError” issues when running proc_df on the Favorita groceries dataset. I’ve tried downgrading pandas to 0.20.3, but to no avail. I’ve been plagued by this problem for a long time and can’t seem to find a solution anywhere.

The error (truncated) looks like:

----------------------------------------------------
MemoryError        Traceback (most recent call last)
<timed exec> in <module>

~/fastai/courses/ml1/fastai/structured.py in proc_df(df, y_fld, skip_flds, ignore_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    448     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    449     df = pd.get_dummies(df, dummy_na=True)
--> 450     df = pd.concat([ignored_flds, df], axis=1)
    451     res = [df, y, na_dict]
    452     if do_scale: res = res + [mapper]

~/.conda/envs/fastai/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    223                        keys=keys, levels=levels, names=names,
    224                        verify_integrity=verify_integrity,
--> 225                        copy=copy, sort=sort)
    226     return op.get_result()
    227

I saw some suggestions to run the process in “chunks”, but I’m not sure how to go about that with proc_df. Can somebody help, please? Thanks!

PS: I’m running on a gcloud compute instance with 8 vCPUs (Intel Broadwell), 52 GB of memory, and 1 x NVIDIA Tesla K80. I’ve been able to successfully run other datasets such as bulldozers, as well as some other Kaggle competition datasets.
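
For readers wondering what "chunks" could look like in practice, here is a minimal sketch using only pandas. It does not make proc_df itself chunk-aware; it only shows reading a CSV piece by piece with compact dtypes and reducing each piece before combining. The CSV text and column values are made up for illustration.

```python
import io
import pandas as pd

# Made-up stand-in for a large CSV such as train.csv.
rows = "\n".join(f"{i},{i % 3},{i * 0.5}" for i in range(10))
csv_text = "id,store_nbr,unit_sales\n" + rows

# Compact dtypes keep each chunk small in memory.
dtypes = {"id": "int64", "store_nbr": "int8", "unit_sales": "float32"}

partial_sums = []
n_chunks = 0
# chunksize makes read_csv yield DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=4):
    n_chunks += 1
    # Reduce each chunk right away so the full table is never held at once.
    partial_sums.append(chunk.groupby("store_nbr")["unit_sales"].sum())

# Combine the small per-chunk results into the final per-store totals.
totals = pd.concat(partial_sums).groupby(level=0).sum()
print(totals)
```

Whether this helps depends on the step that runs out of memory. Operations that need every row at once, such as the concat in the traceback above, still need the whole frame.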

fastai1 · December 29, 2018, 1:41pm

Hi,

Could someone explain to me what the variables are in ax2.plot(x, m2*x + b2) when testing the validation set against Kaggle Scores?
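
The thread does not show where m2 and b2 come from. A common pattern, assumed here, is that they are the slope and intercept of a least-squares line fitted with np.polyfit, so that m2*x + b2 is the fitted line evaluated at x. The validation and Kaggle scores below are made-up numbers.

```python
import numpy as np

# Made-up scores: x = validation-set score, y = Kaggle leaderboard score.
x = np.array([0.50, 0.55, 0.60, 0.65, 0.70])
y = np.array([0.52, 0.56, 0.63, 0.66, 0.71])

# Degree-1 polyfit returns (slope, intercept) of the best-fit line.
m2, b2 = np.polyfit(x, y, 1)

# These are the y-values that ax2.plot(x, m2*x + b2) would draw.
y_line = m2 * x + b2
print(m2, b2)
```

Under that assumption, x is the horizontal-axis data, m2 the slope and b2 the intercept of the trend line.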

nayang · January 14, 2019, 8:27am

Hi @jeremy @terrance @chitreddysairam
I am following the machine learning class. I have a question about lesson 3, which refers to the notebook lesson-2-rf_interpretation.ipynb. I came across this code block where you talk about confidence intervals and feature importance.

x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)
x['pred'] = np.mean(preds, axis=0)
x.Enclosure.value_counts().plot.barh();

In the above code block, what is raw_valid and where is it generated? Is it something like

_,raw_valid = split_vals(df_raw,n_trn)

Also, can someone point me to the best practices for formatting posts?
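
A minimal sketch of how raw_valid could be produced, assuming split_vals is the simple slicing helper used in the course notebooks (its body below is an assumption, and the toy df_raw is made up):

```python
import pandas as pd

def split_vals(a, n):
    # Assumed helper: first n rows for training, the rest for validation.
    return a[:n].copy(), a[n:].copy()

# Toy stand-in for the raw (unprocessed) dataframe.
df_raw = pd.DataFrame({
    "Enclosure": ["EROPS", "OROPS", "EROPS", "OROPS", "EROPS"],
    "SalePrice": [10.1, 9.8, 10.5, 9.9, 10.0],
})
n_trn = 3

raw_train, raw_valid = split_vals(df_raw, n_trn)
print(raw_valid)
```

Under that assumption, raw_valid is just the validation rows of df_raw before proc_df, which is why it still has the readable Enclosure strings.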

srivvish · January 16, 2019, 12:30pm

Suppose I have two datasets, train and test. In the train set I have a categorical column, Country, with 3 distinct categories, whereas in the test set the same column has only two unique categories.

So, if I run proc_df with max_n_cat = 5 on train, the Country column is converted into three binary columns, while running the same on test converts it into two. The column counts of train and test then no longer match, and this mismatch can cause problems when predicting on the test set.

Do we have any solution for this? If it is already covered, can someone direct me towards it?

For now, I am merging train and test and then running proc_df to get an equal number of columns, but this approach makes na_dict void.
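
One possible workaround outside proc_df, sketched with plain pandas: one-hot encode train and test separately, then reindex the test columns to match train. The toy frames and the Country values are made up. This is not how proc_df handles it, just an illustration of aligning the columns.

```python
import pandas as pd

# Toy data: train has three countries, test only two.
train = pd.DataFrame({"Country": ["US", "IN", "FR"], "x": [1, 2, 3]})
test = pd.DataFrame({"Country": ["US", "IN"], "x": [4, 5]})

train_d = pd.get_dummies(train, columns=["Country"])
test_d = pd.get_dummies(test, columns=["Country"])

# Force test to have exactly train's columns, in the same order.
# Columns missing from test (here Country_FR) are filled with 0.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_d)
```

reindex also drops any dummy column that exists only in test, so categories unseen in training end up as all zeros.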

chitreddysairam · January 18, 2019, 4:27am

After identifying the groups for which the confidence interval is not so good, what steps do we need to take to correct the model? How can we tweak the model so that only these groups are affected?

I faced a similar problem when I tried the Titanic problem on Kaggle. Once I had trained the model and run it on a validation set, I extracted information on the false positives and false negatives.
I saw some particular combinations of features appearing in either set, but I do not know how to proceed from there, i.e. how do I tweak the model once I have this knowledge?

Thank you

zerosub0 · January 26, 2019, 7:32pm

You can use apply_cats for this; here is its documentation:

"""Changes any columns of strings in df into categorical variables using trn as
a template for the category codes.

Parameters:
-----------
df: A pandas dataframe. Any columns of strings will be changed to
    categorical values. The category codes are determined by trn.

trn: A pandas dataframe. When creating a category for df, it looks up the
    what the category's code were in trn and makes those the category codes
    for df.

srivvish · January 27, 2019, 1:24am

apply_cats didn’t serve the purpose.

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a']})
df
col1 col2
0 1 a
1 2 b
2 3 a
train_cats(df)
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['b', 'a', 'e']})
df2
col1 col2
0 1 b
1 2 a
2 3 e

e isn’t present in df

apply_cats(df2, df)

df2
col1 col2
0 1 b
1 2 a
2 3 NaN

e is replaced by NaN.

If a new value gets changed into NaN when running the model on the test data, doesn’t that hurt performance?
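
To make the NaN behaviour concrete, here is a small pandas-only sketch of what happens to an unseen category. It assumes, as in the old fastai numericalize step, that categorical codes are later shifted by +1; that shift is an assumption here, not something shown in this thread.

```python
import pandas as pd

# Categories learned from the training column: ['a', 'b'].
train_col = pd.Series(["a", "b", "a"]).astype("category")
test_col = pd.Series(["b", "a", "e"])

# Re-use the training categories; 'e' is unknown, so it becomes NaN.
test_cat = pd.Categorical(test_col, categories=train_col.cat.categories)

# pandas encodes a missing category as code -1.
codes = pd.Series(test_cat).cat.codes

# Assumed +1 shift: unseen/missing values all land in bucket 0.
shifted = codes + 1
print(list(codes), list(shifted))
```

So every unseen value is treated as one shared "missing" level. The model can still predict, but it cannot tell different new categories apart.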

zerosub0 · January 28, 2019, 8:07am

This happens because the training data is not a true representation of the test data, so I am not sure how valid the results are going to be.
As a workaround to get started, combine the categorical column from the training and test datasets, run train_cats on the result, and then use the resulting DataFrame in apply_cats() on both training and test:

So in this example (note the double brackets, so that df_c is a DataFrame rather than a Series):
df_c = pd.concat([df_train[['col2']], df_test[['col2']]], ignore_index=True)
train_cats(df_c)
…
and then
apply_cats(df_train, df_c)
apply_cats(df_test, df_c)

ricky_saka · January 30, 2019, 2:52pm

Hi guys, I get an error when trying to read the data from the train.csv file:

%%time
df_all = pd.read_csv(f'{PATH}train.csv', parse_dates=['date'], dtype=types,
                     infer_datetime_format=True)

Here types is the dictionary of dtypes I pass in:
types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'int8',
         'unit_sales': 'float32',
         'onpromotion': 'object'}
I get the following error:

ValueError: Integer column has NA values in column 2
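
This error generally means a column declared as a plain integer dtype contains missing values, which NumPy integer types cannot hold. The sketch below reproduces it on a tiny made-up CSV, then shows two ways to investigate: read without the dtype and count the NAs, or use a pandas nullable integer dtype such as 'Int8' (this needs a reasonably recent pandas). It does not diagnose why the real train.csv has missing values.

```python
import io
import pandas as pd

# Made-up CSV with one missing store_nbr value.
csv_text = "id,store_nbr,item_nbr\n0,1,100\n1,,200\n2,3,300\n"

# A plain int8 dtype cannot represent the missing value, so this raises.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"store_nbr": "int8"})
    failed = False
except ValueError as err:
    failed = True
    print("read_csv failed:", err)

# Option 1: read without forcing the dtype and count missing values.
df = pd.read_csv(io.StringIO(csv_text))
na_counts = df.isna().sum()
print(na_counts)

# Option 2: nullable integer dtype (capital I) keeps integers and allows NA.
df_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"store_nbr": "Int8"})
print(df_nullable.dtypes)
```

If the real file shows missing values in a column that should never be empty, it may be worth checking that the download or extraction of train.csv completed.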

abhimanyuaryan · February 4, 2019, 3:50am

I ran into a problem when calling add_datepart: the kernel quickly takes 17 GB of RAM and then dies. Is the problem that there are too many records?

andrew77 · February 19, 2019, 5:03am

@abhimanyuaryan,

I think I may have the same issue as you. I’m not sure which line of code is responsible, but my Google Colab session crashed. May I know if you were able to solve this? Thanks.

df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None)) 
add_datepart(df_all, 'date')

abhimanyuaryan · February 19, 2019, 8:46am

@andrew77 Nope, Andrew, I got busy with other problems and never got around to solving it. However, you can follow the thread here: https://datascience.stackexchange.com/questions/45089/operating-on-a-dataset-with-125-497-040-records

I wrote some code to break the dataset down into pieces. It is buggy and I didn’t solve the problem, but people have given the right guidance there. See if you can fix it by following their instructions.

sun · April 25, 2019, 1:51am

Hi, @pvardanis,
have you solved the parallel_trees() problem?

dchisey · April 30, 2019, 4:41am

Hey all,

I know this course was posted a while ago, but I’m finding it to be impossible to find the functions in the newest version of the fast.ai library. Where are both proc_df and set_rf_samples? Is downgrading to an older version the only way to find these? Any reason why they were removed?

adityavermabm · May 1, 2019, 4:57pm

When I click the above links I get a 404 error.

adityavermabm · May 1, 2019, 4:59pm

I’m getting the same issue.
Did you find any solution?

tezzytezzy · June 16, 2019, 1:07pm

Melissa wrote as follows:

+++
How does 1 decision tree in default random forest takes sub sample or does it train on complete data?

If bootstrapping = False then it takes all samples without replacement. So it will have all the raws. If bootstrapping = True then it will take len(df) rows but with replacement. So there will be duplicates which make each tree different. Default is True
+++

Simply, “all the raws” = len(df)?
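
A small NumPy sketch of the two cases described in the quote. It only mimics the row sampling and does not call scikit-learn. Without bootstrapping each tree sees every row exactly once; with bootstrapping it draws len(df) row indices with replacement, so some rows repeat and others are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # stands in for len(df)

# No bootstrapping: every row index appears exactly once.
no_bootstrap = np.arange(n)

# Bootstrapping: n draws with replacement from the n row indices.
with_bootstrap = rng.integers(0, n, size=n)

# Fraction of distinct rows a single tree actually sees.
unique_frac = len(np.unique(with_bootstrap)) / n
print(len(no_bootstrap), len(with_bootstrap), unique_frac)
```

Both samples have length len(df), but with replacement only about 1 - 1/e (roughly 63%) of the rows are distinct in each draw.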

Jetze · June 28, 2019, 9:52am

The complete fast.ai API was overhauled for the 1.0 release. The old codebase was moved to the ‘old’ folder on GitHub. Jeremy posted some instructions on how to still use the old codebase here.

bshah · July 24, 2019, 9:47am

Is feature_importance reliable if the feature’s values are skewed (i.e., a large number of rows share the same value for that feature)?

Let’s say 85% of rows have the value 1 for a particular feature. After randomly shuffling that column, the values in most rows will remain unchanged. Hence the r^2 value might not change significantly, resulting in a small feature importance.
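
The effect can be checked with a quick simulation on a made-up binary column, with no model involved. If a fraction p of rows is 1, a row keeps its value after shuffling when both the original and the shuffled entry are 1 or both are 0, which happens for about p^2 + (1-p)^2 of the rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Made-up skewed feature: roughly 85% ones, 15% zeros.
col = (rng.random(n) < 0.85).astype(int)

# Shuffle the column, as permutation importance does.
shuffled = rng.permutation(col)

# Fraction of rows whose value is the same before and after shuffling.
unchanged = float(np.mean(col == shuffled))

# Expected fraction from the observed share of ones.
p = col.mean()
expected = p**2 + (1 - p) ** 2
print(unchanged, expected)
```

For p = 0.85 this is about 0.745, so roughly a quarter of the rows do change. How much the importance score shrinks then depends on how strongly the target depends on that feature in those rows.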