Wiki/lesson thread: Lesson 3


#22

Is it normal that I get memory errors on a P4000 machine on Paperspace? If so, would upgrading to a P5000 machine fix the problem?


#23

Hi guys, just like @divon, I’m having similar “MemoryError” issues when running proc_df on the Favorita grocery dataset. I’ve tried downgrading pandas to 0.20.3, but to no avail. I’ve been plagued by this problem for a long time and can’t seem to find a solution anywhere :face_with_raised_eyebrow:

The error (truncated) looks like:

----------------------------------------------------
MemoryError        Traceback (most recent call last)
<timed exec> in <module>

~/fastai/courses/ml1/fastai/structured.py in proc_df(df, y_fld, skip_flds, ignore_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    448     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    449     df = pd.get_dummies(df, dummy_na=True)
--> 450     df = pd.concat([ignored_flds, df], axis=1)
    451     res = [df, y, na_dict]
    452     if do_scale: res = res + [mapper]

~/.conda/envs/fastai/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    223                        keys=keys, levels=levels, names=names,
    224                        verify_integrity=verify_integrity,
--> 225                        copy=copy, sort=sort)
    226     return op.get_result()
    227 

I saw some suggestions to run the process in “chunks”, but I’m not sure how to go about that with proc_df either. Can somebody help, please? Thanks!

PS: I’m running on a gcloud compute instance with 8 vCPUs (Intel Broadwell), 52 GB of memory and 1 x NVIDIA Tesla K80. I’ve been able to successfully run other datasets such as Bulldozers, as well as some other Kaggle competition datasets.
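In case it helps anyone hitting the same wall, here is a minimal sketch (not from the course; the variable names are assumptions) of shrinking the numeric dtypes before calling proc_df, so the intermediate copies made by get_dummies and the concat have a better chance of fitting in RAM. The traceback above also shows that proc_df accepts a subset argument, which lets you work on a sample first.

import numpy as np
import pandas as pd

def downcast_numeric(df):
    # Convert int64/float64 columns to the smallest dtype that holds the data.
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df

df_all = downcast_numeric(df_all)   # df_all is assumed to be the Favorita dataframe
# df_trn, y_trn, nas = proc_df(df_all, 'unit_sales')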


#24

Hi,

Could someone explain what the variables in ax2.plot(x, m2*x + b2) are when testing the validation set against Kaggle scores?
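For anyone else wondering, here is a hedged sketch of what that line usually represents. The data values below are made up purely for illustration, and np.polyfit is my assumption about where m2 and b2 come from: x holds the scores on your own validation set, y the matching Kaggle leaderboard scores, and m2, b2 are the slope and intercept of a straight line fitted through those points, so m2*x + b2 just draws that fit.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.21, 0.23, 0.25, 0.28])   # our own validation scores (toy numbers)
y = np.array([0.22, 0.24, 0.26, 0.29])   # matching Kaggle leaderboard scores (toy numbers)
m2, b2 = np.polyfit(x, y, 1)             # slope and intercept of a least-squares line

fig, ax2 = plt.subplots()
ax2.scatter(x, y)                        # the individual submissions
ax2.plot(x, m2*x + b2)                   # the fitted line: if it is tight and straight,
                                         # the validation set tracks the leaderboard well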


(Nayan) #25

Hi @jeremy @terrance @chitreddysairam
I am following the machine learning class. I have a question about lesson 3, which refers to the notebook lesson-2-rf_interpretation.ipynb. I came across this code block when you were talking about confidence intervals and feature importance.

x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)
x['pred'] = np.mean(preds, axis=0)
x.Enclosure.value_counts().plot.barh();

In the above code block, what is raw_valid and where is it generated? Is it something like

_,raw_valid = split_vals(df_raw,n_trn)
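(For reference, a minimal sketch of how split_vals is defined and used in the fastai ML notebooks, reproduced from memory, so treat it as an assumption rather than the notebook's exact code:)

def split_vals(a, n): return a[:n].copy(), a[n:].copy()   # first n rows for training, the rest for validation

n_trn = len(df_raw) - 12000                               # 12000 is an illustrative validation size
raw_train, raw_valid = split_vals(df_raw, n_trn)          # raw_valid keeps the original, un-processed columns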

Also, can someone point me to the best practices for formatting posts?


(Vishal Srivastava) #26

Suppose I have two datasets, train and test. In the train set, I have a categorical column, Country, which has three distinct categories, whereas in the test set the same column has only two unique categories.

So, if I run proc_df with max_n_cat = 5 on the train set, the Country column will get converted into three binary columns. Likewise, running the same on the test set will convert the Country column into two binary columns. That means we now have a mismatch in the number of columns between train and test, and this mismatch can cause problems while predicting values for the test dataset.

Do we have any solution for this? If it is already covered, can someone direct me towards it?

As of now, I am merging train and test and then executing proc_df to get an equal number of columns, but this approach makes na_dict void.
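One hedged pandas-only workaround, in case it is useful (apply_cats, mentioned in the reply below, is the cleaner fastai route): one-hot encode each set separately, then reindex the test columns against the training columns so that any missing dummy is added as zeros. The frame names df_trn and df_test here are placeholders for your own dataframes.

import pandas as pd

trn_dummies = pd.get_dummies(df_trn, dummy_na=True)
test_dummies = pd.get_dummies(df_test, dummy_na=True)

# Force test to have exactly the training columns; any dummy that never
# appeared in test (e.g. the third Country value) is filled with 0.
test_dummies = test_dummies.reindex(columns=trn_dummies.columns, fill_value=0)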


(C Sairam Sandeep) #27

After identifying the groups for which the confidence interval is not so good, what steps do we need to take to correct the model? How do we tweak our model so that only these groups will be affected?

I faced a similar problem when I tried the Titanic competition on Kaggle. Once I trained the model and fit it to a validation set, I extracted information on the false positives and false negatives.
I noticed particular combinations of features appearing in both sets, but I do not know how to proceed after this, i.e. how do I tweak the model once I have this knowledge?

Thank you


#28

You can use apply_cats for this; here is its documentation:

“”"Changes any columns of strings in df into categorical variables using trn as
a template for the category codes.

Parameters:
-----------
df: A pandas dataframe. Any columns of strings will be changed to
    categorical values. The category codes are determined by trn.

trn: A pandas dataframe. When creating a category for df, it looks up what
    the category's codes were in trn and makes those the category codes
    for df.

(Vishal Srivastava) #29

apply_cats didn’t serve the purpose. :frowning:

df = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['a', 'b', 'a']})
df
col1 col2
0 1 a
1 2 b
2 3 a

train_cats(df)
df2 = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['b', 'a', 'e']})
df2
col1 col2
0 1 b
1 2 a
2 3 e

e isn’t present in df

apply_cats(df2, df)

df2
col1 col2
0 1 b
1 2 a
2 3 NaN

e is replaced by NaN.

If a new value gets changed into NaN while running the model on the test data, doesn’t it impact the performance?
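(To see what is happening mechanically, here is a small sketch reusing the frames above; the note about numericalize is from my reading of fastai's structured.py, so double-check it against your copy:)

df2.col2.cat.codes
# 0    1    ('b')
# 1    0    ('a')
# 2   -1    (the unseen 'e' became NaN, and NaN's category code is -1)

# proc_df's numericalize then stores col.cat.codes + 1, so every category the
# training set never saw collapses into a single 0 ("missing") bucket rather
# than getting its own column or crashing the model.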


#30

This is happening because the training data is not a true representation of the test data, so I am not sure how valid the results are going to be.
As a workaround to get started, combine the categorical column from the training and test datasets, run train_cats on the result, and then use the resulting dataframe to apply_cats() on both training and test:

so in this example –
df_c = pd.concat([df_train[['col2']], df_test[['col2']]], ignore_index=True)   # double brackets keep a DataFrame, which train_cats expects
train_cats(df_c)

and then
apply_cats(df_train, df_c)
apply_cats(df_test, df_c)


(SAKA RICKY) #31

Hi guys, when I try to read the data from the train.csv file

%%time
df_all = pd.read_csv(f'{PATH}train.csv', parse_dates=['date'], dtype=types,
                     infer_datetime_format=True)

and pass this dictionary of types:

types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'int8',
         'unit_sales': 'float32',
         'onpromotion': 'object'}

I get the following error:

ValueError: Integer column has NA values in column 2
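A hedged workaround (not from the course): pandas integer dtypes cannot hold NaN, so whichever column the error points at must contain missing values; reading that column as a float (or, on pandas >= 0.24, as the nullable 'Int64' dtype) avoids the error. The choice of store_nbr below is only an example; adjust it to the column that actually has the NAs.

types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'float32',    # was 'int8'; a float column can represent the missing values
         'unit_sales': 'float32',
         'onpromotion': 'object'}
df_all = pd.read_csv(f'{PATH}train.csv', parse_dates=['date'], dtype=types,
                     infer_datetime_format=True)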


(abhi) #33

There’s a problem when I try to call add_datepart: the kernel quickly takes 17 GB of RAM and then dies. There are too many records. Is that the problem?


#34

@abhimanyuaryan,

I think I may have the same issue as you. I’m not sure which line of code is affected, but my Google Colab crashed. May I know if you were able to solve this? Thanks.

df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None)) 
add_datepart(df_all, 'date')
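(In case it helps, a hedged sketch of a lower-memory alternative to add_datepart for a frame this large: pull out only the date parts you need, with small integer dtypes, instead of the full set of int64 columns. The function name and the choice of fields below are mine, not fastai's.)

import numpy as np
import pandas as pd

def add_date_parts_small(df, fldname='date', drop=True):
    fld = df[fldname]
    df['Year']      = fld.dt.year.astype('int16')
    df['Month']     = fld.dt.month.astype('int8')
    df['Day']       = fld.dt.day.astype('int8')
    df['Dayofweek'] = fld.dt.dayofweek.astype('int8')
    # seconds since the epoch fit in int32 for dates before 2038
    df['Elapsed']   = (fld.astype('int64') // 10**9).astype('int32')
    if drop: df.drop(fldname, axis=1, inplace=True)

add_date_parts_small(df_all, 'date')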

(abhi) #35

@andrew77 Nope, Andrew, I got busy with other problems and never got around to solving it. However, you can follow the thread here: https://datascience.stackexchange.com/questions/45089/operating-on-a-dataset-with-125-497-040-records

I wrote some code to break the dataset down into chunks, but it is buggy. I didn’t solve it, but people there have given the right guidance. See if you can fix it by following their instructions.
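For anyone picking this up, here is a hedged sketch of the chunked approach suggested there; the chunk size and per-chunk steps are illustrative, and PATH, types and add_datepart are assumed from the earlier posts. Writing each processed chunk to disk instead of keeping the list in memory would reduce RAM further.

import numpy as np
import pandas as pd

chunks = []
for chunk in pd.read_csv(f'{PATH}train.csv', parse_dates=['date'], dtype=types,
                         infer_datetime_format=True, chunksize=5_000_000):
    chunk.unit_sales = np.log1p(np.clip(chunk.unit_sales, 0, None))
    add_datepart(chunk, 'date')    # or a slimmer date-part helper, as sketched above
    chunks.append(chunk)

df_all = pd.concat(chunks, ignore_index=True)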