Wiki/lesson thread: Lesson 3


Is it normal to get memory errors on a P4000 machine on Paperspace? If so, would upgrading to a P5000 machine fix the problem?


Hi guys, just like @divon, I’m having similar “MemoryError” issues when running proc_df on the Favorita groceries dataset. I’ve tried downgrading pandas to 0.20.3, but to no avail. I’ve been plagued by this problem for a long time and can’t seem to find a solution anywhere :face_with_raised_eyebrow:

The error (truncated) looks like:

MemoryError        Traceback (most recent call last)
<timed exec> in <module>

~/fastai/courses/ml1/fastai/ in proc_df(df, y_fld, skip_flds, ignore_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    448     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    449     df = pd.get_dummies(df, dummy_na=True)
--> 450     df = pd.concat([ignored_flds, df], axis=1)
    451     res = [df, y, na_dict]
    452     if do_scale: res = res + [mapper]

~/.conda/envs/fastai/lib/python3.6/site-packages/pandas/core/reshape/ in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    223                        keys=keys, levels=levels, names=names,
    224                        verify_integrity=verify_integrity,
--> 225                        copy=copy, sort=sort)
    226     return op.get_result()

I saw some suggestions to run the process in “chunks”, but I’m not sure how to go about that with proc_df either. Can somebody help, please? Thanks!

PS: I’m running on a gcloud compute instance with 8 vCPUs (Intel Broadwell), 52 GB of memory, and 1 x NVIDIA Tesla K80. I’ve been able to successfully run other datasets such as Bulldozers, as well as some other Kaggle competition datasets.
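Not an official fix, but one workaround that often helps before resorting to chunking: shrink the DataFrame’s numeric dtypes before calling proc_df, since pd.get_dummies and pd.concat both need room for a full copy. The helper and column names below are hypothetical, just to illustrate the idea:

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Shrink numeric columns to the smallest dtype that can hold their
    values; this often cuts memory use substantially before a wide
    operation like get_dummies/concat needs to copy the frame."""
    for col in df.select_dtypes(include=["int64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df

# Hypothetical Favorita-like columns for demonstration
df = pd.DataFrame({
    "unit_sales": np.random.rand(1000) * 10,
    "store_nbr": np.random.randint(1, 55, size=1000),
})
before = df.memory_usage(deep=True).sum()
df = downcast_numeric(df)
after = df.memory_usage(deep=True).sum()
print(after < before)
```

On a frame like Favorita's, where ids and counts easily fit in int8/int16 and float32, this can roughly halve the footprint before proc_df runs.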



Could someone explain to me what the variables are in ax2.plot(x, m2*x + b2) when testing the validation set against Kaggle Scores?
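I can't speak for the notebook's exact code, but m2 and b2 look like the slope and intercept of a least-squares line fitted through the (validation score, Kaggle score) pairs, e.g. via np.polyfit. A minimal sketch with made-up scores:

```python
import numpy as np

# Hypothetical score pairs: x = our validation-set scores,
# y = the corresponding Kaggle leaderboard scores
x = np.array([0.25, 0.24, 0.23, 0.22, 0.21])
y = np.array([0.26, 0.25, 0.245, 0.228, 0.22])

# A degree-1 polyfit returns the slope and intercept of the
# least-squares line through the points
m2, b2 = np.polyfit(x, y, 1)

# ax2.plot(x, m2*x + b2) would then draw that fitted line through the
# scatter of points, showing how well the validation set tracks the
# leaderboard: a roughly straight, increasing line means improvements
# on your validation set translate into improvements on Kaggle.
print(m2 > 0)
```

So in that call, x is the array of validation scores and m2*x + b2 is the fitted line's y-value at each x.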

(Nayan) #25

Hi @jeremy @terrance, I am following the machine learning class. I have a doubt about lesson 3, which refers to the notebook lesson-2-rf_interpretation.ipynb. I came across this code block where you are talking about confidence intervals and feature importance.

x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)
x['pred'] = np.mean(preds, axis=0)

In the above code block, what is raw_valid and where is it generated? Is it something like

_,raw_valid = split_vals(df_raw,n_trn)

Also, can someone point me to best practices for formatting posts?
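For context, the split in the guess above matches the tiny helper the course notebooks define (if I recall correctly): split_vals just slices off the first n rows as training data and keeps the rest for validation. A self-contained sketch with a hypothetical miniature frame:

```python
import numpy as np
import pandas as pd

# split_vals as it appears in the fastai ML notebooks: first n rows
# become the training portion, the remainder the validation portion
def split_vals(a, n):
    return a[:n].copy(), a[n:].copy()

# Hypothetical 10-row stand-in for df_raw, split 8/2
df_raw = pd.DataFrame({"SalePrice": np.arange(10)})
n_trn = 8
_, raw_valid = split_vals(df_raw, n_trn)
print(len(raw_valid))  # the last 2 rows form the validation frame
```

So yes, `_, raw_valid = split_vals(df_raw, n_trn)` would give you the raw (un-processed) rows corresponding to the validation set, which is what the pred_std/pred columns get attached to.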

(Vishal Srivastava) #26

Suppose I have two data sets, train and test. In the train set I have a categorical column, Country, with 3 distinct categories, whereas in the test set the same column has only two unique categories.

So if I run proc_df with max_n_cat=5 on the train set, the Country column will be converted into three binary columns, while running the same on the test set will convert Country into only two binary columns. That means we now have a mismatch in the number of columns between train and test, and this mismatch will cause problems when predicting on the test set.

Do we have any solution for this? If it is already covered, can someone direct me towards it?

As of now, I am merging train and test and then executing proc_df to get an equal number of columns, but this approach makes na_dict void.
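One pandas-only way around the merge (a sketch, with hypothetical frames) is to build the dummies separately and then reindex the test columns against the train columns, filling the missing dummy with zeros:

```python
import pandas as pd

# Hypothetical frames: test is missing one Country category
train = pd.DataFrame({"Country": ["US", "UK", "IN"]})
test = pd.DataFrame({"Country": ["US", "UK"]})

train_d = pd.get_dummies(train, columns=["Country"])
test_d = pd.get_dummies(test, columns=["Country"])

# Align test to train's column set; absent dummies become all-zero
# columns, so both frames have identical columns in identical order
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns) == list(train_d.columns))
```

The old fastai library also has apply_cats(df, trn), which (as I understand it) copies the training set's category codes onto the test frame before proc_df, so the dummy columns come out matching without merging the two sets, and na_dict stays usable.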

(C Sairam Sandeep) #27

After identifying the groups for which the confidence interval is not so good, what steps do we need to take to correct the model? How do we tweak the model so that only these groups are affected?

I faced a similar problem when I tried the Titanic problem on Kaggle. Once I trained the model and ran it on a validation set, I extracted information on the false positives and false negatives.
I saw particular combinations of features appearing in both sets, but I do not know how to proceed from there, i.e. how to tweak the model once I have this knowledge?
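For what it's worth, the extraction step described above can be sketched like this (all data here is made up, just Titanic-flavoured); grouping the errors by feature combinations at least tells you which segments to focus on:

```python
import pandas as pd

# Hypothetical validation frame with true labels and model predictions
valid = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "male"],
    "Pclass": [3, 1, 3, 3, 1, 3],
    "Survived": [0, 1, 1, 0, 0, 1],
    "pred": [0, 1, 0, 1, 0, 0],
})

# False negatives: predicted 0 but the passenger survived;
# false positives are the reverse
fn = valid[(valid["Survived"] == 1) & (valid["pred"] == 0)]
fp = valid[(valid["Survived"] == 0) & (valid["pred"] == 1)]

# Grouping the errors by feature combinations shows which segments the
# model misclassifies most often; those groups are candidates for new
# features or separate handling
worst = fn.groupby(["Sex", "Pclass"]).size().sort_values(ascending=False)
print(len(fn), len(fp))  # 2 false negatives, 1 false positive here
```

Once you know the worst groups, common next moves are engineering features that distinguish them (interaction terms, group-specific aggregates) and re-checking feature importance on just those rows, rather than tweaking the model globally.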