Wiki/lesson thread: Lesson 3

Hi @jeremy, I have a question about feature subsampling. The lesson 1 ipynb notes that the max_features parameter specifies the number of columns considered at each split, but I believe I read in Elements of Statistical Learning that feature subsampling is done by randomly sampling a different set of features per tree. Can you comment on which approach is better? Thanks


I think the first approach should be better. Imagine a case where you have, say, 10 total features: if you randomly pass 5 features to each tree, there is a high chance that two trees get the same 5 features, and then their predictions will be exactly the same, which we don't want because we need diverse trees.
Another caveat is that a tree is only as good as the features passed to it. Imagine a tree given a set of features that are not at all important (in terms of feature importance): that tree will try to find patterns that do not exist, and its predictions might become completely useless, because it never has access to the "important" variables at all.

But having said that, let's wait for Jeremy to answer; maybe my understanding is wrong.
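To make the two schemes concrete, here is a minimal sketch (my own illustration, not from the lesson; the numbers are arbitrary). scikit-learn's max_features resamples candidate columns at every split, whereas the per-tree variant would fix one column subset for a whole tree:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Scheme 1 (what sklearn's max_features does): at each split, a fresh random
# half of the columns is considered as split candidates.
m = RandomForestRegressor(n_estimators=40, max_features=0.5)

# Scheme 2 (the per-tree variant described in the question): each tree would
# only ever see one fixed random subset of columns, e.g. 5 out of 10.
rng = np.random.default_rng(42)
cols_for_this_tree = rng.choice(10, size=5, replace=False)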


Hi @jeremy, I have an additional question regarding feature selection. Let's say I build a random forest with max_features = 0.5 at each split. When I then calculate feature importance, isn't it a little biased, considering that my optimal split is based on only 50% of the features? Or can we say that the optimal split may be biased, but as we have so many trees and randomness at each split, this bias gets eliminated when we aggregate the feature importance across all trees?


The word “bias” doesn’t mean the same thing as “stochastic” - so no, we wouldn’t say “biased”. (That would imply that it tends to push results in some specific direction, rather than a random direction). But the second part of your comment is correct.
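A quick toy check of that intuition (my own sketch, not from the course): with many trees, per-split subsampling still concentrates importance on the truly useful feature rather than pushing results in any particular direction.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=2000)  # only column 0 truly matters

# Each split sees only half the columns, but averaging over 200 trees
# still ranks column 0 far above the noise columns.
m = RandomForestRegressor(n_estimators=200, max_features=0.5, random_state=0)
m.fit(X, y)
print(m.feature_importances_)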


I’m watching Lesson 3 and Jeremy is talking about adding columns (features) with relevant data, like maybe weather, holidays etc.

This means we will train our model on a superset of the features provided by the Kaggle competition. How does this work? Wouldn't we need those extra features when we predict values based on the model? Let's say I train my model after adding a new feature called "Store distance from airport". When I submit my model to Kaggle, how will they know to include that data when they check my model against their validation set? That would be an input needed to make a correct prediction.
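For concreteness, the constraint being reasoned about looks like this (a sketch with hypothetical names: compute_airport_distance and feature_cols are mine; note that Kaggle scores a submitted file of predictions rather than the model itself):

# Any engineered column must be computed the same way for train and test,
# because the fitted model expects it at prediction time.
df_train['dist_airport'] = compute_airport_distance(df_train)  # hypothetical helper
df_test['dist_airport'] = compute_airport_distance(df_test)

m.fit(df_train[feature_cols], y_train)
preds = m.predict(df_test[feature_cols])  # these predictions are what gets submitted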


While running line 14 (trn, y = proc_df(train, 'unit_sales')) of the Jupyter notebook for the grocery store data (see 28:30 of the video), you may experience a memory error. For me, this was resolved by rolling back my pandas version from 0.23.4 to 0.20.3. If you're using Anaconda and set up the environment as described in "Setting up your computer if it already has Anaconda installed" at Wiki thread: lesson 1, you can activate the Anaconda fastai environment (conda activate fastai), then run pip install pandas==0.20.3

I'm sure there are better ways to address the issue, but this worked for me. Note that you'll also need to change the line from "trn, y = proc_df(train, 'unit_sales')" to "trn, y, nas = proc_df(train, 'unit_sales')", because proc_df was updated and now also returns the NA dictionary.

Also, when fitting the models, if I used n_jobs = -1, I received a memory error, but when I used only 4 of my computer’s 6 cores, everything worked fine.
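In code, that just means capping n_jobs when constructing the forest (a sketch; the n_estimators value is arbitrary):

from sklearn.ensemble import RandomForestRegressor

# n_jobs=-1 spawns a worker per core, multiplying peak memory;
# capping it at 4 traded a little speed for not running out of RAM.
m = RandomForestRegressor(n_estimators=40, n_jobs=4)
m.fit(trn, y)  # trn, y from proc_df above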

Hi, I ran into the following errors during lesson 3. I'd appreciate any help.

  1. Changing pandas to version 0.20.3 didn't fix the issue. Any help? Running on a P4000 machine on Paperspace.

  2. I also get memory errors in the following part:

%time add_datepart(df_all, 'date')

  3. Calling parallel_trees() gets me the following warning:

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Is it normal that I get memory errors on a P4000 machine on Paperspace? If so, would upgrading to a P5000 machine fix the problem?

Hi guys, just like @divon, I'm having similar "MemoryError" issues when running proc_df on the Favorita groceries dataset. I've tried downgrading pandas to 0.20.3, but still to no avail. I've been plagued by this problem for a long time and can't seem to find a solution anywhere :face_with_raised_eyebrow:

The error (truncated) looks like:

----------------------------------------------------
MemoryError        Traceback (most recent call last)
<timed exec> in <module>

~/fastai/courses/ml1/fastai/structured.py in proc_df(df, y_fld, skip_flds, ignore_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    448     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    449     df = pd.get_dummies(df, dummy_na=True)
--> 450     df = pd.concat([ignored_flds, df], axis=1)
    451     res = [df, y, na_dict]
    452     if do_scale: res = res + [mapper]

~/.conda/envs/fastai/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    223                        keys=keys, levels=levels, names=names,
    224                        verify_integrity=verify_integrity,
--> 225                        copy=copy, sort=sort)
    226     return op.get_result()
    227 

I saw some suggestions to run the process in "chunks", but I'm not sure how to go about that with proc_df either. Can somebody help, please? Thanks!

ps: I'm running on a gcloud compute instance with 8 vCPUs (Intel Broadwell), 52 GB of memory, and 1 NVIDIA Tesla K80. I've been able to successfully run other datasets, such as bulldozers, as well as some other Kaggle competition datasets.
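A rough sketch of the chunked approach, under assumptions: proc_df accepts and returns an na_dict (as in the version of fastai/structured.py shown in the traceback above), and PATH and types are defined as in the notebook. Reusing the na_dict keeps the NA-indicator columns consistent across chunks, though dummy columns from get_dummies could still mismatch if a category is absent from some chunk:

import numpy as np
import pandas as pd
from fastai.structured import proc_df

chunks, ys, na_dict = [], [], None
for chunk in pd.read_csv(f'{PATH}train.csv', chunksize=1_000_000,
                         parse_dates=['date'], dtype=types):
    # Pass na_dict back in so every chunk creates the same NA columns
    trn_c, y_c, na_dict = proc_df(chunk, 'unit_sales', na_dict=na_dict)
    chunks.append(trn_c)
    ys.append(y_c)

trn = pd.concat(chunks, ignore_index=True)
y = np.concatenate(ys)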

Hi,

Could someone explain to me what the variables are in ax2.plot(x, m2*x + b2) when checking the validation set against Kaggle scores?
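A hedged guess at the context (x, m2, and b2 come from the notebook; the fitting call here is my assumption): m2 and b2 look like the slope and intercept of a least-squares line through the points, so ax2.plot(x, m2*x + b2) draws that fitted line over the scatter of validation scores versus Kaggle scores.

import numpy as np
import matplotlib.pyplot as plt

x = np.array(our_val_scores)   # assumed: scores on our own validation set
y = np.array(kaggle_scores)    # assumed: the matching Kaggle leaderboard scores
m2, b2 = np.polyfit(x, y, 1)   # slope m2 and intercept b2 of the best-fit line

fig, (ax1, ax2) = plt.subplots(1, 2)
ax2.scatter(x, y)              # the raw points
ax2.plot(x, m2 * x + b2)       # the fitted line y = m2*x + b2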

Hi @jeremy @terrance @chitreddysairam
I am following the machine learning class. I have a question about lesson 3, which refers to the notebook lesson2-rf_interpretation.ipynb. I came across this code block when you were talking about confidence intervals and feature importance.

x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)
x['pred'] = np.mean(preds, axis=0)
x.Enclosure.value_counts().plot.barh();

In the above code block, what is raw_valid and where is it generated? Is it something like

_,raw_valid = split_vals(df_raw,n_trn)

Also, can someone point me to best practices for formatting posts?

Suppose I have two data sets, train and test. In the train set I have a categorical column, Country, which has 3 distinct categories, whereas in the test set the same column has only two unique categories.

So if I run proc_df with max_n_cat = 5 on the train set, the Country column will get converted into three binary columns. Likewise, running the same on test will convert the Country column into two binary columns. That means we now have a mismatch in column counts between train and test, and this mismatch can cause problems when predicting values for the test data set.

Do we have any solution for this? If it is already covered, can someone direct me towards it?

As of now, I am merging train and test and then executing proc_df to get an equal number of columns, but this approach makes na_dict void.


After identifying the groups for which the confidence interval is not so good, what steps do we need to take to correct the model? How do we tweak the model so that only these groups are affected?

I faced a similar problem when I tried the Titanic problem on Kaggle. Once I had trained the model and fit it to a validation set, I extracted information on false positives and false negatives.
I saw some particular combinations of features appearing in both sets, but I do not know how to proceed after this, i.e. how to tweak the model once I have this knowledge.

Thank you

You can use apply_cats to do the same; here is the documentation for it:

"""Changes any columns of strings in df into categorical variables using trn as
a template for the category codes.

Parameters:
-----------
df: A pandas dataframe. Any columns of strings will be changed to
    categorical values. The category codes are determined by trn.

trn: A pandas dataframe. When creating a category for df, it looks up
    what the category's codes were in trn and makes those the category codes
    for df."""
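A minimal usage sketch (frame names are mine): call train_cats on the training frame first, then reuse its codes on the test frame.

from fastai.structured import train_cats, apply_cats

train_cats(df_train)           # turn string columns into categoricals on train
apply_cats(df_test, df_train)  # reuse train's category codes on test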

apply_cats didn’t serve the purpose. :frowning:

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a']})
df

   col1 col2
0     1    a
1     2    b
2     3    a

train_cats(df)
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['b', 'a', 'e']})
df2

   col1 col2
0     1    b
1     2    a
2     3    e

'e' isn't present in df.

apply_cats(df2, df)

df2

   col1 col2
0     1    b
1     2    a
2     3  NaN

'e' is replaced by NaN.

If a new value gets changed into NaN when running the model on the test data, doesn't that impact performance?

This is happening due to the fact that the training data is not a true representation of the testing data, so I am not sure how valid the results are going to be.
As a workaround to get started: combine the categorical column from the training and test datasets, run train_cats on the result, and then use the resulting DataFrame to apply_cats on both training and testing.

So in this example (note the double brackets, so the concat result stays a DataFrame, which is what train_cats and apply_cats expect):

df_c = pd.concat([df_train[['col2']], df_test[['col2']]], ignore_index=True)
train_cats(df_c)

and then

apply_cats(df_train, df_c)
apply_cats(df_test, df_c)

Please guys, when trying to read the data from the train.csv file:

%%time
df_all = pd.read_csv(f'{PATH}train.csv', parse_dates=['date'], dtype=types,
                     infer_datetime_format=True)

and passing the dictionary of types:

types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'int8',
         'unit_sales': 'float32',
         'onpromotion': 'object'}

I get the following error:

ValueError: Integer column has NA values in column 2
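One possible explanation (my assumption, not confirmed in the thread): pandas integer dtypes of this era cannot hold NaN, so read_csv raises this error when a column declared as int contains missing values. A workaround sketch is to read the offending column as a float instead (I'm guessing store_nbr, since it sits at column index 2 in train.csv):

types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'float32',   # was 'int8'; floats can represent the NAs
         'unit_sales': 'float32',
         'onpromotion': 'object'}
df_all = pd.read_csv(f'{PATH}train.csv', parse_dates=['date'], dtype=types,
                     infer_datetime_format=True)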

There's a problem when I try to call add_datepart: the kernel quickly takes 17 GB of RAM and then dies. There are too many records. Is that the problem?

@abhimanyuaryan,

I think I may have the same issue as you. I'm not sure which line of code is affected, but my Google Colab crashed. May I know if you were able to solve this? Thanks.

df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None)) 
add_datepart(df_all, 'date')
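If memory is the bottleneck, one hedged idea (mine, not from the thread): add_datepart adds a dozen-plus int64 columns, which on ~125 million rows means many extra GB, so downcasting them right after shrinks the steady-state footprint. If the add_datepart call itself is what kills the kernel, processing the frame in chunks (as discussed earlier in this thread) may be needed instead.

import pandas as pd

# Right after add_datepart(df_all, 'date'), shrink all int64 columns
# (including the freshly added date parts) in place
for col in df_all.select_dtypes('int64').columns:
    df_all[col] = pd.to_numeric(df_all[col], downcast='integer')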

@andrew77 Nope, Andrew, I got busy with other problems and completely missed solving it. However, you can follow the thread here: https://datascience.stackexchange.com/questions/45089/operating-on-a-dataset-with-125-497-040-records

I wrote some code to break the dataset down. It is buggy; I didn't solve it, but people there have given the right guidance. See if you can fix it by following their instructions.
