Why is the nas variable necessary?

Duncan · October 27, 2018, 6:49am

I have come across the ‘nas’ variable in two places. Why is it necessary?

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

and

df, y, nas = proc_df(df_raw, 'SalePrice')

I am trying to understand what ‘nas’ is used for.

Thanks in advance.

vvinay · October 30, 2018, 12:21am

I think nas just records which variables had missing values and what value they were replaced with.

{'auctioneerID': 2.0, 'MachineHoursCurrentMeter': 0.0}

The key is the column name and the value is the filler (which in this case I think is the median of that column).

Buddhi · October 30, 2018, 7:58am

When you use the proc_df function, it finds numerical columns that have missing values, creates an additional boolean column for each of them, and replaces the missing values with the column median. It also converts categorical columns to integer codes.

Now assume the model has been trained on a different set of data, or on a subset of the data.
In the validation or test set, that same numerical column might not have any missing values, so proc_df will not create the additional boolean column, and the model will raise an error because the columns no longer match. And if the column does have missing values, the median used to fill them might be different because it is computed on a different set of data, so the filled-in column would not mean the same thing as it did during training.

So the additional variable ‘nas’ is a dictionary whose keys are the names of the columns that had missing values and whose values are the medians used to fill them. When you process a different set of data, ‘nas’ can be passed into proc_df as an argument to make sure those specific columns are created and the missing values are replaced with the same medians.
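
To make this concrete, here is a minimal pandas-only sketch of the idea. The helper fill_missing and the toy columns are made up for illustration; it is a simplified stand-in for the missing-value step, not the actual fastai source.

```python
import numpy as np
import pandas as pd

def fill_missing(df, na_dict=None):
    """Hypothetical, simplified stand-in for the missing-value step of proc_df."""
    na_dict = {} if na_dict is None else dict(na_dict)
    df = df.copy()
    for name in list(df.columns):
        col = df[name]
        if not pd.api.types.is_numeric_dtype(col):
            continue
        # Add an indicator column if the column has gaps OR was seen with gaps before
        if col.isnull().any() or name in na_dict:
            df[name + '_na'] = col.isnull()
            filler = na_dict.get(name, col.median())  # reuse the stored median if there is one
            df[name] = col.fillna(filler)
            na_dict[name] = filler
    return df, na_dict

train = pd.DataFrame({'hours': [1.0, np.nan, 3.0, 5.0], 'year': [2001, 2002, 2003, 2004]})
valid = pd.DataFrame({'hours': [10.0, 20.0], 'year': [2005, 2006]})

train_p, nas = fill_missing(train)             # nas remembers 'hours' and its median
valid_alone, _ = fill_missing(valid)           # no gaps in valid, so no 'hours_na' column
valid_p, _ = fill_missing(valid, na_dict=nas)  # same columns as train_p

print(list(train_p.columns), list(valid_alone.columns), list(valid_p.columns))
```

Without the dictionary, the validation frame ends up one column short of the training frame; with it, the two line up.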

Duncan · November 1, 2018, 9:07am

Thanks for taking the time to reply. I appreciate it.

Duncan · November 1, 2018, 9:09am

Got it. This is well explained and I have understood. Thanks so much.

YunusDev · November 18, 2018, 2:44pm

Hello, please, how do I apply proc_df to my test data?

Duncan · November 19, 2018, 4:03am

There is now an additional return variable nas from proc_df which is a dictionary whose keys are the names of the columns that had missing values, and the values of the dictionary are the medians. Optionally, you can pass nas to proc_df as an argument to make sure that it adds those specific columns and uses those specific medians:

df, y, nas = proc_df(df_raw, 'SalePrice', na_dict=nas)

This answer is well illustrated here

Duncan · November 19, 2018, 4:08am

Also, @Buddhi’s explanation tackles your question. Check it out.

raimanu-ds · November 19, 2018, 11:58am

Hi all,

While trying to implement a quick random forest classifier in the titanic kaggle competition, I ran into an error when using proc_df on my test set.

proc_df created a 'Fare_na' column, which was not in my train set when I initially fitted my model. Therefore, when I ran m.predict on my test set, it gave me an error because the test set had an extra feature.

Wouldn’t the normal behavior of proc_df be not to create additional na columns for the test set? Or am I doing something wrong?

Duncan · November 19, 2018, 12:40pm

@raimanu-ds could you please post the error and the code associated with it?

However, I believe your code should look like this:

X_test, _, nas = proc_df(test, na_dict=nas)

raimanu-ds · November 19, 2018, 1:56pm

Here is the code:

path = '../input/'

train_set = pd.read_csv(f'{path}train.csv')
test_set = pd.read_csv(f'{path}test.csv')

train = train_set.copy()
test = test_set.copy()

train_cats(train)

X, y, nas = proc_df(train, 'Survived')

m = RandomForestClassifier(n_estimators=40, n_jobs=-1, oob_score=True, min_samples_leaf=3, max_features=0.7)
m.fit(X,y)

print('accuracy score:', m.score(X, y))
print('oob_score:', m.oob_score_)

apply_cats(test, train)

X_test, _, nas = proc_df(test, na_dict=nas)

m.predict(X_test)

Error message:

ValueError                                Traceback (most recent call last)
<ipython-input-2-928f608dce71> in <module>
     21 X_test, _, nas = proc_df(test, na_dict=nas)
     22 
---> 23 m.predict(X_test)

ValueError: Number of features of the model must match the input. Model n_features is 12 and input n_features is 13
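
One way to see where the extra feature comes from is to compare the column lists of the two processed frames. Below is a minimal sketch on toy data. The frames are made-up stand-ins for X and X_test, not the real Titanic data, and it assumes the thirteenth feature is the 'Fare_na' column mentioned above. Reindexing to the training columns is only one possible workaround, and it throws away the extra indicator column.

```python
import pandas as pd

# Toy stand-ins: training had missing Age only; the test set also had a
# missing Fare, so its processed frame gained a 'Fare_na' column.
X = pd.DataFrame({'Age': [22.0, 30.0], 'Fare': [7.25, 71.3], 'Age_na': [False, True]})
X_test = pd.DataFrame({'Age': [25.0, 30.0], 'Fare': [8.05, 14.45],
                       'Age_na': [False, False], 'Fare_na': [False, True]})

# Columns present on one side only
extra = [c for c in X_test.columns if c not in X.columns]
missing = [c for c in X.columns if c not in X_test.columns]
print('only in test:', extra, '| only in train:', missing)

# Keep exactly the training columns, in the training order;
# any column the test frame lacks would be filled with 0.
X_test_aligned = X_test.reindex(columns=X.columns, fill_value=0)
print(list(X_test_aligned.columns))
```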

Duncan · November 21, 2018, 7:40am

x_test,_,nas,mapper = proc_df(test, do_scale=True, mapper=mapper, na_dict=nas)

This has worked for me. From the documentation:

As an output

mapper: A DataFrameMapper which stores the mean and standard deviation of the corresponding continuous variables, which is then used for scaling during test time.

As an input

mapper: If do_scale is set to True, the mapper variable calculates the values used for scaling of variables during training time (mean and standard deviation).

do_scale: Standardizes each column in df. Takes Boolean values (True, False).

@raimanu-ds
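
The idea behind reusing the mapper can be sketched with plain pandas. This is an illustration only: the stats dictionary and the scale helper are hypothetical stand-ins for the mapper, not the DataFrameMapper API. The mean and standard deviation are computed once on the training data and then applied unchanged to the test data.

```python
import pandas as pd

train = pd.DataFrame({'Age': [20.0, 30.0, 40.0]})
test = pd.DataFrame({'Age': [30.0, 50.0]})

# "Fit" on the training data only: remember mean and standard deviation per column
stats = {'Age': (train['Age'].mean(), train['Age'].std(ddof=0))}

def scale(df, stats):
    """Standardize columns using previously stored (mean, std) pairs."""
    df = df.copy()
    for name, (mean, std) in stats.items():
        df[name] = (df[name] - mean) / std
    return df

train_s = scale(train, stats)
test_s = scale(test, stats)   # uses the training statistics, not the test set's own
print(test_s)
```

If the test set were scaled with its own mean and standard deviation instead, the same raw Age would map to a different scaled value than it did during training.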

Duncan · November 28, 2018, 9:28am

Hi @Buddhi, does the following statement hold true?

If the training data has more columns with missing values than the test data, you should pass na_dict (the dictionary of missing values) as an argument when handling missing values in the test set. Conversely, if the test set has more columns with missing values, the dictionary should be passed in when handling missing values in the training set.

Buddhi · November 28, 2018, 11:25am

According to Jeremy’s lectures (from the notes by Hiromi Suenaga), he states the following:

When you call proc_df on a larger dataset, you do not pass in nas but you want to keep that return value.
Later on, when you want to create a subset (by passing in subset ), you want to use the same missing columns and medians, so you pass nas in.
If it turns out that the subset was from a whole different dataset and had different missing columns, it would update the dictionary with additional key-value pairs.
It keeps track of any missing columns you came across in anything you passed to proc_df.

So I think the best practice is to first generate na_dict on the training set, then run proc_df on the test set with na_dict as both an input and an output, which will update na_dict. Then process the training set again with na_dict as both an input and an output parameter, and begin training. Does that sound right?

This all depends on whether the larger dataset is the training set. If it is the other way round, I guess you would start by applying proc_df to the test set and then to the training set.
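
The three passes described above can be sketched as follows. This is a toy illustration: fill_missing is a hypothetical stand-in for the missing-value step of proc_df, and the frames are made up and numeric-only.

```python
import numpy as np
import pandas as pd

def fill_missing(df, na_dict):
    """Hypothetical stand-in for proc_df's missing-value step; returns an updated na_dict."""
    df, na_dict = df.copy(), dict(na_dict)
    for name in list(df.columns):
        if df[name].isnull().any() or name in na_dict:
            df[name + '_na'] = df[name].isnull()
            na_dict.setdefault(name, df[name].median())  # keep an existing median if present
            df[name] = df[name].fillna(na_dict[name])
    return df, na_dict

train = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Fare': [7.0, 8.0, 9.0]})
test = pd.DataFrame({'Age': [25.0, 35.0], 'Fare': [np.nan, 10.0]})

_, nas = fill_missing(train, {})       # pass 1: learn which columns have gaps in train
X_test, nas = fill_missing(test, nas)  # pass 2: test set adds 'Fare' to the dictionary
X, nas = fill_missing(train, nas)      # pass 3: redo train with the complete dictionary

print(nas)
print(list(X.columns) == list(X_test.columns))
```

One thing to keep in mind: in this sketch the filler for Fare ends up being the test-set median, because the training set has no missing Fare to put it in the dictionary first.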

Duncan · November 28, 2018, 11:47am

Thanks @Buddhi, perfectly understood. And yes, it does sound right.