Why is the nas variable necessary?

I have found the ‘nas’ variable twice. Why is it necessary?

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

and

df, y, nas = proc_df(df_raw, 'SalePrice')

I am trying to understand what ‘nas’ is used for.

Thanks in advance.


I think nas just records which variables had missing values and what value each was replaced with.

{'auctioneerID': 2.0, 'MachineHoursCurrentMeter': 0.0}

The key is the column name and the value is the filler (which in this case I think is the median of that column).

When you use the proc_df function, it finds numerical columns that have missing values, creates an additional boolean column for each, and replaces the missing values with the column median. It also converts categorical columns to integer codes.

Suppose the model was trained on a different dataset or on a subset of the data. In either the validation or the training set, that same numerical column might have no missing values, so proc_df will not create the additional boolean column, and the model will raise an error because the features no longer match. And if the column does have missing values, the median used to fill them may differ because it is computed from a different set of data, so the filled values no longer mean the same thing as they did in training.

So the additional variable ‘nas’ is a dictionary whose keys are the names of the columns that had missing values and whose values are the medians used to fill them. When processing a different set of data, ‘nas’ can be passed into proc_df as an argument to make sure those specific columns are created and the missing values are replaced with the same medians.
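To make the explanation above concrete, here is a minimal sketch in plain pandas. The function fill_missing is a hypothetical, simplified stand-in for the missing-value step described in this thread, not the fastai code itself, and the toy data is made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fill_missing(df, na_dict=None):
    """Hypothetical stand-in for proc_df's missing-value step (not fastai code)."""
    na_dict = {} if na_dict is None else dict(na_dict)
    df = df.copy()
    for name in list(df.columns):
        col = df[name]
        if not is_numeric_dtype(col):
            continue
        # Handle the column if it has gaps OR if it was recorded earlier in na_dict
        if col.isnull().any() or name in na_dict:
            filler = na_dict.get(name, col.median())
            df[name + '_na'] = col.isnull()   # boolean indicator column
            df[name] = col.fillna(filler)     # reuse the stored median if there is one
            na_dict[name] = filler
    return df, na_dict

# Made-up data: the training frame has a gap, the validation frame does not
train = pd.DataFrame({'auctioneerID': [1.0, 2.0, None, 3.0],
                      'YearMade': [1990, 2000, 2005, 2010]})
valid = pd.DataFrame({'auctioneerID': [5.0, 7.0],
                      'YearMade': [1995, 2001]})

train_out, nas = fill_missing(train)
valid_plain, _ = fill_missing(valid)             # no 'auctioneerID_na' column
valid_out, _ = fill_missing(valid, na_dict=nas)  # same columns as train_out

print(nas)
print(list(valid_plain.columns))
print(list(valid_out.columns))
```

Without the dictionary, the validation frame ends up one column short of the training frame. With it, both frames have the same columns and the same filler.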


Thanks for taking the time to reply. I appreciate it.

Got it. This is well explained and I have understood it. Thanks so much.

Hello, please, how do I apply proc_df to my test data?

There is now an additional return variable nas from proc_df which is a dictionary whose keys are the names of the columns that had missing values, and the values of the dictionary are the medians. Optionally, you can pass nas to proc_df as an argument to make sure that it adds those specific columns and uses those specific medians:

df, y, nas = proc_df(df_raw, 'SalePrice', na_dict=nas)

This answer is well illustrated here

Also, @Buddhi’s explanation tackles your question. Check it out.

Hi all,

While trying to implement a quick random forest classifier for the Titanic Kaggle competition, I ran into an error when using proc_df on my test set.

proc_df created a 'Fare_na' column, which was not in my train set when I initially fitted my model. Therefore, when I ran m.predict on my test set, it gave me an error because the test set had an extra feature.

Wouldn’t the normal behavior of proc_df be not to create additional na columns for the test set? Or am I doing something wrong?

@raimanu-ds could you please post the error and the code associated with it?

However, I believe your code should look like this:

X_test, _, nas = proc_df(test, na_dict=nas)

Here is the code:

path = '../input/'

train_set = pd.read_csv(f'{path}train.csv')
test_set = pd.read_csv(f'{path}test.csv')

train = train_set.copy()
test = test_set.copy()

train_cats(train)

X, y, nas = proc_df(train, 'Survived')

m = RandomForestClassifier(n_estimators=40, n_jobs=-1, oob_score=True, min_samples_leaf=3, max_features=0.7)
m.fit(X,y)

print('accuracy score:', m.score(X, y))
print('oob_score:', m.oob_score_)

apply_cats(test, train)

X_test, _, nas = proc_df(test, na_dict=nas)

m.predict(X_test)

Error message:

ValueError                                Traceback (most recent call last)
<ipython-input-2-928f608dce71> in <module>
     21 X_test, _, nas = proc_df(test, na_dict=nas)
     22 
---> 23 m.predict(X_test)

ValueError: Number of features of the model must match the input. Model n_features is 12 and input n_features is 13
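The mismatch in this traceback can be reproduced and worked around without fastai. The following is a minimal sketch, assuming the processed features are pandas DataFrames. The toy frames and their values are made up for illustration, and selecting the training columns is a workaround, not something proc_df is confirmed to do for you.

```python
import pandas as pd

# Made-up frames standing in for the processed train and test features
X = pd.DataFrame({'Age': [22.0, 30.0],
                  'Fare': [7.25, 71.3],
                  'Age_na': [False, True]})
X_test = pd.DataFrame({'Age': [25.0],
                       'Fare': [8.05],
                       'Age_na': [False],
                       'Fare_na': [True]})

# Columns the model never saw during fitting
extra = [c for c in X_test.columns if c not in X.columns]
print('extra columns in test:', extra)

# Keep only the training columns, in the training order.
# This raises a KeyError if the test frame lacks a training column,
# which is usually preferable to silently predicting on misaligned features.
X_test_aligned = X_test[list(X.columns)]
print(list(X_test_aligned.columns))
```

Dropping 'Fare_na' discards the indicator, but the model was never trained on it, so it could not have used it anyway. The filled 'Fare' value is kept.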

x_test,_,nas,mapper = proc_df(test, do_scale=True, mapper=mapper, na_dict=nas)

This worked for me. From the documentation:

as an output

mapper: A DataFrameMapper which stores the mean and standard deviation of the corresponding continuous variables, which is then used for scaling during test time.

as an input

mapper: If do_scale is set to True, the mapper variable calculates the values used for scaling the variables during training time (mean and standard deviation).

do_scale: Standardizes each column in df. Takes Boolean Values(True,False)
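The idea behind reusing the mapper can be sketched in plain pandas: compute the mean and standard deviation once on the training data, then apply those same numbers to the test data. This is only an illustration of the principle, with made-up values and hypothetical names (scale_stats, scale). It does not use or reproduce the DataFrameMapper object itself.

```python
import pandas as pd

# Made-up continuous feature
train = pd.DataFrame({'Fare': [10.0, 20.0, 30.0]})
test = pd.DataFrame({'Fare': [20.0, 40.0]})

# Statistics are computed on the training data only
# (ddof=0 gives the population standard deviation)
scale_stats = {'mean': train.mean(), 'std': train.std(ddof=0)}

def scale(df, stats):
    """Standardize df with previously stored statistics."""
    return (df - stats['mean']) / stats['std']

train_scaled = scale(train, scale_stats)
test_scaled = scale(test, scale_stats)   # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

If the test set were scaled with its own mean and standard deviation, the same raw value would map to a different scaled value than it did during training.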

@raimanu-ds

Hi @Buddhi, does the following statement hold true?

If the training data has more columns with missing values than the test data, you should pass na_dict (the dictionary of missing values) as an argument when handling missing values in the test set. Conversely, if the test set has more columns with missing values, the dictionary should be passed in when handling missing values in the training set.

According to Jeremy's lectures, as recorded in the notes by Hiromi Suenaga, he states the following:

  1. When you call proc_df on a larger dataset, you do not pass in nas but you want to keep that return value.
  2. Later on, when you want to create a subset (by passing in subset ), you want to use the same missing columns and medians, so you pass nas in.
  3. If it turns out that the subset was from a whole different dataset and had different missing columns, it would update the dictionary with additional key value.
  4. It keeps track of any missing columns you came across in anything you passed to proc_df.

So I think it's best practice to first generate na_dict on the training set, then run proc_df on the test set with na_dict as both input and output, which will update na_dict. Then process the training set again with na_dict as both input and output, and begin training. Does that sound right?

This all depends on whether the larger dataset is the training set. If it's the other way round, I guess you start by applying proc_df to the test set and then to the training set.
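The round trip described above can be sketched as follows. As before, fill_missing is a hypothetical, simplified stand-in that mimics the dictionary-updating behavior from the lecture notes. It is not the fastai implementation, it assumes every column is numeric, and the data is made up.

```python
import pandas as pd

def fill_missing(df, na_dict=None):
    # Hypothetical simplified stand-in, not fastai's proc_df
    na_dict = {} if na_dict is None else dict(na_dict)
    df = df.copy()
    for name in list(df.columns):
        if df[name].isnull().any() or name in na_dict:
            filler = na_dict.get(name, df[name].median())
            df[name + '_na'] = df[name].isnull()
            df[name] = df[name].fillna(filler)
            na_dict[name] = filler
    return df, na_dict

# Made-up data: train is missing an 'Age', test is missing a 'Fare'
train = pd.DataFrame({'Age': [20.0, None, 40.0], 'Fare': [5.0, 10.0, 15.0]})
test = pd.DataFrame({'Age': [30.0, 50.0], 'Fare': [None, 8.0]})

_, nas = fill_missing(train)                # step 1: nas knows only 'Age'
_, nas = fill_missing(test, na_dict=nas)    # step 2: test adds 'Fare' to nas
X_train, nas = fill_missing(train, na_dict=nas)  # step 3: reprocess train
X_test, nas = fill_missing(test, na_dict=nas)

print(nas)
print(list(X_train.columns))
print(list(X_test.columns))
```

Note that in this sketch the 'Fare' filler comes from the test data, because that is the only place the column had a gap. Whether that is acceptable depends on your setup.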

Thanks @Buddhi, perfectly understood. And yes, it does sound right.