How to use proc_df on a test set?

YunusDev · November 18, 2018, 2:47pm

For example, on a Kaggle dataset, after using proc_df on the training set, how do I apply proc_df in the same way to the test set?

maciejkpl · December 4, 2018, 9:36pm

Hi Yunus,

Let's say your training data is called train and your test data is called test.
For your model to work, you need to change strings into categories in both. To do so:
train_cats(train)
apply_cats(df=test, trn=train)
You use train_cats to create the categories from the training set, and apply_cats so that the same words are given the same number in the test set.
So, for example, if strawberry is 1 in train, it will also be 1 in test.
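To make the idea concrete, here is a minimal pandas-only sketch of the same principle. It is not the fastai implementation; the column name fruit and the toy values are made up for illustration. The categories are learned from train and then reused, in the same order, for test, so a given word maps to the same integer code in both.

```python
import pandas as pd

# Toy data; the "fruit" column and its values are illustrative assumptions.
train = pd.DataFrame({"fruit": ["apple", "strawberry", "banana"]})
test = pd.DataFrame({"fruit": ["strawberry", "apple", "kiwi"]})

# Learn the categories from the training column only.
train["fruit"] = train["fruit"].astype("category")
train_categories = train["fruit"].cat.categories

# Reuse exactly those categories (same order) for the test column.
test["fruit"] = pd.Categorical(test["fruit"], categories=train_categories)

# Word -> integer code as seen in the training set.
train_codes = dict(zip(train["fruit"].tolist(), train["fruit"].cat.codes.tolist()))
# Codes in the test set; a word unseen in train ("kiwi") becomes -1.
test_codes = test["fruit"].cat.codes.tolist()
```

Note that a value that appears only in test has no category to map to, so pandas gives it the code -1 (it is treated as missing).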
Next, proc_df:
X, y, nas = proc_df(train, 'target')
X_test, _, nas = proc_df(test, na_dict=nas)
X, y, nas = proc_df(train, 'target', na_dict=nas)

The key here is na_dict.
Let's say your training set had NAs in column AGE, but your test set had NAs in column EYES.
If you simply use proc_df on both, you will get two different data frames: one will have an additional column AGE_NA, and the other EYES_NA.
If you pass na_dict, you also create the AGE_NA column in the second data frame. But now you don't have EYES_NA in the first data frame, so we run proc_df a third time to pass the NA columns (now both AGE_NA and EYES_NA) back to train.
As a result, both data frames will have the same additional columns that proc_df created when checking for NAs.
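The column mismatch described above can be reproduced with a small pandas-only sketch. The helper fill_na_with_flags is hypothetical and only imitates the behaviour described in this post (fill numeric NAs with the median, add a flag column, remember the fill value in a dict); it is not fastai's proc_df, and the flag suffix _na is an assumption.

```python
import pandas as pd

def fill_na_with_flags(df, na_dict=None):
    """Hypothetical stand-in for the NA handling described above (not proc_df)."""
    df = df.copy()
    na_dict = {} if na_dict is None else dict(na_dict)
    for col in list(df.columns):
        if not pd.api.types.is_numeric_dtype(df[col]):
            continue
        # Add a flag column if this frame has NAs here, or another frame did.
        if df[col].isnull().any() or col in na_dict:
            df[col + "_na"] = df[col].isnull()
            filler = na_dict.get(col, df[col].median())
            df[col] = df[col].fillna(filler)
            na_dict[col] = filler
    return df, na_dict

train = pd.DataFrame({"AGE": [20.0, None, 40.0], "EYES": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"AGE": [25.0, 35.0, 45.0], "EYES": [1.0, None, 2.0]})

# Processing each frame on its own gives different flag columns.
X_naive, _ = fill_na_with_flags(train)       # gets AGE_na only
X_test_naive, _ = fill_na_with_flags(test)   # gets EYES_na only

# Three passes, as in the post: train, then test, then train again.
X, nas = fill_na_with_flags(train)
X_test, nas = fill_na_with_flags(test, na_dict=nas)
X, nas = fill_na_with_flags(train, na_dict=nas)
```

After the third pass, X and X_test in this sketch carry the same columns in the same order, which is what the model needs at prediction time.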

The simplest code to get you going:

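import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# Assumes the fastai 0.7 library used in the course
from fastai.structured import train_cats, apply_cats, proc_df

train = pd.read_csv('…/input/train.csv')
test = pd.read_csv('…/input/test.csv')

# Turn strings into categories, reusing the training categories for test
train_cats(train)
apply_cats(df=test, trn=train)

# Share na_dict so both frames end up with the same NA flag columns
X, y, nas = proc_df(train, 'target_column')
X_test, _, nas = proc_df(test, na_dict=nas)
X, y, nas = proc_df(train, 'target_column', na_dict=nas)

model = RandomForestRegressor(n_jobs=4, n_estimators=100)
model.fit(X, y)
model.score(X, y)  # score on the training data itself

prediction = model.predict(X_test)

submission = pd.DataFrame()
submission['id_column'] = test.id
submission['target_column'] = prediction
submission.to_csv('submission.csv', index=False)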
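Before calling predict, it can help to confirm that X and X_test really do have the same columns. The sketch below is one way to check this with plain pandas; the helper column_mismatch and the toy frames are hypothetical and stand in for the proc_df outputs above.

```python
import pandas as pd

def column_mismatch(X, X_test):
    """Hypothetical helper: list columns found in only one of the two frames."""
    only_in_train = [c for c in X.columns if c not in X_test.columns]
    only_in_test = [c for c in X_test.columns if c not in X.columns]
    return only_in_train, only_in_test

# Toy frames standing in for the proc_df outputs.
X = pd.DataFrame({"AGE": [20.0], "EYES": [1.0], "AGE_na": [False]})
X_test = pd.DataFrame({"EYES": [2.0], "AGE": [30.0], "AGE_na": [False]})

only_in_train, only_in_test = column_mismatch(X, X_test)

# With no mismatch, put the test columns in the training order before predicting.
if not only_in_train and not only_in_test:
    X_test = X_test[X.columns]
```

If either list is non-empty, the two frames were processed differently and the na_dict step is the first place to look.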

Arindam · January 2, 2019, 9:15pm

Can you explain why you are using proc_df on the train set twice and only once on the test set? I think using proc_df once is sufficient; correct me if I am wrong, @maciejkpl. Moreover, I tried this in a Kaggle kernel for the bulldozer competition, and when fitting the model with the validation set I got a weird error stating that .cat can only be used with categorical variables, even though I didn't get any error earlier when using apply_cats.

Note: Kaggle has provided a separate validation set for the bulldozer competition, so I think there is no point in splitting the training dataset.
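For anyone hitting the same .cat error: the post doesn't show the code, so this is only a guess at the cause, but pandas raises it whenever .cat is used on a column that is not of category dtype. A minimal sketch with a made-up state column:

```python
import pandas as pd

# Toy validation frame; the "state" column is an illustrative assumption.
valid = pd.DataFrame({"state": ["Ohio", "Texas", "Ohio"]})

# The column still holds plain strings, so the .cat accessor is not available.
try:
    valid["state"].cat.codes
    error_message = None
except AttributeError as err:
    error_message = str(err)

# Once the column is converted to category dtype, .cat works.
valid["state"] = valid["state"].astype("category")
codes = valid["state"].cat.codes.tolist()
```

So if a separately loaded validation frame still has string columns that were never converted to categories, that would be one way to end up with this error.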

kmh5004 · February 25, 2019, 12:03am

I don’t think you need the second use of proc_df on the training set.

YunusDev · February 25, 2019, 8:00am

Thanks for this…

arora_aman · September 2, 2019, 11:58pm

Thanks for this, really helpful…