How to use proc_df on a test set?


For example, on a Kaggle dataset, after using proc_df on the training set, how do I apply proc_df in the same way to the test set?

(Maciej Kedziora) #2

Hi Yunus,

Let's say your training data is called train and your test data is called test.
For your model to work, you need to convert strings into categories in both. To do so:
train_cats(train)
apply_cats(df=test, trn=train)
train_cats creates the categories from train, and apply_cats applies those same categories to test, so the same word is given the same number in both.
For example, if strawberry is 1 in train, it will also be 1 in test.
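
To illustrate why the mapping has to be shared, here is a minimal pandas-only sketch. It does not call fastai, and the column name fruit and its values are made up. It builds the categories from the training column and reuses them for the test column, which is the idea behind apply_cats:

```python
import pandas as pd

# Hypothetical toy data; "fruit" is an invented column name.
train = pd.DataFrame({"fruit": ["strawberry", "apple", "banana", "apple"]})
test = pd.DataFrame({"fruit": ["banana", "strawberry", "kiwi"]})

# Build the categories from the training data only.
train["fruit"] = train["fruit"].astype("category")
train_categories = train["fruit"].cat.categories

# Reuse the training categories for the test column so the codes line up.
test["fruit"] = pd.Categorical(test["fruit"], categories=train_categories)

# Integer codes behind each column; the same word maps to the same code.
train_codes = train["fruit"].cat.codes.tolist()
test_codes = test["fruit"].cat.codes.tolist()
print(train_codes, test_codes)
```

In plain pandas, a value that never appeared in training (kiwi here) gets the code -1, so it is worth checking the test set for unseen categories.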
Next, proc_df:
X, y, nas = proc_df(train, 'target')
X_test, _, nas = proc_df(test, na_dict=nas)
X, y, nas = proc_df(train, 'target', na_dict=nas)

Here the key is na_dict.
Let's say your training set has NAs in column AGE, but your test set has NAs in column EYES.
If you simply use proc_df on both, you will get two different data frames: one will have an additional column AGE_NA and the other EYES_NA.
If you pass na_dict, the AGE_NA column is also created in the second data frame. But now the first data frame has no EYES_NA column, so we run proc_df a third time to pass both NA columns (AGE_NA and EYES_NA) back to train.
As a result, both data frames will have the same additional columns from proc_df's NA handling.
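
The column mismatch can be reproduced without fastai. The sketch below is a simplified stand-in for the missing-value handling, not proc_df itself: the helper fill_na_with_flags is hypothetical, the AGE and EYES values are invented, and the flag columns are named here with an _na suffix. It fills missing numeric values with a median and adds a flag column, either for columns that contain NAs or for columns already listed in the supplied dictionary:

```python
import pandas as pd

def fill_na_with_flags(df, na_dict=None):
    """Hypothetical helper: fill numeric NAs and add <col>_na flag columns.

    Columns already in na_dict are filled with the stored value and always
    get a flag column; other columns get one only if they contain NAs.
    """
    df = df.copy()
    na_dict = dict(na_dict or {})
    for col in list(df.columns):
        if col in na_dict or df[col].isna().any():
            flag = df[col].isna()
            filler = na_dict.get(col, df[col].median())
            df[col] = df[col].fillna(filler)
            df[col + "_na"] = flag
            na_dict[col] = filler
    return df, na_dict

# Invented data: train has an NA in AGE, test has an NA in EYES.
train = pd.DataFrame({"AGE": [20.0, None, 40.0], "EYES": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"AGE": [25.0, 35.0, 45.0], "EYES": [None, 2.0, 2.0]})

# Processed independently, the frames end up with different columns.
X_alone, _ = fill_na_with_flags(train)
X_test_alone, _ = fill_na_with_flags(test)
independent_match = list(X_alone.columns) == list(X_test_alone.columns)

# Three passes with a shared dictionary: train, test, then train again.
_, nas = fill_na_with_flags(train)
X_test, nas = fill_na_with_flags(test, na_dict=nas)
X, nas = fill_na_with_flags(train, na_dict=nas)
shared_match = list(X.columns) == list(X_test.columns)
print(independent_match, shared_match)
```

Comparing the column lists of X and X_test like this before calling predict is a cheap way to catch a mismatch early.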

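The simplest code to get you going:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# Assumes fastai 0.7, where these helpers live in fastai.structured
from fastai.structured import train_cats, apply_cats, proc_df

train = pd.read_csv('…/input/train.csv')
test = pd.read_csv('…/input/test.csv')

# Create categories on train, then apply the same ones to test
train_cats(train)
apply_cats(df=test, trn=train)

# Three passes so both frames end up with the same NA columns
X, y, nas = proc_df(train, 'target_column')
X_test, _, nas = proc_df(test, na_dict=nas)
X, y, nas = proc_df(train, 'target_column', na_dict=nas)

model = RandomForestRegressor(n_jobs=4, n_estimators=100)
model.fit(X, y)
model.score(X, y)

prediction = model.predict(X_test)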

submission = pd.DataFrame()