I am building a solution for the academic success competition in the Kaggle Tabular Playground Series. It is a classification problem whose target variable has 3 classes: Graduate, Dropout, and Enrolled.
I use the fastai TabularPandas object to define my pre-processing steps and to build X_train and y_train, which are the training set and labels, respectively.
After training, the model returns its predictions as integers, for example 0, 1, 0, etc., as illustrated below:
submission = pd.read_csv('submission.csv')
submission.head()
returns
id Target
0 76518 0
1 76519 2
2 76520 2
3 76521 2
4 76522 1
... ... ...
51007 127525 0
51008 127526 0
51009 127527 0
51010 127528 0
51011 127529 0
51012 rows × 2 columns
However, I noticed that the Kaggle submission file expects the actual category names, for example Graduate or Dropout:
submit = pd.read_csv(path/'sample_submission.csv')
submit
returns
id Target
0 76518 Graduate
1 76519 Graduate
2 76520 Graduate
3 76521 Graduate
4 76522 Graduate
... ... ...
51007 127525 Graduate
51008 127526 Graduate
51009 127527 Graduate
51010 127528 Graduate
51011 127529 Graduate
51012 rows × 2 columns
I realized this happens because TabularPandas encodes my target variable as integer category codes when I build it with y_block=CategoryBlock().
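To illustrate why integers show up, here is a minimal sketch of label encoding in plain Python. It assumes the classes are sorted alphabetically before being numbered, which I believe is what the categorization step does but have not confirmed. `encode_labels` is a hypothetical helper written for this post, not a fastai function.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code (hypothetical helper)."""
    # Assumption: the vocabulary is the sorted set of distinct labels.
    vocab = sorted(set(labels))
    index = {label: code for code, label in enumerate(vocab)}
    return vocab, [index[label] for label in labels]


vocab, codes = encode_labels(["Graduate", "Dropout", "Enrolled", "Graduate"])
print(vocab)  # ['Dropout', 'Enrolled', 'Graduate']
print(codes)  # [2, 0, 1, 2]
```

Under that assumption, 0 would stand for Dropout, 1 for Enrolled, and 2 for Graduate.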
How do I decode the predictions so that I can submit the original class names, as in the sample submission?
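The kind of decoding I have in mind could look roughly like the sketch below. It is not tied to fastai: the `vocab` list is hard-coded purely for illustration and its order is an assumption. In practice it would have to be read from the fitted pre-processing object rather than typed in by hand. `decode_predictions` is a hypothetical helper, and the ids and codes are the first five rows of my submission.

```python
import csv
import io

# Assumed class order, for illustration only.
vocab = ["Dropout", "Enrolled", "Graduate"]


def decode_predictions(codes, vocab):
    """Translate integer class codes back into their string labels."""
    return [vocab[int(code)] for code in codes]


ids = [76518, 76519, 76520, 76521, 76522]
codes = [0, 2, 2, 2, 1]
labels = decode_predictions(codes, vocab)

# Write the decoded rows in the id,Target layout of the sample submission.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "Target"])
writer.writerows(zip(ids, labels))
print(buffer.getvalue())
```

On a pandas column the same lookup could presumably be written as `submission['Target'].map(dict(enumerate(vocab)))`, again assuming `vocab` matches the order used during encoding.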
Any help with this would be highly appreciated.
Below is the code for your reference; you can also refer to the actual notebook via the GitHub link here.
import pandas as pd
import xgboost as xgb
from fastai.tabular.all import *

train_df = pd.read_csv(path/'train.csv')
test_df = pd.read_csv(path/'test.csv')
sub_df = pd.read_csv(path/'sample_submission.csv')
original_df = pd.read_csv('data.csv')
cont_names,cat_names = cont_cat_split(train_df, dep_var='Target')
splits = RandomSplitter(valid_pct=0.2)(range_of(train_df))
to = TabularPandas(train_df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=cat_names,
                   cont_names=cont_names,
                   y_names='Target',
                   y_block=CategoryBlock(),
                   splits=splits)
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()
dls = to.dataloaders(bs=64)
test_dl = dls.test_dl(test_df)
xgb_model = xgb.XGBClassifier(n_estimators = 197, max_depth=4, learning_rate=0.1818695751227044, subsample= 0.39774994666482544)
xgb_model = xgb_model.fit(X_train, y_train)
xgb_preds = tensor(xgb_model.predict(test_dl.xs))
xgb_preds_x = tensor(xgb_model.predict(X_test))
test_df['Target'] = xgb_preds
test_df.to_csv('submission.csv', columns=['Target'], index=True)
submission = pd.read_csv('submission.csv')
submission.head()