Return original target names as predictions for tabular classification

SasukeSlime · June 6, 2024, 4:00am

I am building a solution for the academic success Kaggle tabular playground series. The competition is a classification problem, with the target variable having 3 classes: Graduate, Dropout, and Enrolled.

I use the fastai TabularPandas object to define my pre-processing steps and to build X_train and y_train, which are the training features and labels, respectively.

After training, the model returns its predictions as integer codes (for example 0, 1, 2), as illustrated below:
submission = pd.read_csv('submission.csv')
submission.head()
returns

	id	Target
0	76518	0
1	76519	2
2	76520	2
3	76521	2
4	76522	1
...	...	...
51007	127525	0
51008	127526	0
51009	127527	0
51010	127528	0
51011	127529	0
51012 rows × 2 columns

However, I noticed that the Kaggle submission file expects the actual category names, for example Graduate or Dropout:
submit = pd.read_csv(path/'sample_submission.csv')

submit returns

	id	Target
0	76518	Graduate
1	76519	Graduate
2	76520	Graduate
3	76521	Graduate
4	76522	Graduate
...	...	...
51007	127525	Graduate
51008	127526	Graduate
51009	127527	Graduate
51010	127528	Graduate
51011	127529	Graduate
51012 rows × 2 columns

I realized this happens because TabularPandas encodes my target variable as integer category codes when it is built with y_block=CategoryBlock().
How do I decode the predictions so that I can submit the original class names, as in the sample submission?

Any help with this would be highly appreciated.

Below is the code for your reference, or you can refer to the actual notebook via the GitHub link here.

train_df = pd.read_csv(path/'train.csv')
test_df = pd.read_csv(path/'test.csv')
sub_df = pd.read_csv(path/'sample_submission.csv')
original_df = pd.read_csv('data.csv')
cont_names,cat_names = cont_cat_split(train_df, dep_var='Target')
splits = RandomSplitter(valid_pct=0.2)(range_of(train_df))
to = TabularPandas(train_df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = cat_names,
                   cont_names = cont_names,
                   y_names='Target',
                   y_block=CategoryBlock(),
                   splits=splits)

X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()

dls = to.dataloaders(bs=64)
test_dl = dls.test_dl(test_df)
xgb_model = xgb.XGBClassifier(n_estimators = 197, max_depth=4, learning_rate=0.1818695751227044, subsample= 0.39774994666482544)
xgb_model = xgb_model.fit(X_train, y_train)

xgb_preds = tensor(xgb_model.predict(test_dl.xs))

xgb_preds_x = tensor(xgb_model.predict(X_test))
test_df['Target'] = xgb_preds.numpy()
test_df.to_csv('submission.csv', columns=['Target'], index=True)

submission = pd.read_csv('submission.csv')
submission.head()

vbakshi · June 6, 2024, 3:57pm

You might try something like this where you use dls.vocab to get the class names using the predictions as the index:

# get predictions
probs,_,idxs = learn.get_preds(dl=tst_dl, with_decoded=True)

# get mapping from prediction (idxs) to class names in vocab
mapping = dict(enumerate(dls.vocab))

# map from predicted idx to class name
results = pd.Series(idxs.numpy(), name="idxs").map(mapping)

# export to CSV
ss['label'] = results
ss.to_csv('subm.csv', index=False)
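
For predictions that do not come from a fastai Learner, the same index-to-name lookup can be done with plain pandas. This is only a minimal sketch: `vocab` and `pred_idxs` are made-up stand-ins for `dls.vocab` and the model's integer predictions, and the class order shown is an assumption, not something taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: in the thread these would come from dls.vocab
# and from the model's integer predictions.
vocab = ["Dropout", "Enrolled", "Graduate"]
pred_idxs = np.array([0, 2, 2, 2, 1])

# Map each integer code to the class name at that position in the vocab.
mapping = dict(enumerate(vocab))
labels = pd.Series(pred_idxs, name="Target").map(mapping)

print(labels.tolist())
```

The only requirement is that the vocab used for decoding is the same one that encoded the target during training; otherwise the codes will map to the wrong names.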

SasukeSlime · June 7, 2024, 4:23am

Thanks a lot for the reply @vbakshi, I truly appreciate it.

So far, it seems to be working fine for my neural net preds. I am now trying to modify it to work with predictions from other types of models, basically by rewriting just the first line, where you get the preds and indexes.

I am going to experiment with this thoroughly, then provide feedback.

SasukeSlime · July 1, 2024, 7:59am

I ended up with two solutions for this, and both seem to produce the same result.

Solution 1
I modified the above code shared by @vbakshi to end up with

mapping = dict(enumerate(dls.vocab))
xgb_predicted_labels = [mapping[value.item()] for value in xgb_preds]
submit = pd.read_csv(path/'sample_submission.csv')
submit.Target = xgb_predicted_labels
submit.to_csv('submission.csv',index=False)
submit

Solution 2

xgb_names = to.vocab[xgb_preds]
submit = pd.read_csv(path/'sample_submission.csv')
submit.Target = xgb_names
submit.to_csv('submission.csv',index=False)
submit
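
To check that the two approaches agree, here is a rough sketch using made-up data. The `vocab`, `preds`, and `submit` values below are hypothetical stand-ins for `to.vocab`, `xgb_preds`, and the sample submission; the vocab is assumed to be an array-like that supports integer-array indexing, as a NumPy array does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for to.vocab, xgb_preds and the sample submission.
vocab = np.array(["Dropout", "Enrolled", "Graduate"])
preds = np.array([0, 2, 2, 2, 1])
submit = pd.DataFrame({"id": [76518, 76519, 76520, 76521, 76522],
                       "Target": ["Graduate"] * 5})

# Solution 1 style: dictionary lookup, one prediction at a time.
mapping = dict(enumerate(vocab))
labels_dict = [mapping[int(p)] for p in preds]

# Solution 2 style: index the vocab array with the whole prediction array.
labels_index = vocab[preds]

# Both routes should give identical labels.
assert labels_dict == labels_index.tolist()

submit["Target"] = labels_index
print(submit)
```

The array-indexing version is shorter and avoids the Python-level loop, which is why it may be the more convenient of the two for large test sets.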