Predictions to df/csv

I found a tabular data problem and used the notebook from lesson 4 as a template. After training my model and successfully making a single prediction as well as predictions for 200K rows, I'm having a hard time figuring out how to put the results into a CSV for submission to Kaggle.

Data

df = pd.read_csv(path/'train.csv') # used for training/validation
df_test = pd.read_csv(path/'test.csv') # to produce submission file to kaggle

Single prediction

row = df_test.iloc[0]
learn.predict(row) 
#yields (Category 0, tensor(0), tensor([0.9457, 0.0543]))
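For context, that tuple unpacks into the predicted category, its class index, and the per-class probabilities. A sketch with plain-Python stand-ins for the values shown above (the real call returns fastai Category/tensor objects):

```python
# Stand-in values mirroring the learn.predict output above (hypothetical,
# not actual fastai objects).
pred_class, pred_idx, probs = 'Category 0', 0, [0.9457, 0.0543]

# The probability assigned to the predicted class:
print(probs[pred_idx])  # 0.9457
```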

All predictions

test_data = (TabularList.from_df(df_test, path=path, cont_names=cont_names, procs=procs)
                           .split_none()
                           .label_const('target')
                           .databunch())

preds = learn.get_preds(test_data)

#yields 
[tensor([[0.9893, 0.0107],
     [0.6970, 0.3030],
     [0.9530, 0.0470],
     ...,
     [0.8930, 0.1070],
     [0.9864, 0.0136],
     [0.9965, 0.0035]]), tensor([0, 0, 0,  ..., 0, 0, 0])]
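For reference, the second tensor here is just the row-wise argmax of the first (the index of the highest probability in each row). A minimal pure-Python sketch using the first few example values above:

```python
# Example probabilities like the first tensor returned by get_preds,
# written as plain lists for illustration.
probs = [[0.9893, 0.0107],
         [0.6970, 0.3030],
         [0.9530, 0.0470]]

# The predicted class for each row is the index of the max probability,
# which is what the second tensor contains.
labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
print(labels)  # [0, 0, 0]
```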

The submission CSV should have the form

ID_code target
test_0 0
test_1 0
test_2 1
test_3 0
...

What do I do from here?
Thank you!


In case you need to submit labels.

preds = learn.get_preds(test_data)[1].numpy()
final_df = pd.DataFrame({'ID_code': df_test['ID_code'], 'target': preds})
final_df.to_csv('submission.csv', header=True, index=False)

Thanks a lot @rohit_gr, that looks great! I got another bug though: it seems get_preds drops 20% of the data (equal to my validation split) from the TEST dataset. Any idea why? Full notebook below…

Tabular models

from fastai.tabular import *
from pathlib import Path
path = Path('/home/jupyter/.fastai/data/santander')
df = pd.read_csv(path/'train.csv')
df_test = pd.read_csv(path/'test.csv')
dep_var = 'target'
cont_names = df.columns.values[2:]

procs = [FillMissing, Normalize]
data = (TabularList.from_df(df, path=path, cont_names=cont_names, procs=procs)
                           .split_by_rand_pct(0.2)
                           .label_from_df(cols=dep_var)
                           .databunch())
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)
lr_find(learn)
learn.recorder.plot()
learn.fit(10, 1e-1)

Submission


test_data = (TabularList.from_df(df_test, path=path, cont_names=cont_names, procs=procs)
                           .split_none()
                           .label_const('target')
                           .databunch())

preds = learn.get_preds(test_data)[1].numpy()


final_df = pd.DataFrame({'ID_code': df_test['ID_code'], 'target': preds})
final_df.to_csv('/home/jupyter/transfer/submission.csv', header=True, index=False)
---------------------------------------------------------------------------

ValueError: array length 160000 does not match index length 200000 (!!!)
df_test['ID_code'].size, preds.size
(200000, 160000)
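As a sanity check, 160,000 is exactly what's left of 200,000 rows after a 0.2 validation split, matching the split_by_rand_pct(0.2) used for the training data:

```python
# The mismatch lines up with the 0.2 validation split used when building `data`,
# hinting that predictions are coming from the training set, not the test set.
n_rows = 200_000
valid_pct = 0.2
n_train = int(n_rows * (1 - valid_pct))
print(n_train)  # 160000
```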

Can you print test_data and share the result?


Print of test_data summary

TabularDataBunch;
Train: LabelList (200000 items)
Valid: LabelList (0 items)
Test: None

That looks fine(?)
Though after putting it through learn.get_preds, only 160,000 records remain.


Hi,

I think this is because get_preds doesn't take a dataset as its input parameter, but rather the type of prediction.
https://docs.fast.ai/basic_train.html#Learner.get_preds — by default it is get_preds(ds_type:DatasetType=<DatasetType.Valid: 2>).
So I think it's returning predictions for your train dataset: DatasetType.Train equals 1 in the enum, and a non-empty dataframe is also considered as 1. (The length of your train dataset is 160,000.)
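To illustrate the enum values, here is a stand-in sketch of fastai v1's DatasetType (values as shown in the docs signature; this mimics the enum rather than importing it from fastai):

```python
from enum import Enum

# Sketch of fastai v1's DatasetType; the functional Enum API numbers
# members 1, 2, 3, ... in order.
DatasetType = Enum('DatasetType', 'Train Valid Test Single Fix')

print(DatasetType.Train.value)  # 1
print(DatasetType.Valid.value)  # 2  <- the default for get_preds
print(DatasetType.Test.value)   # 3  <- what you want for a test set
```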

You should do something like this (or use the add_test() function):

# TEST DATA
learn.export(<PATH AND NAME>)
test_data = (TabularList.from_df(df_test, path=path, cont_names=cont_names, procs=procs)
                           .split_none()
                           .label_const('target')
                           .databunch())

test_learner = load_learner(<PATH>,fname=<NAME>, test=test_data)
preds = test_learner.get_preds(ds_type=DatasetType.Test)[1].numpy()

final_df = pd.DataFrame({'ID_code': df_test['ID_code'], 'target': preds})
final_df.to_csv('/home/jupyter/transfer/submission.csv', header=True, index=False)

@klemenka is right.
Or you can do something like:

data = (TabularList.from_df(df, path=path, cont_names=cont_names, procs=procs)
        .split_by_rand_pct(0.2)
        .label_from_df(cols=dep_var)
        .add_test(TabularList.from_df(df_test, cont_names=cont_names, procs=procs))
        .databunch())

######
#  Create learner and train
######

preds = learn.get_preds(ds_type=DatasetType.Test)[1].numpy()
final_df = pd.DataFrame({'ID_code': df_test['ID_code'], 'target': preds})
final_df.to_csv('submission.csv', header=True, index=False)

Perfect, thank you so much! When using load_learner I just had to pass test=test_data.x, as it requires an ItemList.


Beautiful, I like this approach to keep it more tidy/compact!
Thank you both @rohit_gr and @klemenka for your guidance! I'm slowly getting familiar with the documentation and it was good to test both methods.


Hello,
I am just working my way through the tabular part of course 1 and have been struggling to output predictions to a CSV. I would like to export all predictions from my test set. I see this as a critical next step, as I have tabular scenarios I would like to explore for my work and would like to attempt a Kaggle challenge. I am sure it is a simple fix; however, I have been struggling with this problem for the past few weeks.

Tabular models

from fastai.tabular import *

Tabular data should be in a Pandas DataFrame.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

len(df)

dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]

test = TabularList.from_df(df.iloc[700:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)

data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(list(range(800,1000)))
.label_from_df(cols=dep_var)
.add_test(test)
.databunch())

data.show_batch(rows=10)

learn = tabular_learner(data, layers=[200,100], metrics=accuracy)

learn.fit(1, 1e-2)

Inference

row = df.iloc[0]

learn.predict(row)

len(test)

preds = learn.get_preds(ds_type=DatasetType)[1].numpy()
final_df = pd.DataFrame({'Education': test['education'], 'target': preds})
final_df.to_csv('submission.csv', header=True, index=False)

This code gives the following error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices.

Any help would be appreciated.

Thanks - Andrew

Welcome! You should instead do:

y, _ = learn.get_preds(DatasetType.Test)
y = torch.argmax(y, dim=1)
preds = [learn.data.classes[int(x)] for x in y]

Then you can chuck all those values to a csv.


Thank you very much @muellerzr I really appreciate it!
Would you mind sharing the last line of code to bring ‘preds’ into a csv?

Sure! To follow what you were doing, first we need to make test into a dataframe itself,

testdf = df.iloc[800:1000].copy()
final_df = pd.DataFrame({'Education': testdf['education'], 'target': preds})
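To write it out, the same to_csv call from earlier in the thread works. A self-contained sketch with hypothetical placeholder values standing in for testdf and preds:

```python
import pandas as pd

# Hypothetical stand-ins for `testdf` and `preds` from the posts above.
testdf = pd.DataFrame({'education': ['Bachelors', 'HS-grad']})
preds = ['>=50k', '<50k']

final_df = pd.DataFrame({'Education': testdf['education'], 'target': preds})
final_df.to_csv('submission.csv', header=True, index=False)

# to_csv with no path returns the CSV as a string, handy for a quick check:
csv_text = final_df.to_csv(index=False)
print(csv_text)
```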

If you’re wanting to look at those distributions, there’s a new widget coming to the library shortly (waiting on the PR to finish getting approved) to plot that :wink:


It worked!!! thank you once again @muellerzr!!

If my data is images instead of tabular data, where I have separate folders for train and test and separate CSVs for train and test, how should I predict to a CSV file?