Tabular Data: Evaluate predictions for a pre-split dataset

Peete · October 4, 2020, 11:03am

Given a dataset that is already split into training and test sets, I am wondering how to run predictions on the test set in fastai so that I can compute MAE and RMSE values.

The following example is taken from fastai and slightly modified with sklearn's train_test_split to simulate this situation.

import numpy as np
from sklearn.model_selection import train_test_split
from fastai.tabular.all import *
import pandas as pd

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

train, test = train_test_split(df, test_size=0.20, random_state=42)

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(train, path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                 y_names="salary")
learn = tabular_learner(dls)


learn.fit_one_cycle(5)

epoch   train_loss  valid_loss  time
0   0.378432    0.356029    00:05
1   0.369692    0.358837    00:05
2   0.355757    0.348524    00:05
3   0.342714    0.348011    00:05
4   0.334072    0.346690    00:05


learn.unfreeze()
learn.fit_one_cycle(10, lr_max=slice(10e-4, 10e-3))  # fastai v2 names this argument lr_max (v1 used max_lr)

epoch   train_loss  valid_loss  time
0   0.343953    0.350457    00:05
1   0.349379    0.353308    00:04
2   0.360508    0.352564    00:04
3   0.338458    0.351742    00:05
4   0.334585    0.352128    00:05
5   0.342312    0.351003    00:04
6   0.329152    0.350455    00:05
7   0.334460    0.351833    00:05
8   0.328608    0.351415    00:05
9   0.333205    0.352079    00:04

Now, how can I apply the trained learner to my test set to compute my metrics? Something like the following does not work for me:

learn.predict(test)

Here I get the following error: AttributeError: 'DataFrame' object has no attribute 'to_frame'

Thanks for your help in advance!

Peete · October 12, 2020, 7:04pm

I ended up writing a simple for-loop that predicts one row at a time.

Of course this is far from efficient, but it solved my issue. If you have any suggestions for avoiding the slow for-loop, feel free to comment below.

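predicted = []  # predictions, one per test row
real = []       # ground-truth salary values
for elem in range(len(test)):
    # learn.predict expects a single row (a pandas Series), not a whole DataFrame
    row, clas, probs = learn.predict(test.iloc[elem])
    predicted.append(row["salary"].iloc[-1])
    real.append(test["salary"].iloc[elem])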
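
Once `predicted` and `real` are filled, the metrics can be computed in one step. The sketch below is only an illustration: the helper name `regression_metrics` and the sample numbers are made up, and it assumes both lists hold numeric values. For a categorical target such as salary, the values would first have to be encoded as numbers, and accuracy is usually the more natural metric.

```python
import numpy as np

def regression_metrics(real, predicted):
    """Return (MAE, RMSE) for two equally long sequences of numbers."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = predicted - real
    mae = np.abs(errors).mean()            # mean absolute error
    rmse = np.sqrt((errors ** 2).mean())   # root mean squared error
    return mae, rmse

# Made-up values standing in for the `real` and `predicted` lists
mae, rmse = regression_metrics([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
```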

muellerzr · October 12, 2020, 7:42pm

You should use the test_dl approach shown in the documentation and then post-process the output to turn the probabilities into your classes: https://docs.fast.ai/tutorial.tabular

willtonkin · March 21, 2023, 7:54pm

Updated URL: fastai - Tabular training