I was playing around with the example for tabular data (adult.csv) and not getting the results I would expect… so maybe somebody can sanity check what I’m doing here?
First I trained the model as described in the book:
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(3)
Seemed to go ok, so now I want to do some predictions on new data. Looking at the examples in the book (and docs) I was a bit confused because they appeared to be doing predictions on the train/validation set (instead of on new, unseen data):
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names,
    cont_names=cont_names,
    y_names="salary", valid_idx=list(range(800,1000)), bs=64)
learn = tabular_learner(dls)
row, clas, probs = learn.predict(df.iloc[0])
So I decided to make some fake new data by copying the first two lines of adult.csv into a new file ‘test.csv’, removing the salary column (as that is what we are trying to predict) and editing the rows with fake values, e.g.:
test.csv:
age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
48, Private,101320, Masters,14.0, Single,Exec-managerial, Unmarried, White, Male,0,0,40, United-States
22, Private,236746, HS-Grad,8.0, Single,Transport-moving, Unmarried, White, Male,0,0,40, United-States
Then I loaded this into a Pandas dataframe and used it for predictions:
dft = pd.read_csv(path/'test.csv', low_memory=False)
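One sanity check I could run on hand-edited rows like these is whether every category string actually occurs in the training data. As far as I understand, Categorify treats a value it has never seen as a generic "#na#" category, so a typo or a missing leading space would silently change the input. A minimal sketch of that check follows; the helper name `unseen_categories` and the sample training values are my own placeholders, not part of fastai or read from adult.csv:

```python
def unseen_categories(train_values, new_values):
    """Return the values in new_values that never appear in train_values.

    Comparison is an exact string match, so ' Private' and 'Private' differ.
    """
    known = set(train_values)
    return sorted(set(new_values) - known)


# Hypothetical training values for one column (placeholders, not read from adult.csv).
train_occupation = [" Exec-managerial", " Transport-moving", " Sales"]
# Occupation values as typed in test.csv (note: no leading space).
new_occupation = ["Exec-managerial", "Transport-moving"]

# Any value listed here would not match a known category exactly.
print(unseen_categories(train_occupation, new_occupation))
```

With the real frames this could be called once per column, e.g. `unseen_categories(df[c], dft[c])` for each `c` in `cat_names`.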
row0:
row, clas, probs = learn.predict(dft.iloc[0])
row.show()
predicted salary <50k
row1:
row, clas, probs = learn.predict(dft.iloc[1])
row.show()
predicted salary <50k
Q: Am I on the right track here in the way I’m doing inference, or am I missing something?
Q: I would have expected row0 to predict salary >=50k but it didn’t … any ideas why? Just a one-off bad prediction?
Thanks