Question about Tabular prediction

mlabs · December 2, 2020, 8:49pm

I was playing around with the example for tabular data (adult.csv) and not getting the results I would expect… so maybe somebody can sanity check what I’m doing here?
First I trained the model as described in the book:

   from fastai.tabular.all import *

   path = untar_data(URLs.ADULT_SAMPLE)

   dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
      cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
      cont_names = ['age', 'fnlwgt', 'education-num'],
      procs = [Categorify, FillMissing, Normalize])

   learn = tabular_learner(dls, metrics=accuracy)

   learn.fit_one_cycle(3)

Seemed to go ok, so now I want to do some predictions on new data. Looking at the examples in the book (and docs) I was a bit confused because they appeared to be doing predictions on the train/validation set (instead of on new, unseen data):

   path = untar_data(URLs.ADULT_SAMPLE)
   df = pd.read_csv(path/'adult.csv')
   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
   cont_names = ['age', 'fnlwgt', 'education-num']
   procs = [Categorify, FillMissing, Normalize]

   dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names,
      cont_names=cont_names,
      y_names="salary", valid_idx=list(range(800,1000)), bs=64)
   learn = tabular_learner(dls)
   # predict on a single row taken from the same dataframe used for training
   row, clas, probs = learn.predict(df.iloc[0])

So I decided to make some fake new data by copying the first two lines of adult.csv into a new file ‘test.csv’, removing the salary column (as that is what we are trying to predict) and editing the rows with fake values, e.g.:

test.csv:

age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
48, Private,101320, Masters,14.0, Single,Exec-managerial, Unmarried, White, Male,0,0,40, United-States
22, Private,236746, HS-Grad,8.0, Single,Transport-moving, Unmarried, White, Male,0,0,40, United-States

Then I loaded this into a Pandas dataframe and used it for predictions:

   dft = pd.read_csv(path/'test.csv', low_memory=False)

row0:

  row, clas, probs = learn.predict(dft.iloc[0])
  row.show()

predicted salary <50k

row1:

 row, clas, probs = learn.predict(dft.iloc[1])
 row.show()

predicted salary <50k

Q: Am I on the right track here in the way I’m doing inference or am I missing something?
Q: I would have expected row0 to predict salary >=50k but it didn’t … any ideas why? Just a one-off bad prediction?

Thanks
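
One thing that may be worth ruling out before blaming the model (an assumption on my part, not something confirmed in this thread): categorical values in hand-edited test rows have to match the training strings exactly, including capitalisation and leading spaces. As far as I understand, Categorify treats a value it never saw during training as a missing category, which can quietly change the prediction. The sketch below uses a hypothetical helper, `unseen_categories`, and made-up toy values; with real data you would pass the training DataFrame, `dft`, and the `cat_names` list instead.

```python
def unseen_categories(train, new, cat_names):
    """Return {column: values found in `new` that never occur in `train`}.

    `train` and `new` can be pandas DataFrames or plain dicts of lists;
    only `obj[col]` lookups and iteration over the result are used.
    """
    report = {}
    for col in cat_names:
        seen = set(train[col])
        # key=str keeps sorting safe if a column mixes strings and NaN
        missing = sorted({v for v in new[col] if v not in seen}, key=str)
        if missing:
            report[col] = missing
    return report


# Toy values for illustration only (not copied from adult.csv).
train = {"education": [" HS-grad", " Masters"], "occupation": [" Exec-managerial"]}
new = {"education": [" HS-Grad", " Masters"], "occupation": ["Exec-managerial"]}

report = unseen_categories(train, new, ["education", "occupation"])
print(report)  # {'education': [' HS-Grad'], 'occupation': ['Exec-managerial']}
```

An empty result means every categorical value in the new rows was also present in the training data.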

Aleksandr · July 15, 2021, 6:42pm

Hi, I also tried using the predict function after completing the lesson on bulldozers from the 2020 class. The predictions are quite far off from the RF results, e.g. 10k vs 40k sale price, 50k vs 110k, etc.

Could someone from the senior members or the team explain how to use .predict properly?

Thanks a lot for sharing your knowledge!

muellerzr · July 16, 2021, 12:24pm

I show it in my tabular lesson here: Lesson 2 - Tabular Regression and Permutation Importance | walkwithfastai

You can still do this via .predict as well, by passing in a single unpreprocessed row.

Aleksandr · July 18, 2021, 8:21am

Thank you, that’s exactly what I planned next: to work through all your tabular lessons. Glad I found it.