Single Prediction on new data from Tabular Data Learner

dom.raute · October 31, 2018, 11:59am

Hi there!

I went ahead and managed to get through the whole pipeline of DataFrame -> TabularDataBunch -> TabularDataModel unharmed, but now I’m a bit stuck (partly because 1.0 differs considerably from the 0.7 used in the MOOCs).

I’m basically at the point of

learn = get_tabular_learner(data, layers=[200,100], metrics=accuracy)
learn.fit(3, 1e-2)

…which nicely trains me a model where I’m happy with over/underfitting and accuracy.

And now I’m stuck. The docs basically end there, but how do I use the model to predict new data? learn.predict() gives me an error (AttributeError: 'Learner' object has no attribute 'predict') and even if it worked, it only works with the training and validation set.

I’d like to get a prediction from the model for brand new input. I’ve read some stuff on the forums already that seems applicable (especially Single Prediction with NLP example), but as far as I understand, the TabularDataBunch and -Learner do some “magic” under the hood, such as looking up embeddings and doing some kind of normalization.

If I have a new input event now that wasn’t part of any of the original [test,train,validation] sets, how can I apply the same transformations to the raw data and get a single or batch prediction from my newly trained model?

Thanks in advance,

Dominik

sgugger · October 31, 2018, 1:18pm

For now, you’ll have to work a bit yourself and adapt the script you mentioned. We are in the process of simplifying the way to make predictions with a trained model. It’s done for vision, we’ll move to text now, then to tabular and colab.

dom.raute · October 31, 2018, 2:27pm

Hey @sgugger, thanks for the fast reply!

Good to know you’re working on it, but for right now, I fear I don’t even know where to start “adapting the script”.

I also haven’t found any usable results for TabularData* on Github where I could extract code or knowledge from (really nobody used Fast.AI v1 in prod so far?)

If you or someone else could be so kind to point me to anyone who has done it by hand — or even just enlighten me about the kind of transformations that need to be applied to input data and where the lookup tables / normalizations are stored, that would be pretty awesome.

I really love all the magic that Fast.AI does, but currently it’s a black box where I can get stuff in, but not out: that doesn’t feel very v1.0 to me.

sgugger · October 31, 2018, 2:34pm

Well you’d have to apply the set of transforms to your new data:

for tfm in train_ds.tfms: df_tst = tfm(df_tst, is_test=True)

Then you have to transform the categorical variables into codes:

cats = np.stack([c.cat.codes.values for n,c in df_tst[cat_names].items()], 1) + 1

and normalize the continuous variables:

conts = np.stack([c.astype('float32').values for n,c in df_tst[cont_names].items()], 1)
means, stds = train_ds.stats 
conts = (conts - means[None]) / (stds[None]+1e-7)

And finally, you can run the model on [tensor(cats),tensor(conts)] (or some batches of it if you have a lot of things to predict).
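The steps above can be sketched end to end with plain pandas and NumPy. This is only an illustration under stated assumptions: the column names, the stored categories, and the means and standard deviations are hypothetical stand-ins for whatever the training set produced, and the fastai transforms themselves are not called.

```python
import numpy as np
import pandas as pd

# Hypothetical training-time state, for illustration only.
cat_names = ["workclass"]
cont_names = ["age", "hours"]
train_categories = {"workclass": ["Private", "State-gov"]}
train_means = np.array([40.0, 38.0], dtype="float32")
train_stds = np.array([10.0, 5.0], dtype="float32")

def encode_new_rows(df_new):
    """Build (cats, conts) arrays using the training-time categories and statistics."""
    df_new = df_new.copy()
    for name in cat_names:
        df_new[name] = pd.Categorical(
            df_new[name], categories=train_categories[name], ordered=True
        )
    # Values not seen in training get code -1, so adding 1 maps them to index 0.
    cats = np.stack([df_new[n].cat.codes.values for n in cat_names], 1) + 1
    conts = np.stack([df_new[n].astype("float32").values for n in cont_names], 1)
    conts = (conts - train_means[None]) / (train_stds[None] + 1e-7)
    return cats, conts

new_rows = pd.DataFrame(
    {"workclass": ["State-gov", "Never-seen"], "age": [50, 30], "hours": [38, 43]}
)
cats, conts = encode_new_rows(new_rows)
```

The important detail is that the categories and the statistics come from the training data, never from the new rows themselves.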

dom.raute · October 31, 2018, 2:40pm

Hey @sgugger,

THANKS so much, that’s exactly what I was looking for, I guess I can take it from here

kardon · November 1, 2018, 12:25pm

I still can’t manage to get this working.
I’m trying the following:

1.
model.cpu()
model.eval()

because I use batch normalization

2.
for tfm in train_ds.tfms: tfm(df_tst, test=True)

tfm seems to be applied without return value and I changed is_test=True to test=True

3. 
cats = []

because I only use continuous variables for the inputs

4. 
conts = np.stack([c.astype('float32').values for n,c in df_tst[cont_names].items()], 1)
means, stds = train_ds.stats 
conts = (conts - means[None]) / (stds[None]+1e-7)

without change.

5.
pred = model(torch.tensor(cats), torch.tensor(conts))

I don’t get any error, but the predictions are wrong (I confirmed it by comparing with the values of the valid_ds with get_preds).
Am I missing something (like model.eval())?

sgugger · November 1, 2018, 1:34pm

What is your loss function? Depending on it, get_preds automatically adds an activation to return you the probabilities, and maybe that’s what is missing.

kardon · November 1, 2018, 1:56pm

You are right, I use cross_entropy so it might automatically add LogSoftmax.
But I just use argmax to get the maximum output neuron anyways, so it should make no difference.

edit: I found my mistake, I didn’t realize that train_ds.cont_names is reordered.
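The argmax remark above can be checked numerically. A minimal sketch, assuming made-up logits and a hand-written log-softmax in NumPy (not fastai's or PyTorch's implementation): because log-softmax only subtracts a per-row constant, the position of the largest entry does not move.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax along axis 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Made-up raw model outputs for two rows and three classes.
logits = np.array([[2.0, -1.0, 0.5], [0.1, 0.3, -2.0]])

raw_classes = logits.argmax(axis=1)
activated_classes = log_softmax(logits).argmax(axis=1)
same = bool((raw_classes == activated_classes).all())
```

So a missing activation changes the reported scores but not the predicted class, which is consistent with the real cause turning out to be the column order.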

ricpruss · November 8, 2018, 10:14pm

Kardon, how did you handle the cont_names/cat_names reordering, given that it is not actually stored in the model? In fact every time I do not get_learner the order is different.

@sgugger if you rethink the data API, it would be awesome if the data transformation were separate, something like modern pandas method chaining: https://tomaugspurger.github.io/method-chaining
This way it can cleanly be reproduced for prediction.

sgugger · November 9, 2018, 1:12am

You can already use any pandas preprocessing you want since you pass a dataframe in the end

ricpruss · November 11, 2018, 9:43pm

Just leaving a note for the next person who finds the thread. You need to set the cont names explicitly to be able to reuse the model; otherwise the order it uses is arbitrary, because the default list is built from a set. So do something like:
cont = list(set(train)-set(cat_names)-{dep_var})
cont.sort()
and pass that to get_tabular_learner so the order is explicit.
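A minimal sketch of why the sort matters, using hypothetical column names. Iterating over a set of strings has no guaranteed order, and with Python's string hash randomization it can differ from one interpreter run to the next, so the list is sorted before it is used.

```python
# Hypothetical column names, for illustration only.
columns = ["age", "workclass", "hours", "salary"]
cat_names = ["workclass"]
dep_var = "salary"

# list(set(...)) alone would give an arbitrary order; sorted() makes it reproducible.
cont_names = sorted(set(columns) - set(cat_names) - {dep_var})
```

Saving this list next to the model lets the prediction code select the continuous columns in exactly the training order.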

ikeaveiro · November 12, 2018, 12:46pm

When I run the following …

for tfm in data_df.train_ds.tfms:
    df_tst = tfm(df_tst, test=True)

I get the following error:

TypeError Traceback (most recent call last)
in
1 df_tst = test_df.copy()
2 for tfm in data_df.train_ds.tfms:
----> 3 df_tst = tfm(df_tst, test=True)

~/notebooks/aveiro/conda/exit/envs/py36/lib/python3.6/site-packages/fastai/tabular/transform.py in __call__(self, df, test)
13 "Apply the correct function to df depending on test."
14 func = self.apply_test if test else self.apply_train
---> 15 func(df)
16
17 def apply_train(self, df:DataFrame):

~/notebooks/aveiro/conda/exit/envs/py36/lib/python3.6/site-packages/fastai/tabular/transform.py in apply_test(self, df)
33 def apply_test(self, df:DataFrame):
34 for n in self.cat_names:
---> 35 df[n] = pd.Categorical(df[n], categories=self.categories[n], ordered=True)
36
37 FillStrategy = IntEnum('FillStrategy', 'MEDIAN COMMON CONSTANT')

TypeError: 'NoneType' object is not subscriptable

The model trains well and I am able to look at the results for the test set. The only challenge has been to reapply my two transformations FillMissing and Categorify.

Am I missing something in my transformation? The output of my first transformation comes back empty.
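One possible reading of the traceback, in line with the earlier observation in this thread that the transform is applied without a return value: the call on line 15 runs func(df) and returns nothing, so the assignment replaces df_tst with None and the next transform fails. The class below is a hypothetical stand-in that mirrors the shape of that method, not the fastai code itself.

```python
import pandas as pd

class InPlaceTransform:
    """Hypothetical stand-in mirroring the __call__ shown in the traceback."""

    def __call__(self, df, test=False):
        func = self.apply_test if test else self.apply_train
        func(df)  # no return statement, so the call evaluates to None

    def apply_train(self, df):
        df["x"] = df["x"].fillna(0)

    def apply_test(self, df):
        df["x"] = df["x"].fillna(0)

df_tst = pd.DataFrame({"x": [1.0, None]})
tfm = InPlaceTransform()
result = tfm(df_tst, test=True)  # df_tst is modified in place; result is None
```

Under that assumption, calling the transform without assigning its result keeps the dataframe intact.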

ricpruss · November 13, 2018, 11:39pm

So there is a check-in that just went in from sgugger with a totally new interface for data called data blocks, and it factors all of this out into a beautiful method chain. The documentation is not updated yet, but it has tests which you can read; it is pretty obvious how to use, and doing this with that interface is trivial.

As to what your problem is above: I did not do it like that. I used TabularDataset.from_dataframe and passed the pieces to it.

ikeaveiro · November 14, 2018, 4:04pm

Thank you for the input. I just saw that there is a new tabular notebook on GitHub (under examples). It is definitely helpful. I found a way around my issue: I saved all the pandas categorical transformations, then loaded them and applied them to the test set. It worked, but it took a few lines of code to get everything in shape.

aleksod · November 14, 2018, 5:09pm

Can you please share the code? It would be very helpful.

ikeaveiro · November 14, 2018, 5:24pm

I use the following to save/load the important info I needed:

import pickle

import pandas as pd

if TRAINING:
    encoders = {}
    for cat in cat_names:
        print(' Encoding {:}'.format(cat))

        # convert to string
        df[cat] = df[cat].astype(str)

        # get categorical uniques for encoding
        categories = df[cat].unique()
        df[cat] = pd.Categorical(df[cat].astype(str), categories=categories)

        encoders[cat] = categories

    # save encoders
    with open('encoders_{:}.pickle'.format(FILENAME_SUBSCRIPT), 'wb') as handle:
        pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # categoricals
    cat_sz = [(c, df[c].nunique()+1) for c in cat_names]
    emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

    # save embeddings info
    with open('cat_sz_{:}.pickle'.format(FILENAME_SUBSCRIPT), 'wb') as handle:
        pickle.dump(cat_sz, handle, protocol=pickle.HIGHEST_PROTOCOL)

    with open('emb_szs_{:}.pickle'.format(FILENAME_SUBSCRIPT), 'wb') as handle:
        pickle.dump(emb_szs, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # save output range
    y_range = (df[dep_var].min(), df[dep_var].max())
    with open('yrange_{:}.pickle'.format(FILENAME_SUBSCRIPT), 'wb') as handle:
        pickle.dump(y_range, handle, protocol=pickle.HIGHEST_PROTOCOL)

else:
    # load encoders
    with open('encoders_{:}.pickle'.format(FILENAME_SUBSCRIPT), 'rb') as handle:
        encoders = pickle.load(handle)

    for cat in cat_names:
        print(' Encoding {:}'.format(cat))
        categories = encoders[cat]

        # encode new data
        df[cat] = pd.Categorical(df[cat].astype(str), categories=categories)

    # load embeddings info
    with open('cat_sz_{:}.pickle'.format(FILENAME_SUBSCRIPT), 'rb') as handle:
        cat_sz = pickle.load(handle)

    with open('emb_szs_{:}.pickle'.format(FILENAME_SUBSCRIPT), 'rb') as handle:
        emb_szs = pickle.load(handle)

    # load output range
    with open('yrange_{:}.pickle'.format(FILENAME_SUBSCRIPT), 'rb') as handle:
        y_range = pickle.load(handle)

sgugger · November 14, 2018, 10:53pm

Quick note: learn.predict(row) where row is a row of a dataframe with the same column names as the original dataframe (with or without the dependent variable, that will be ignored) should work.

dom.raute · November 15, 2018, 1:47pm

@sgugger - That would be pretty awesome, but unfortunately it doesn’t work, at least in my notebook. Which version of fast.ai is it supposed to work with? (I had another, unrelated bug with non-hashable DataFrames in v1.0.24, which pip gave me as the latest version.)

IF it did work, this would be awesome to have in the docs

> learn = get_tabular_learner(data_bunch, layers=[200,100], metrics=accuracy)
> learn.fit(10, 1e-3)

 Total time: 03:02
 epoch  train_loss  valid_loss  accuracy
 1      0.292954    0.437396    0.849206  (00:14)

> learn.predict(valid_df.iloc[0])
 ---------------------------------------------------------------------------
 AttributeError                            Traceback (most recent call last)
 <ipython-input-23-046263f97c0c> in <module>
 ----> 1 learn.predict(valid_df.iloc[0])
 
 AttributeError: 'Learner' object has no attribute 'predict'

> fastai.__version__
'1.0.22'

sgugger · November 15, 2018, 2:30pm

It only works on the latest version as it was implemented pretty recently. Any help to add those new features in the overview documentation of each application would be greatly appreciated!

oxyd33 · November 15, 2018, 2:56pm

I just updated the fastai library with pip and now the code from https://docs.fast.ai/tabular.html
breaks. Before the update it worked fine. I did not touch my code, just updated fastai.

Here is the code that worked before the update (pretty much just copy paste, some modifications in the table but that was no problem):

from fastai import datasets as da
from fastai import basic_train as ba
from fastai import metrics as me
from fastai.tabular import transform as tr
from fastai.tabular import data as dat

path='/path-to-folder/'
dep_var = '>=50k'

df=da.pd.read_csv(path + 'adultOriginal_shortSmall.csv')
cat_names=['workclass', 'occupation', 'sex', 'native-country']

df.head()
tfms=[tr.FillMissing, tr.Categorify]
train_df, valid_df=df[:-2000].copy(),df[-2000:].copy()
data=dat.TabularDataBunch.from_df(path, train_df, valid_df, dep_var, tfms=tfms, cat_names=cat_names)
print(data.train_ds.cont_names)
(cat_x,cont_x),y=next(iter(data.train_dl))
for o in (cat_x, cont_x, y): print(ba.to_np(o[:5]))
learn=dat.get_tabular_learner(data, layers=[200,100], emb_szs={'native-country': 10}, metrics=me.accuracy)
learn.fit_one_cycle(1, 1e-2)

This is the error I get now (from data):
[data=dat.TabularDataBunch.from_df(path, train_df, valid_df, dep_var, tfms=tfms, cat_names=cat_names)]

Traceback (most recent call last):
  File "ADULT.py", line 21, in
    data=dat.TabularDataBunch.from_df(path, train_df, valid_df, dep_var, tfms=tfms, cat_names=cat_names)
  File "/home/oxyd11/.virtualenvs/VE36new/lib/python3.6/site-packages/fastai/tabular/data.py", line 113, in from_df
    cont_names = ifnone(cont_names, list(set(df)-set(cat_names)-{dep_var}))
  File "/home/oxyd11/.virtualenvs/VE36new/lib/python3.6/site-packages/pandas/core/generic.py", line 1492, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
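The failing frame evaluates {dep_var}, and building a set literal has to hash its element. A small sketch of that behaviour with made-up data follows. The inference that the updated from_df now receives a DataFrame in its dep_var position (for example because the positional arguments changed) is an assumption drawn from the traceback, not something confirmed in this thread.

```python
import pandas as pd

# Made-up stand-ins for the objects in the failing expression.
valid_df = pd.DataFrame({"age": [30]})
columns = {"age", "workclass", ">=50k"}

try:
    # Mirrors set(df) - set(cat_names) - {dep_var} with a DataFrame as dep_var.
    columns - {"workclass"} - {valid_df}
    raised = False
except TypeError:
    raised = True  # a DataFrame cannot be hashed, so the set literal fails
```

If that reading is right, passing the arguments by keyword would surface a signature mismatch more directly than positional passing does.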

--> So the newest version with learn.predict(row) is only available through Git?

THANKS