How to predict on a dataset other than the test and validation data sets

Hello,
I have been working with some structured data sets like the Rossmann dataset. I saved a model which I want to use to predict on several different data frames, but I can’t figure out how to predict on anything other than my test and validation data. I was wondering how I would go about predicting on a structured pandas data frame which is not my test_df, or if there is any way I could change my test_df after saving and loading my model.
Thanks, I would appreciate any help I can get.

Did you use a learner? Do you have continuous and categorical variables? If so, something like this should work:

preds = to_np(learn.model(*V([tst_cats, tst_conts])))

Where learn is the fitted fastai learner, tst_cats is a numpy array of the categorical variables in the dataset you want to score, and tst_conts is a numpy array of the continuous variables in the dataset you want to score.

I just ran into a similar use case. I feel like there should be some method on the learner that lets you predict on an arbitrary dataset.

You can always create another data object and apply it to the model with learn.set_data(new_data).

I have a pandas data frame which I would like to use as the new test data frame. As a result, I ran m.set_data(joined_test). After I tried to set the new test data frame, I ran pred_test = m.predict(True) and it gave me the error ‘DataFrame’ object has no attribute ‘test_dl’. Do you know how I can define the data frame as the new test data when using learn.set_data(new_data)?

set_data is a function of conv_learner, which is used in CNNs like resnet and others, so I am not sure it will work in your case. In any case, fastai models need ModelData objects, which hold the training dataset, validation dataset, training dataloader, etc. That is why you get that error: you can’t just assign any data.

Can you share more of your code?

I did that by using the ModelData object, as @fredguth is implying. In my case, I took care to generate the test df while generating the train and validation data sets. Although it is of course not a requirement to do it at the same time, it does make you follow a sort of best practice of passing any additional datasets through the same preprocessing pipeline your model went through.

@Patrick Yes, I did use a learner, and yes, I do have continuous and categorical variables. When creating the numpy array of categorical variables and the other array of continuous variables, I attempted using the pandas function DataFrame.as_matrix():
tst_cats = joined_test.as_matrix(columns=[all my categorical variables])

tst_conts = joined_test.as_matrix(columns=[all my continuous variables])

preds = to_np(m.model(*V([tst_cats, tst_conts])))

It gives me back this error after running the final line above:

NotImplementedError                       Traceback (most recent call last)
<ipython-input-87-7262c2075766> in <module>()
----> 1 preds = to_np(m.model(*V([tst_cats, tst_conts])))

~/fastai/courses/dl1/fastai/core.py in V(x, requires_grad, volatile)
     49 def V(x, requires_grad=False, volatile=False):
     50     '''creates a single or a list of pytorch tensors, depending on input x. '''
---> 51     return map_over(x, lambda o: V_(o, requires_grad, volatile))
     52 
     53 def VV_(x):

~/fastai/courses/dl1/fastai/core.py in map_over(x, f)
      6 def is_listy(x): return isinstance(x, (list,tuple))
      7 def is_iter(x): return isinstance(x, collections.Iterable)
----> 8 def map_over(x, f): return [f(o) for o in x] if is_listy(x) else f(x)
      9 def map_none(x, f): return None if x is None else f(x)
     10 def delistify(x): return x[0] if is_listy(x) else x

~/fastai/courses/dl1/fastai/core.py in <listcomp>(.0)
      6 def is_listy(x): return isinstance(x, (list,tuple))
      7 def is_iter(x): return isinstance(x, collections.Iterable)
----> 8 def map_over(x, f): return [f(o) for o in x] if is_listy(x) else f(x)
      9 def map_none(x, f): return None if x is None else f(x)
     10 def delistify(x): return x[0] if is_listy(x) else x

~/fastai/courses/dl1/fastai/core.py in <lambda>(o)
     49 def V(x, requires_grad=False, volatile=False):
     50     '''creates a single or a list of pytorch tensors, depending on input x. '''
---> 51     return map_over(x, lambda o: V_(o, requires_grad, volatile))
     52 
     53 def VV_(x):

~/fastai/courses/dl1/fastai/core.py in V_(x, requires_grad, volatile)
     46 def V_(x, requires_grad=False, volatile=False):
     47     '''equivalent to create_variable, which creates a pytorch tensor'''
---> 48     return create_variable(x, volatile=volatile, requires_grad=requires_grad)
     49 def V(x, requires_grad=False, volatile=False):
     50     '''creates a single or a list of pytorch tensors, depending on input x. '''

~/fastai/courses/dl1/fastai/core.py in create_variable(x, volatile, requires_grad)
     40 def create_variable(x, volatile, requires_grad=False):
     41     if type (x) != Variable:
---> 42         if IS_TORCH_04: x = Variable(T(x), requires_grad=requires_grad)
     43         else:           x = Variable(T(x), requires_grad=requires_grad, volatile=volatile)
     44     return x

~/fastai/courses/dl1/fastai/core.py in T(a, half, cuda)
     34         elif a.dtype in (np.float32, np.float64):
     35             a = torch.cuda.HalfTensor(a) if half else torch.FloatTensor(a)
---> 36         else: raise NotImplementedError(a.dtype)
     37     if cuda: a = to_gpu(a, async=True)
     38     return a

NotImplementedError: object

So how should I make the numpy arrays of the categorical variables and the continuous variables?

@Gabriel_Syme I think I did pass in the test_df originally, along with the validation and training data, when I first created the model before fitting it:

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,test_df=df_test)

Is it possible to use that same ColumnarModelData to assign a new test_df after training the model, so that I can use the predict function to predict on the new test_df?
pred_test = m.predict(True)

Also @fredguth, that would make sense as to why it didn’t work, since I am not using a CNN. Here is some of my code:

max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,
                                       test_df=df_test)

cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]

cat_sz

emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

emb_szs

#m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
#                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
#lr = 1e-3

#m.lr_find()

#m.sched.plot(10)

Sample

#m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
#                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
#lr = .05

#m.fit(lr, 3)

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-4

m.fit(lr, 1, cycle_len=7)

torch.save(m,'Model.pt')

m = torch.load('Model.pt')

Predictions

pred_test = m.predict(True)

Instead of using the test_df like above, I would like to use a different dataset.

df_sp = pd.read_csv('../DATA/WORK/sp.txt', sep=',')

joined_valid = joined_test

joined_valid['predictor'] = pred_test

joined_valid = joined_valid.dropna(subset=['id']).copy()

joined_valid['id'] = joined_valid['id'].astype('int').copy()

joined_valid = pd.merge(joined_valid, df_a, on=['id']).copy()

joined_valid = joined_valid[joined_valid.status == 'A']

joined_valid = joined_valid.sort_values(by=['predictor']).copy()

csv_fn=f'{PATH}/tmp/finalresults.csv'

joined_valid[['Month','Day','id','predictor','p_id','bn', 'status']].to_csv(csv_fn, index=False)

FileLink(csv_fn)

I figured out how to predict on multiple different test datasets:

# run the new data frame through the same preprocessing used for training,
# re-using the mapper and na_dict from the original call to proc_df
df_test, _, nas, mapper = proc_df(joined_test, 'hit_ind', do_scale=True, mapper=mapper, na_dict=nas)

# wrap the processed frame in a ColumnarDataset, naming the categorical fields
cds = ColumnarDataset.from_data_frame(df_test, cat_flds=cat_vars)

# build a DataLoader over it and score it with the saved learner
dl = DataLoader(cds)

predictions = m.predict_dl(dl)

Thank you for everyone’s suggestions on how to accomplish this. I really appreciate @fredguth, @Patrick, and @Gabriel_Syme taking the time to help me out. Everyone was a big help. Thank you very much!


Hi @MarkDel, I just started using fastai and I am pretty new to deep learning. I am trying to solve your original question for new data as well. I am building a web interface using Python and Flask. The goal is to collect new variables from the user and make predictions using the saved model that was trained previously. I am, however, stuck, as I don’t know how to load the saved model and make predictions in a different notebook without refitting the model. Could you share the process you used? Maybe that’ll help me. Thank you.

Found out the problem was with the mapper. I saved it using pickle and loaded it into the new notebook, and everything seems fine now. I would be interested to know if anyone has had a similar issue and if there is a better solution that I have missed. Thank you.

Hi @omolorun,
I had a similar problem with saving and loading my model on my AWS EC2 instance. I never saved it using pickle, so that may have been my problem too, but to solve the issue I used PyTorch, the library the fast.ai library is built on top of. I just imported PyTorch and used its built-in save and load functions, which allowed me to reuse the saved model.

If you’re using a Learner object, I’ve found the easiest way to predict on new data is to create a new dataloader the same way you created the original, then have the Learner predict on it by calling learn.predict_dl(new_dataloader).

Newer versions of fastai don’t have predict_dl.

Does anyone have a solution for this in version 1 of fast.ai? The new version doesn’t have predict_dl, nor does it have the set_data method.

Guys, any solution with the newer version of fastai?

I am training a learner with

learn.fit_one_cycle(5)

and I can get predictions on the test set with the code below:

preds_test, _ = learn.get_preds(ds_type=DatasetType.Test)
pred_test_prob, pred_test_class = preds_test.max(1)

How do I get predictions on a new dataset which is not part of the DataBunch? I am trying the code below:

validateObj = TabularList.from_df(validate, cat_names=cat_names, cont_names=cont_names, procs=procs)

preds, _ = learn.get_preds(validateObj)
pred_prob, pred_class = preds.max(1)

but it gives me an array of 108,000 predictions, compared to my 45,000-row validation set.

Is there a simple way to score new data using the above learner?