Making predictions in v1

dhoa · December 10, 2018, 2:22pm

Oh, it is because the post is in the Part1_V3 group, which you don’t have access to. Sorry.

I quote it here:

I haven’t played with the tabular module in fastai v1 yet, but I guess it has a function similar to add_test_df?

sgugger · December 10, 2018, 3:17pm

It’s going to be add_test in this case, where you can put your TabularList.from_df(...) for the test dataframe.

Edit: the inference tutorial has been updated to show this.

shaun1 · December 13, 2018, 5:55pm

In the case of a tabular learner on a regression problem with a test set added during data bunch creation, what are the values in the tuples returned? The code says:

    "Tuple of predictions and targets, and optional losses (if `loss_func`) using `dl`, max batches `n_batch`."

So, using Rossmann as an example, what are the tensors returned by get_preds?

preds = learn.get_preds(ds_type=DatasetType.Test)
preds

[tensor([[ 8.3814],
         [ 8.9268],
         [ 9.1673],
         ...,
         [ 8.7692],
         [10.0370],
         [ 8.8672]]),
 tensor([2.1998, 2.1998, 2.1998,  ..., 2.1998, 2.1998, 2.1998])]

sgugger · December 13, 2018, 6:24pm

The first tensor contains your predictions, the second your targets (dummy targets since you’re using the test set, which is why you have the same thing all the time).
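
A minimal sketch of how the returned pair can be unpacked. Plain Python lists stand in for the tensors (the values are copied from the output above), so nothing here calls fastai itself; the names `result`, `flat_preds` and `targets_are_constant` are made up for the illustration.

```python
# Stand-in for the list returned by learn.get_preds(ds_type=DatasetType.Test).
# Plain lists are used instead of tensors so the sketch runs without fastai.
result = [
    [[8.3814], [8.9268], [9.1673]],   # predictions, one row per test item
    [2.1998, 2.1998, 2.1998],         # dummy targets (the test set has no labels)
]

preds, dummy_targets = result

# For a single-output regression each prediction row holds one value; flatten to 1-D.
flat_preds = [row[0] for row in preds]

# The dummy targets all repeat one placeholder value, so they carry no information.
targets_are_constant = len(set(dummy_targets)) == 1

print(flat_preds)
print(targets_are_constant)
```

With real tensors the same idea applies: keep the first element and ignore the second when predicting on the test set.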

tcapelle · December 19, 2018, 8:46pm

TabularList does not have the load_empty method. Don’t know how to add_test the data without creating the whole dataset from scratch.

sgugger · December 19, 2018, 8:49pm

You’re not supposed to use TabularList but LabelLists for load_empty as shown in the tutorial.

tcapelle · December 19, 2018, 8:54pm

LabelList has the load_empty method but does not have the add_test method; conversely, LabelListS has add_test but not load_empty.
I am confused…

My bad, it has been fixed. Now both have both methods.

AdrianT · December 24, 2018, 8:58am

Following up on your answer to shaun1, I tried to make predictions on the bulldozers dataset using the rossmann code. However, the first tensor (which contains the predictions) has many more rows than the test dataframe: I expected 12457 rows but got 320900. Am I interpreting it wrongly, or does my mistake lie somewhere else?

sgugger · December 24, 2018, 9:13am

Normally get_preds takes a DatasetType now, not a dataset directly. Which version of fastai are you using?

AdrianT · December 24, 2018, 9:27am

I am using 1.0.38. It seems like I have given it an invalid input, how should I get the predictions instead? I could not figure it out from the documentation. Thanks in advance!

sgugger · December 24, 2018, 9:43am

Like I said, pass a DatasetType, like DatasetType.Test for test.

shaun1 · December 25, 2018, 2:50pm

I created a language model using custom data and saved the model to disk. I was able to call learn.predict() in the same session and get predicted words/sentences. However, in a new session, after initializing the learner and loading my saved model, when I call learn.predict(), I get the following error:

learn.predict('New computer keyboard', n_words=50)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-6b0b245e79d0> in <module>
----> 1 learn.predict('New computer keyboard')

~/fastai/fastai/text/learner.py in predict(self, text, n_words, no_unk, temperature, min_p)
     93         self.model.reset()
     94         for _ in progress_bar(range(n_words), leave=False):
---> 95             xb, yb = self.data.one_item(text)
     96             xb = xb.view(-1,1)
     97             res = self.pred_batch(batch=(xb,yb))[0][-1]

~/fastai/fastai/basic_data.py in one_item(self, item, detach, denorm)
    153         "Get `item` into a batch. Optionally `detach` and `denorm`."
    154         ds = self.single_ds
--> 155         with ds.set_item(item):
    156             return self.one_batch(ds_type=DatasetType.Single, detach=detach, denorm=denorm)
    157 

/net/vaosl01/opt/NFS/su0/anaconda3/envs/mer/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None

~/fastai/fastai/data_block.py in set_item(self, item)
    493     def set_item(self,item):
    494         "For inference, will briefly replace the dataset with one that only contains `item`."
--> 495         self.item = self.x.process_one(item)
    496         yield None
    497         self.item = None

~/fastai/fastai/data_block.py in process_one(self, item, processor)
     73         if processor is not None: self.processor = processor
     74         self.processor = listify(self.processor)
---> 75         for p in self.processor: item = p.process_one(item)
     76         return item
     77 

~/fastai/fastai/text/data.py in process_one(self, item)
    249         self.vocab,self.max_vocab,self.min_freq = vocab,max_vocab,min_freq
    250 
--> 251     def process_one(self,item): return np.array(self.vocab.numericalize(item), dtype=np.int64)
    252     def process(self, ds):
    253         if self.vocab is None: self.vocab = Vocab.create(ds.items, self.max_vocab, self.min_freq)

AttributeError: 'NoneType' object has no attribute 'numericalize'

Is this expected behavior? In other words, can we only predict language model sentences in the same session where we create the language model?

After loading the data, I was able to confirm that the data for the LM has a vocab class associated with it (i.e., not None). These are the steps that lead to the above error:

custom_toks = ['rrname', 'rrdocln', 'rrln', 'rrdocfn', 'rrfn', 'rrinits', 'rrhosp', 'rrwork', 'rrloc', 'rrcntry', 'rrstate', 'rraddr',\
              'rrdate', 'rrmmdd', 'rryear', 'rrmnth', 'rrhols', 'rrdtrange', 'rrpager', 'rrradclip', 'rrssn', 'rrmrno', 'rrage90',\
              'rrsno', 'rrunitno', 'rrmdno', 'rrph', 'rrpno', 'rrjobno', 'rrdicinfo', 'rrcontinfo', 'rrattinfo', 'rr3digit', 'rr2digit',\
              'rr1digit', 'rrhour', 'rrmidngt', 'rrdawn', 'rrfore', 'rrafter', 'rrdusk', 'rrngt']

tok_proc = TokenizeProcessor(tokenizer=Tokenizer(special_cases=custom_toks, n_cpus=1), mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)

data_lm = TextLMDataBunch.load(path, 'lm-toknum', processor=[tok_proc, num_proc])
data_lm.show_batch()

data_lm.vocab.numericalize is not None
True

However, when I run the debug magic and get to the step def process_one(self,item): return np.array(self.vocab.numericalize(item), dtype=np.int64), self.vocab is indeed None. I’m not sure why though.
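
One possible reading of the traceback, not confirmed in this thread: the check on data_lm.vocab may be looking at a different object than the processor that process_one uses. The source lines in the traceback show that the processor takes `vocab` in its constructor and only builds one inside `process` when it is still None. The toy classes below (`ToyVocab` and `ToyNumericalizeProcessor` are hypothetical stand-ins, not fastai classes) reproduce that pattern and the same AttributeError:

```python
class ToyVocab:
    "Stand-in vocabulary: maps tokens to integer ids, unknown tokens to 0."
    def __init__(self, itos):
        self.itos = itos
        self.stoi = {tok: i for i, tok in enumerate(itos)}

    def numericalize(self, tokens):
        return [self.stoi.get(tok, 0) for tok in tokens]


class ToyNumericalizeProcessor:
    "Stand-in processor: `vocab` stays None unless passed in or built by `process`."
    def __init__(self, vocab=None):
        self.vocab = vocab

    def process(self, items):
        # Only here is a missing vocab created from the data.
        if self.vocab is None:
            tokens = sorted({tok for item in items for tok in item})
            self.vocab = ToyVocab(["xxunk"] + tokens)

    def process_one(self, item):
        # Raises AttributeError while self.vocab is still None.
        return self.vocab.numericalize(item)


fresh = ToyNumericalizeProcessor()  # `process` was never run on this object
try:
    fresh.process_one(["new", "computer"])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

vocab = ToyVocab(["xxunk", "new", "computer", "keyboard"])
ready = ToyNumericalizeProcessor(vocab=vocab)  # vocab supplied up front
ids = ready.process_one(["new", "computer", "keyboard"])

print(error_message)
print(ids)
```

If that reading is right, handing the saved vocab to the processor at construction time might avoid the error, but that is an assumption to verify against the installed fastai version.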

adrian · December 26, 2018, 4:46am

I must be missing something, and have spent a good few hours trying to work this out.

Using: 1.0.39.dev0

I have a TabularDataBunch created like this:

data = TabularDataBunch.from_df(path, df=df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_vars, cont_names=cont_vars, test_df=df_test)

and a learner created as per below:

learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, metrics=accuracy)

I can create a single prediction ok using:

learn.predict(df_test.iloc[0])

Category the_predicted_category,
 tensor(249),
 tensor([2.6166e-04, 7.0190e-05, 1.6028e-05, 5.5844e-06, 1.7006e-03, 1.3622e-04,
         4.4630e-04, 5.3242e-05, 5.2404e-06, 8.1808e-04, 1.8185e-05...

However, I want to predict on the entire test set and get the category (i.e. ‘the_predicted_category’ as above) for each row in the test dataset.

I can get the predictions like this:

preds, y = learn.get_preds(DatasetType.Test)

which returns:

   tensor([9.1703e-08, 2.5174e-07, 1.0606e-07, 3.4502e-08, 3.1051e-07, 4.4129e-07,
    1.0235e-06, 2.3513e-07, 3.2420e-08, 6.2728e-07, 3.2848e-07, 7.2376e-07,
    4.3272e-07, 4.3480e-07, 7.6543e-07, 4.0239e-07, 1.6350e-07, 2.8251e-07,
    3.4620e-07, 2.1245e-07, 1.3313e-07, 4.1516e-07...

Then, from the index of the most likely class in the tensor above, how do I get the actual class it corresponds to?

Things I have tried:

My predicted tensor has dimension torch.Size([632])

If I get a CategoryList from the test dataset:

data.test_ds.y

It has a length of 8 and is non-unique - I presume this is giving me a batch of y data (same for data.test_dl.dataset.y)

Then if I look at:

data.train_ds.classes

the dep_var that I want to reconstruct the category name out of is not in the classes, even though it was included:

df = train_df[cat_vars + cont_vars + [dep_var]].copy()

Then when I look at TabularDataBunch.from_df() during execution:

src = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(valid_idx)
                           .label_from_df(cols=dep_var, classes=classes))

data.train_dl.c

returns 632 - which looks to be the number of categories in my dep_var…

Testing the assumption that fastai uses something similar to the line below to generate the categories from which test predictions are made (I couldn’t find anything like this in the code), and then indexing into ‘categories_to_index_into’ to get the predicted category:

categories_to_index_into=set(train_df['Target_Column'].values)

does not give the same predicted categories as per:

for idx, row in df_test.iterrows():
    pred = learn.predict(row)

Iterating through the entire dataframe row by row is infeasible as it contains several million rows. Has anyone managed to work this out?
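
For reference, a minimal sketch of the index-to-label step, using plain Python lists in place of tensors. It assumes the probabilities have one row per test item and one column per class, and that `classes` is the ordered label list the learner was trained with. The three labels here are hypothetical. In fastai v1 that list is likely found on data.train_ds.y.classes rather than data.train_ds.classes, but that is an assumption to check against your version. A Python set has no guaranteed order, which may be why indexing into set(...) gives different categories.

```python
def argmax(row):
    "Index of the largest value in a sequence."
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical ordered label list; the position of each label must match
# the column of the probability it belongs to.
classes = ["bird", "cat", "dog"]

# Stand-in for the (n_rows, n_classes) probabilities returned by get_preds.
probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

pred_idx = [argmax(row) for row in probs]       # most likely class per row
pred_labels = [classes[i] for i in pred_idx]    # map each index back to its label

print(pred_idx)
print(pred_labels)
```

With tensors, the per-row argmax can be taken in one vectorised call instead of a Python loop, which matters for several million rows.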

sgugger · December 26, 2018, 10:53am

It looks like your test set wasn’t properly loaded here.

Note that a test set is always unlabelled, so what you see when looking at data.test_ds.y is fake data. Still, it should have the right length (which doesn’t seem to be 8, since you were talking about several million rows). Make sure that data.test_ds.x contains something that makes sense for your data.
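
A small sanity check along these lines, as a sketch: the helper name `check_test_rows` and the row counts are hypothetical, and the fastai call in the comment is how it might be wired up, not something verified here.

```python
def check_test_rows(n_test_items, n_dataframe_rows):
    "Return an explanatory message if the counts differ, else None."
    if n_test_items == n_dataframe_rows:
        return None
    return (f"test set has {n_test_items} items but the dataframe has "
            f"{n_dataframe_rows} rows; the test data may not have been attached as intended")


# With fastai this might be called as check_test_rows(len(data.test_ds), len(df_test)).
ok = check_test_rows(1000, 1000)        # hypothetical matching counts
bad = check_test_rows(8, 2_000_000)     # hypothetical mismatch like the one described

print(ok)
print(bad)
```

Running the check right after building the databunch catches a mis-attached test set before any predictions are made.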

shaun1 · December 26, 2018, 12:57pm

I followed the instructions in LM inference here:

vocab = Vocab(pickle.load(open(path/'lm-toknum'/'itos.pkl', 'rb')))
empty_data = TextLMDataBunch.load_empty(path, fname='lm-meta-db.pkl')
learn = language_model_learner(empty_data)
learn.unfreeze()
learn.load('lm-acc-583', with_opt=False)

I got an error of dimension mismatch:

--------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-f04e66109460> in <module>
      2 learn = language_model_learner(empty_data)
      3 learn.unfreeze()
----> 4 learn.load('lm-acc-583', with_opt=False)

~/fastai/fastai/basic_train.py in load(self, name, device, strict, with_opt)
    213         state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)
    214         if set(state.keys()) == {'model', 'opt'}:
--> 215             get_model(self.model).load_state_dict(state['model'], strict=strict)
    216             if ifnone(with_opt,True):
    217                 if not hasattr(self, 'opt'): opt = self.create_opt(defaults.lr, self.wd)

/net/vaosl01/opt/NFS/su0/anaconda3/envs/mer/lib/python3.7/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    767         if len(error_msgs) > 0:
    768             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 769                                self.__class__.__name__, "\n\t".join(error_msgs)))
    770 
    771     def _named_members(self, get_members_fn, prefix='', recurse=True):

RuntimeError: Error(s) in loading state_dict for SequentialRNN:
	size mismatch for 0.encoder.weight: copying a param with shape torch.Size([60093, 400]) from checkpoint, the shape in current model is torch.Size([8, 400]).
	size mismatch for 0.encoder_dp.emb.weight: copying a param with shape torch.Size([60093, 400]) from checkpoint, the shape in current model is torch.Size([8, 400]).
	size mismatch for 1.decoder.weight: copying a param with shape torch.Size([60093, 400]) from checkpoint, the shape in current model is torch.Size([8, 400]).
	size mismatch for 1.decoder.bias: copying a param with shape torch.Size([60093]) from checkpoint, the shape in current model is torch.Size([8]).

shaun1 · December 26, 2018, 1:47pm

Ok, the problem is that empty_data = TextLMDataBunch.load_empty(path) loads with a vocab of size 8:

empty_data.vocab.itos
['xxunk', 'xxpad', 'xxbos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep']

even though I ran data_lm.export() after creating my original databunch. Hence, I’m getting the mismatch error.
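
A sketch of how such a mismatch could be spotted before calling load. The helper `find_shape_mismatches` is hypothetical, and the shapes are copied from the error message above. With PyTorch, the two dictionaries could be built from state dicts with something like `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    "List (name, checkpoint_shape, model_shape) for every parameter whose shape differs."
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches.append((name, tuple(ckpt_shape), tuple(model_shape)))
    return mismatches


# Shapes taken from the RuntimeError in the previous post.
checkpoint_shapes = {
    "0.encoder.weight": (60093, 400),
    "1.decoder.weight": (60093, 400),
    "1.decoder.bias": (60093,),
}
model_shapes = {
    "0.encoder.weight": (8, 400),
    "1.decoder.weight": (8, 400),
    "1.decoder.bias": (8,),
}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)

# The first dimension of each mismatched parameter is the vocabulary size.
checkpoint_vocab_sizes = {ckpt[0] for _, ckpt, _ in mismatches}
model_vocab_sizes = {model[0] for _, _, model in mismatches}

print(mismatches)
print(checkpoint_vocab_sizes, model_vocab_sizes)
```

Here every mismatch differs only in its first dimension (60093 versus 8), which points at the vocabulary rather than the architecture.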

AbuFadl · December 26, 2018, 1:56pm

I used both viz data and inference and it worked. Just keep track of your object variables, as they are reused and may cause problems. It looks like you are referring to two vocabs.

salil_23 · December 26, 2018, 2:01pm

I am unable to add test dataset in TabularList.from_df as it raises some KeyError. Can someone tell me what I am doing wrong?
Edit: I narrowed down the problem; it’s happening because of Categorify.

shaun1 · December 26, 2018, 2:12pm

I’m using the same vocab that I saved when I created my databunch for language modeling, and I exported that databunch, which is what is used to create the empty_data. I’m not sure why the empty_data has only 8 words in its vocab.

sgugger · December 26, 2018, 2:37pm

Just checked on master and this part is still working. If you use pickle to open the file path/'export.pkl', what do you see as vocab?
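
A minimal sketch of that kind of inspection using only the standard library. The layout of the real export file is not described in this thread, so the demo writes a throwaway pickle with made-up contents. `describe_pickle` is a hypothetical helper. Unpickling a real export would also need fastai to be importable, and pickle should only be used on files you trust.

```python
import pickle
import tempfile
from pathlib import Path


def describe_pickle(pkl_path):
    "Load a pickle and return its top-level type name and, for dicts, its keys."
    with open(pkl_path, "rb") as f:
        obj = pickle.load(f)
    keys = list(obj.keys()) if isinstance(obj, dict) else None
    return type(obj).__name__, keys


# Demonstration on a temporary file; the dictionary below is invented
# and does not reflect what fastai actually stores.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "export.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump({"vocab": ["xxunk", "xxpad"], "other": 1}, f)
    summary = describe_pickle(demo_path)

print(summary)
```

Once the top-level structure is known, the vocab it holds can be compared against the 60093 entries the checkpoint expects.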