Unable to get predictions on validation dataset (fastai v2)

Hello All,

I am trying to get predictions and metric scores on my validation data using the fastai v2 library. Can someone help me out?

I am using the code below to create the DataBlock and classifier:

dls_clas = DataBlock(
    blocks=(TextBlock.from_df('text', vocab=dls_lm.vocab), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('label'),
    splitter=RandomSplitter()
).dataloaders(data, seq_len=72, bs=64)

learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
metrics=accuracy).to_fp16()

If possible, also let me know how to export the train (transformed text preferred) and validation data with predictions to a CSV.

Regards,
Chaitanya Kanth.


Hi @Chaitanyakanth,

You can get your train predictions and corresponding (transformed) input texts with learn_clas.get_preds(ds_idx=0, with_input=True). For the validation set this would be learn_clas.get_preds(with_input=True) which uses ds_idx=1 by default.

Then you can simply pack the input texts and predictions into a dataframe and save it as a csv file.

Note that this returns predictions and input texts in a different order than your original dataframe. If you want results in the original order, you can use dls_clas.train.get_idxs() and dls_clas.valid.get_idxs() and then sort your results accordingly.
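A minimal sketch of that pack-and-reorder step, using hypothetical stand-in lists in place of the actual outputs of learn_clas.get_preds(ds_idx=0, with_input=True) and dls_clas.train.get_idxs():

```python
import pandas as pd

# Stand-ins: in fastai these would come from get_preds(..., with_input=True)
# and dls_clas.train.get_idxs(); values here are illustrative only.
inputs = ["xxbos a much longer text", "xxbos medium text", "xxbos short text"]
preds = [1, 0, 0]
idxs = [0, 2, 1]  # original-dataframe row index of each prediction

# Pack results with their original row indices, then sort to restore
# the original dataframe's ordering before saving to CSV.
df = pd.DataFrame({"idx": idxs, "text": inputs, "pred": preds})
df = df.sort_values("idx").reset_index(drop=True)
df.to_csv("train_preds.csv", index=False)
```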


@stefan-ai How about on the test set?

Thanks for the answer @stefan-ai. It worked.

However, I am unable to fetch my actual target labels from the DataLoader or classifier. I only have the encoded values of the target labels.

Can you let me know how to get the dictionary mapping of the labels?

First create a test dataloader:

test_dl = learn.dls.test_dl(test_df)

and then get your predictions:

learn.get_preds(dl=test_dl, with_input=True)


You can find your class names in dls_clas.vocab[1]. Indexing into that with your encoded targets will return the corresponding class name.
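For illustration, here is how that indexing works with a hypothetical class vocab standing in for dls_clas.vocab[1] (the class names and encoded targets below are made up):

```python
# Stand-in for dls_clas.vocab[1]: a list-like mapping index -> class name
class_vocab = ["CLAMP", "DRILL", "SAW"]

# Encoded targets as returned by get_preds(); index into the vocab to decode
encoded_targets = [2, 0, 1, 0]
decoded = [class_vocab[t] for t in encoded_targets]
# decoded == ["SAW", "CLAMP", "DRILL", "CLAMP"]
```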


Thanks!!!

Hello stefan,

I used the code for validation and it worked. However, if I use it for fetching predictions on my training set with the code below, I get this error. Any idea what the issue could be?

probs_train, targets_train, preds_train = learn_clas.get_preds(ds_idx=0, with_decoded=True)

Error stack trace:

RuntimeError Traceback (most recent call last)
in ()
----> 1 probs_train, targets_train, preds_train = learn_clas.get_preds(ds_idx=0, with_decoded=True)

1 frames
/usr/local/lib/python3.6/dist-packages/fastai/learner.py in get_preds(self, ds_idx, dl, with_input, with_decoded, with_loss, act, inner, reorder, cbs, **kwargs)
240 res[pred_i] = act(res[pred_i])
241 if with_decoded: res.insert(pred_i+2, getattr(self.loss_func, 'decodes', noop)(res[pred_i]))
--> 242 if reorder and hasattr(dl, 'get_idxs'): res = nested_reorder(res, tensor(idxs).argsort())
243 return tuple(res)
244 self._end_cleanup()

/usr/local/lib/python3.6/dist-packages/fastai/torch_core.py in tensor(x, *rest, **kwargs)
125 else _array2tensor(x) if isinstance(x, ndarray)
126 else as_tensor(x.values, **kwargs) if isinstance(x, (pd.Series, pd.DataFrame))
--> 127 else as_tensor(x, **kwargs) if hasattr(x, 'array') or is_iter(x)
128 else _array2tensor(array(x), **kwargs))
129 if res.dtype is torch.float64: return res.float()

RuntimeError: Could not infer dtype of iterator

Regards,
Chaitanya Kanth.

Hello @stefan-ai,

I am trying to attach predictions to my dataframe. So I have called dls_clas.train.get_idxs() to get the row indices of the data that were part of training. Then I subset them from my actual dataframe like below:

train_ids = list(dls_clas.train.get_idxs())
train_data = data.loc[train_ids, :]

Then I fetch my actual targets and corresponding predictions using the line below:

probs_train, targets_train, preds_train = learn_clas.get_preds(ds_idx=0, with_decoded=True, reorder=False)

Then I attach the targets and predictions to the subset dataframe like below:

train_data['actual_label'] = targets_train
train_data['predicted_label'] = preds_train

I took the class-to-index mapping using:

class_enc = dls_clas.vocab[1]
class_enc.o2i

If I check the head of my subset dataframe, the actual label (in text form) and the encoded number don't match the o2i dictionary.

Please let me know if I am missing something or if anything extra has to be done.

Regards,
Chaitanya Kanth.

I think it's because your ordering is wrong.

When you create train_data you simply select the items that are part of the training set, but they will not be in the same order as the output from learn_clas.get_preds.

What I recommend to ensure the correct ordering is the following:

  1. get your training dataframe with your original texts and labels
  2. get predictions and set with_input=True
  3. get indices using .get_idxs()
  4. create a new dataframe that contains predictions and input texts from step 2 and indices from step 3
  5. sort the dataframe from step 4 according to the indices from step 3
  6. join your original dataframe from step 1 with the new re-ordered dataframe from step 5
  7. you can ensure that the ordering is correct by comparing original text and input text for a few examples

In this case the dictionary should map to your original labels in text form.
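The steps above can be sketched end to end with a tiny hypothetical dataframe, using stand-in lists where fastai would supply get_preds(..., with_input=True) and .get_idxs() (all values below are made up for illustration):

```python
import pandas as pd

# Step 1: hypothetical original training dataframe
orig = pd.DataFrame({"text": ["a", "bb", "ccc", "dddd"],
                     "label": ["x", "y", "x", "y"]})

# Steps 2-3: stand-ins for get_preds(..., with_input=True) output and
# get_idxs(); the dataloader returns items roughly sorted by length,
# and idxs records each result's original row
idxs = [3, 2, 1, 0]
inputs = ["dddd", "ccc", "bb", "a"]
preds = ["y", "x", "y", "x"]

# Steps 4-5: pack results with indices and sort back to original order
res = (pd.DataFrame({"idx": idxs, "input_text": inputs, "pred": preds})
         .sort_values("idx").reset_index(drop=True))

# Step 6: results now align row-for-row with the original dataframe
merged = orig.join(res[["input_text", "pred"]])

# Step 7: sanity check - original text should match the input text
assert (merged["text"] == merged["input_text"]).all()
```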

Thanks for the detailed explanation @stefan-ai. However, I have a couple of doubts about three of the steps.

Step 1) I don't have a separate train dataframe (no is_valid column) and I am using RandomSplitter() to get the train and validation splits. So I had used the output of .get_idxs() to separate the training and validation data. Are the ids it returns not exactly the row indices in our original dataframe?

Step 5) Can you elaborate on the sort? Because we have a dataframe and a list of row indices at hand.

Step 6) For the join we need one common attribute, and if get_idxs() doesn't return row indices, how can we perform the join on the original dataframe? Even if we perform the join on this get_idxs() output, I will have the original problem that I started with, right?

I thought this process would be simpler, but it turns out I need more information.

Regards,
Chaitanya kanth

Right, the easiest fix would be to randomly shuffle your dataframe, add an is_valid column and set it to True for the last x% of examples. Then you can use ColSplitter() instead of RandomSplitter().

See step 4. If you add the row indices to your dataframe you can use df.sort_values(by='...')

Sorry, wrong formulation on my part. You can just add the columns from your re-ordered dataframe with results to your original dataframe, since they should now have the same ordering.
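A minimal sketch of that shuffle-and-is_valid fix, on a made-up dataframe (20% validation split assumed for illustration):

```python
import pandas as pd

# Hypothetical dataframe; shuffle once so the split is random but fixed
data = pd.DataFrame({"text": [f"doc {i}" for i in range(10)],
                     "label": ["a", "b"] * 5})
data = data.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Mark the last 20% of rows as the validation set
cutoff = int(len(data) * 0.8)
data["is_valid"] = [False] * cutoff + [True] * (len(data) - cutoff)

# In the DataBlock you would then pass splitter=ColSplitter() instead of
# RandomSplitter(), so the train/valid split is reproducible.
```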

Hello @stefan-ai,

I have tried the steps you mentioned, but still could not get the ordering right. I have given this process a lot of thought and want to confirm a few things before we proceed any further:

  1. Does .get_idxs() return row indices from our original dataframe? My answer is yes. Please correct me if wrong.
  2. Is the ordering of predictions from the line of code below the same as the order of row indices returned in step 1?
    If YES, then subsetting rows (from the original dataframe) using indices from .get_idxs() should exactly match our predictions and targets, which is not the case.
    If NO, is it possible at all to get the right order? Because we don't know the order in which they are returned.

probs_train, targets_train, preds_train = learn_clas.get_preds(ds_idx=0, with_decoded=True, reorder=False)

I have also tried the steps that you mentioned, but in vain. Below is a step-by-step analysis:

probs_train, targets_train, preds_train = learn_clas.get_preds(ds_idx=0, with_decoded=True, reorder=False)

train_ids = list(dls_clas.train.get_idxs())

recons_train_df = pd.DataFrame({'TrainID': train_ids, 'Targets': targets_train, 'Predictions': preds_train})
recons_train_df.sort_values(by='TrainID', inplace=True)

recons_train_df.head()

train_data = data.loc[train_ids, :]  # subset the indices from the original dataframe

train_data_sort = train_data.sort_index(axis=0, ascending=True)  # checked with and without sort; labels don't match

train_data_sort.head()

Sorry for the long post. But it is kind of important to me to perform error analysis on the results.

Regards,
Chaitanya Kanth.

Yes, that’s the case.

Yes, that also should be the case. What's going on behind the scenes is that the training dataloader sorts the texts roughly by length (while the validation dataloader sorts exactly by length) to get rid of excessive padding. Both .get_preds() and .get_idxs() then return results in the same order as the dataloader. So if you pack the outputs of both methods into a dataframe and sort it according to the indices, you should get back the same ordering as your original dataframe.
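A stand-in illustration of that behaviour in plain Python (no fastai, and a simple exact length sort rather than fastai's approximate one): sorting the returned indices restores the original order.

```python
# Hypothetical texts; the dataloader processes them sorted by length
texts = ["a longer example text", "short", "a medium text"]

# 'order' plays the role of get_idxs(): each processed item's original row
order = sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)
processed = [texts[i] for i in order]  # what get_preds() sees

# argsort of the indices maps results back to the original ordering
inverse = sorted(range(len(order)), key=lambda i: order[i])
restored = [processed[i] for i in inverse]
assert restored == texts
```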

Tbh, I'm a bit out of ideas here :thinking: The procedure I described is a bit of an unfortunate workaround but it has always worked for me.

Just to be clear. The issue you’re having is that the labels in train_data_sort are not the same as the Targets in recons_train_df, right? So e.g. dls.vocab[1][361] does not return “CLAMP”?

Thanks for the clarification @stefan-ai

Yes. The class at index 361 is not 'CLAMP' as expected.

One more doubt that I have: with with_input=True in get_preds(), it returns a numerical representation of the text, right? This mapping of numbers to tokens can be found in dls.vocab[0].

I am trying to undo this numericalization (to get the text representation) for each entry with help from the above dictionary. However, I am unable to parse through the TensorText type of the list representation highlighted below.

Any idea on how to get text representation of the data entry here? I have gone through the docs.fast.ai for TensorText, but couldn’t find any useful methods.

Regards,
Chaitanya Kanth

You can use this loop for decoding the TensorTexts:

decoded = []
for i in range(len(out[0])):  # out = learn.get_preds(..., with_input=True)
    dec = [learn.dls.vocab[0][o] for o in out[0][i]]
    decoded.append(' '.join(dec))