Prediction for a Seq2Seq Task (Machine Translation)

def get_predictions(learn, ds_type=DatasetType.Valid):
    learn.model.eval()
    inputs, targets, outputs = [],[],[]
    with torch.no_grad():
        for xb,yb in progress_bar(learn.dl(ds_type)):
            out = learn.model(*xb)  # xb is a (source batch, target batch) tuple
            for x,y,z in zip(xb[0],xb[1],out):
                inputs.append(learn.data.train_ds.x.reconstruct(x))
                targets.append(learn.data.train_ds.y.reconstruct(y))
                outputs.append(learn.data.train_ds.y.reconstruct(z.argmax(1)))  # greedy decode per position
    return inputs, targets, outputs
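On the validation set this works as expected:

inputs, targets, outputs = get_predictions(learn)  # ds_type defaults to DatasetType.Valid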

The code above is copied from the Transformers notebook in Course-NLP; it gets predictions for the validation set. Now, if I try to use the same code for the test set and pass DatasetType.Test, I get an error. The error is thrown by the seq2seq_collate function in the same notebook, because it expects both source and target sentences, and our target sentences are empty. I'm looking for help on how to make this work. @jeremy @rachel
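For reference, here is roughly what that collate function does (a sketch reconstructed from the notebook and the traceback further down; the exact code may differ). It pads both source and target to the batch maximum, so it assumes every sample carries a real target sequence:

def seq2seq_collate(samples, pad_idx=1):
    samples = to_data(samples)
    max_len_x = max([len(s[0]) for s in samples])
    max_len_y = max([len(s[1]) for s in samples])  # breaks when s[1] is an empty label
    # pad both batches to their respective max lengths
    res_x = torch.zeros(len(samples), max_len_x).long() + pad_idx
    res_y = torch.zeros(len(samples), max_len_y).long() + pad_idx
    for i,s in enumerate(samples):
        res_x[i,:len(s[0])] = LongTensor(s[0])
        res_y[i,:len(s[1])] = LongTensor(s[1])
    return (res_x, res_y), res_y  # model input is (source, target); target is also the label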

Just remove everything related to the label, i.e. y, targets, yb:

def get_predictions(learn, ds_type=DatasetType.Test):
    learn.model.eval()
    inputs, outputs = [],[]
    with torch.no_grad():
        for xb in progress_bar(learn.dl(ds_type)):
            out = learn.model(*xb)
            for x,z in zip(xb[0],out):
                inputs.append(learn.data.train_ds.x.reconstruct(x))
                outputs.append(learn.data.train_ds.x.reconstruct(z.argmax(1)))
    return inputs, outputs

I'm still getting an error that traces back to seq2seq_collate. Does the above code look fine?

We need a full stack trace to be able to do or understand anything.

AttributeError                            Traceback (most recent call last)
<ipython-input-90-87584a2e7cf8> in <module>
----> 1 inputs,outputs = get_predictions(learn)

<ipython-input-89-524a832f669b> in get_predictions(learn, ds_type)
      4     with torch.no_grad():
      5         for xb in progress_bar(learn.dl(ds_type)):
----> 6             out = learn.model(*xb)
      7             for x,z in zip(xb[0],xb[1],out):
      8                 x=x.cpu()

~/.local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    545             result = self._slow_forward(*input, **kwargs)
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():
    549             hook_result = hook(self, input, result)

<ipython-input-42-b3240bc106f1> in forward(self, inp, out)
     13     def forward(self, inp, out):
     14         mask_out = get_output_mask(out, self.pad_idx)
---> 15         enc,out = self.enc_emb(inp),self.dec_emb(out)
     16         enc = compose(self.encoder)(enc)
     17         out = compose(self.decoder)(out, enc, mask_out)

~/.local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    545             result = self._slow_forward(*input, **kwargs)
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():
    549             hook_result = hook(self, input, result)

<ipython-input-35-4d1444b405c5> in forward(self, inp)
      9 
     10     def forward(self, inp):
---> 11         pos = torch.arange(0, inp.size(1), device=inp.device).float()
     12         return self.drop(self.embed(inp) * math.sqrt(self.emb_sz) + self.pos_enc(pos))

AttributeError: 'list' object has no attribute 'size'

I suspect that when you iterate your test dataloader as xb and pass it to the model with out = learn.model(*xb), you are passing both the texts and the empty labels to it. Maybe try this:

def get_predictions(learn, ds_type=DatasetType.Test):
    learn.model.eval()
    inputs, outputs = [],[]
    with torch.no_grad():
        for xb,_ in progress_bar(learn.dl(ds_type)):
            out = learn.model(*xb)
            for x,z in zip(xb, out):
                inputs.append(learn.data.train_ds.x.reconstruct(x))
                outputs.append(learn.data.train_ds.y.reconstruct(z.argmax(1)))
    return inputs, outputs

Hi @stefan-ai, thanks for the response. I tried the code block you mentioned and I get the following error.

TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nayakp/.local/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nayakp/.local/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "<ipython-input-13-21b2d4be4331>", line 5, in seq2seq_collate
    max_len_y=max([len(s[1]) for s in samples])
  File "<ipython-input-13-21b2d4be4331>", line 5, in <listcomp>
    max_len_y=max([len(s[1]) for s in samples])
TypeError: object of type 'int' has no len() 

I feel all this has something to do with the way seq2seq_collate works. Not sure.

Where do you think I’m going wrong?
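For what it's worth, here is my guess at what a test-set sample looks like by the time it reaches the collate function: fastai fills test labels with a bare 0 placeholder instead of a list of token ids, which matches the TypeError above (the ids below are made up):

# (source token ids, empty label placeholder) - hypothetical values
sample = ([2, 45, 312, 3], 0)
len(sample[1])  # TypeError: object of type 'int' has no len()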

Yes, could be. I tried to run the notebook myself and also ran into problems. Could you share the code you use to add your test set?

Hi @stefan-ai, I use the add_test option and pass the dataframe, i.e. data.add_test(test_df). Thanks for helping me out.

@stefan-ai any luck?

Hi @prashanth

Yes and no. I don’t think it’s possible to run the get_predictions function directly on a test set. However, I found a workaround that allows you to get predictions on an unlabeled test set.

First of all, you need to use an explicit validation set during training so that you can drop it later on. You can simply shuffle your whole dataframe, add a new column that indicates if a row is in the validation set or not and then use that column to split your data.

df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows
df['is_valid'] = None
df.iloc[:int(len(df)*0.8),2] = False  # first 80% -> training (column 2 is 'is_valid')
df.iloc[int(len(df)*0.8):,2] = True   # last 20% -> validation
src = Seq2SeqTextList.from_df(df, path=path, cols='fr').split_from_df(col='is_valid').label_from_df(cols='en', label_cls=TextList)

Then you continue in the usual way: train the model on the training set and evaluate it on the validation set. After that, you create a new dataframe by concatenating the training dataframe and the test dataframe. The test dataframe needs to have the same columns (source language, target language and is_valid, which has to be set to True). The target language column can be empty, but you have to use an empty string instead of None. Finally, you use the new dataframe to create a new databunch using the vocab from the old databunch.

df_new = pd.concat([df[df.is_valid==False], df_test])  # training rows + test rows (test rows have is_valid=True)
df_new.reset_index(drop=True, inplace=True)
src_new = Seq2SeqTextList.from_df(df_new, path=path, cols='fr', vocab=data.vocab).split_from_df(col='is_valid').label_from_df(cols='en', label_cls=TextList)
data_new = src_new.databunch()
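As a quick sanity check (my addition; it assumes both databunches expose .vocab, as used above), you can verify that the vocabulary was actually reused, since otherwise the model's embeddings wouldn't line up with the new token ids:

# the new databunch must map tokens to the same ids the model was trained on
assert data_new.vocab.itos == data.vocab.itos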

Now you have to slightly modify the get_predictions function. Also, for some reason reconstruct didn't work for me anymore, so I rewrote that part (in a much slower way, however).

def get_predictions_test(learn, ds_type=DatasetType.Valid):
    learn.model.eval()
    inputs, outputs = [],[]
    with torch.no_grad():
        for xb,_ in progress_bar(data_new.dl(ds_type)):
            out = learn.model(xb)
            for x,z in zip(xb,out):
                # ids 1-8 are fastai's special tokens (padding, bos, eos, ...);
                # skip them when stitching the sentences back together
                inputs.append(' '.join([learn.data.x.vocab.itos[i] for i in x if not i in list(range(1,9))]))
                outputs.append(' '.join([learn.data.y.vocab.itos[i] for i in z.argmax(1) if not i in list(range(1,9))]))
    return inputs, outputs

And then you can get your predictions on the test set, which technically is the validation set of the new databunch, so ds_type=DatasetType.Valid stays.

inputs, outputs = get_predictions_test(learn)
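If you want to inspect or export the results, something like this should work (my addition; 'en_pred' is just an example column name, and it assumes the validation dataloader preserves row order, which it should since it is not shuffled):

import pandas as pd

# pair each source sentence with its predicted translation
preds_df = pd.DataFrame({'fr': inputs, 'en_pred': outputs})
preds_df.to_csv('test_predictions.csv', index=False)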

Certainly not the most elegant way, but it works 🙂