Recreate Translation notebook in v1 - Problem with pad_collate

Hey,

I am currently redoing the translation notebook in fastai v1. My goal is to train an image caption generator, and I want to start by training a seq2seq autoencoder. The model is supposed to learn to map a sentence onto itself.

I am having trouble creating the dataset in such a way that the function pad_collate accepts a batch.

The dataset was created the following way in the fastai v0.7 translate notebook:

def A(*a):
    """convert iterable object into numpy array"""
    return np.array(a[0]) if len(a)==1 else [np.array(o) for o in a]

class Seq2SeqDataset(Dataset):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getitem__(self, idx):
        return A(self.x[idx], self.y[idx])
    def __len__(self):
        return len(self.x)

Problem:

trn_dl = DataLoader(dataset=trn_ds, batch_size=bs, sampler=trn_sampler, collate_fn=pad_collate)

batch = next(iter(trn_dl))

fails with:

can’t convert np.ndarray of type numpy.object_. The only supported types are: double, float, float16, int64, int32, and uint8.

I have tried to break my problem down:

trn_ds is such a dataset and [trn_ds[0]] gives:

[[array([   9,  411, 1019,  700,  498,    1]),
  array([   9,  411, 1019,  700,  498,    1])]]

And pad_collate([trn_ds[0]], pad_idx=1, pad_first=False) gives:

(tensor([[   9,  411, 1019,  700,  498,    1]]),
 tensor([[   9,  411, 1019,  700,  498,    1]]))

So far so good.
However, if I try to pass more than one sentence to pad_collate, it fails with the same error message as above:

[trn_ds[0], trn_ds[2]] is:

[[array([   9,  411, 1019,  700,  498,    1]),
  array([   9,  411, 1019,  700,  498,    1])],
 [array([  51, 4386,   68,  193,   12,  107,   11,    9, 2768,    1]),
  array([  51, 4386,   68,  193,   12,  107,   11,    9, 2768,    1])]]

And pad_collate([trn_ds[0], trn_ds[2]], pad_idx=1, pad_first=False) gives the same error:

can’t convert np.ndarray of type numpy.object_. The only supported types are: double, float, float16, int64, int32, and uint8.

This line causes the error:

tensor(np.array([trn_ds[0][1], trn_ds[1][1]]))

Ok, it is kind of obvious that this can’t be converted to a tensor, since the two target arrays have different lengths. But how do I have to construct the input to pad_collate so that it accepts a batch?
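To illustrate why the conversion fails, here is a minimal NumPy-only sketch using the two sentences from above. It builds the object-dtype array explicitly (how np.array treats ragged input depends on the NumPy version) and then shows that padding to a common length gives a regular integer array. The names ragged and padded are just for this illustration.

```python
import numpy as np

a = np.array([9, 411, 1019, 700, 498, 1])
b = np.array([51, 4386, 68, 193, 12, 107, 11, 9, 2768, 1])

# Two arrays of different lengths cannot form a rectangular 2-D array.
# The best NumPy can do is a 1-D array of Python objects (dtype=object),
# which is the kind of array the error message complains about.
ragged = np.empty(2, dtype=object)
ragged[0], ragged[1] = a, b
print(ragged.dtype, ragged.shape)

# Padding every sequence to the longest length makes the batch rectangular.
pad_idx = 1
max_len = max(len(a), len(b))
padded = np.full((2, max_len), pad_idx, dtype=np.int64)
for i, seq in enumerate((a, b)):
    padded[i, :len(seq)] = seq
print(padded.dtype, padded.shape)
```

So the targets need the same padding treatment as the inputs before they can be stacked into one tensor.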

Thanks in advance!

F

Ok, I figured this out:

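Basically, the pad_collate function that fastai provides is not meant for seq2seq models: it pads only the inputs and stacks the targets as they are, which fails when the targets are sequences of different lengths. So I wrote my own version that pads inputs and targets separately and should work: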

def pad_collate_seq2seq(samples:BatchSamples, pad_idx:int=1, pad_first:bool=True, backwards:bool=False, transpose:bool=False) -> Tuple[LongTensor, LongTensor]:
    "Function that collects samples and pads inputs and targets separately. Flips token order if needed"
    samples = to_data(samples)
    max_len_inp = max([len(s[0]) for s in samples])
    max_len_out = max([len(s[1]) for s in samples])
    
    res_inp = torch.zeros(len(samples), max_len_inp).long() + pad_idx
    res_out = torch.zeros(len(samples), max_len_out).long() + pad_idx
    
    if backwards: pad_first = not pad_first
    for i,s in enumerate(samples):
        if pad_first:
            res_inp[i,-len(s[0]):] = LongTensor(s[0])
            res_out[i,-len(s[1]):] = LongTensor(s[1])
        else:
            res_inp[i,:len(s[0])] = LongTensor(s[0])
            res_out[i,:len(s[1])] = LongTensor(s[1])
    if backwards:
        # flip both tensors along the sequence dimension
        res_inp = res_inp.flip(1)
        res_out = res_out.flip(1)
    if transpose:
        res_inp.transpose_(0,1)
        res_out.transpose_(0,1)
    return res_inp, res_out
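For anyone who wants to try the idea without fastai installed, here is a rough, self-contained sketch in plain PyTorch. It is not the function above. PairDataset and pad_pair_collate are hypothetical names made up for this example, and the collate function only does right-padding (no pad_first, backwards or transpose). It mainly shows how to hand extra arguments such as pad_idx to the DataLoader with functools.partial.

```python
from functools import partial

import torch
from torch.utils.data import DataLoader, Dataset


class PairDataset(Dataset):
    """Hypothetical stand-in for Seq2SeqDataset: returns (input, target) token lists."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

    def __len__(self):
        return len(self.x)


def pad_pair_collate(samples, pad_idx=1):
    """Simplified collate: right-pad inputs and targets separately to their batch maximum."""
    def pad(seqs):
        out = torch.full((len(seqs), max(len(s) for s in seqs)), pad_idx, dtype=torch.long)
        for i, s in enumerate(seqs):
            out[i, :len(s)] = torch.tensor(s, dtype=torch.long)
        return out
    return pad([s[0] for s in samples]), pad([s[1] for s in samples])


# The two example sentences from the question, used as both input and target.
sents = [[9, 411, 1019, 700, 498, 1],
         [51, 4386, 68, 193, 12, 107, 11, 9, 2768, 1]]
ds = PairDataset(sents, sents)

# partial fixes pad_idx so the DataLoader can call the function with just the samples.
dl = DataLoader(ds, batch_size=2, shuffle=False,
                collate_fn=partial(pad_pair_collate, pad_idx=1))
xb, yb = next(iter(dl))
print(xb.shape, yb.shape)
```

The same partial(...) pattern should work for passing pad_first, backwards or transpose to pad_collate_seq2seq.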

Hope this helps someone :slight_smile: