I’m working on a binary classification problem. My dataset lives in an external database, so I wrote my own Dataset subclass:
class MyDS(Dataset):
    ...
    def __getitem__(self, index):
        # the external database is queried here
        the_x = torch.tensor(...)
        the_y = bool(...)
        return torch.unsqueeze(the_x, 0), torch.unsqueeze(torch.tensor(the_y).float(), 0)
Would it be wrong to let __getitem__() return a whole minibatch? That would reduce the overhead of the external queries, and PyTorch forces me to unsqueeze() anyhow, adding the sample-within-minibatch dimension.
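For what it's worth, here is a minimal sketch of the idea, with a made-up BatchDS standing in for my class and random tensors standing in for the database query. Passing batch_size=None to DataLoader disables automatic batching, so each item (already a full minibatch) passes through without an extra outermost dimension:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class BatchDS(Dataset):
    """Hypothetical sketch: each __getitem__ call returns a whole minibatch."""
    def __init__(self, n_batches=4, batch_size=8):
        self.n_batches = n_batches
        self.batch_size = batch_size

    def __len__(self):
        # length counts minibatches, not individual samples
        return self.n_batches

    def __getitem__(self, index):
        # one simulated "database query" yields a whole minibatch
        x = torch.randn(self.batch_size, 1, 10)                 # (B, C, features)
        y = torch.randint(0, 2, (self.batch_size, 1)).float()   # (B, 1) labels
        return x, y

# batch_size=None turns off automatic batching: items are not collated,
# so no singleton dimension is prepended
loader = DataLoader(BatchDS(), batch_size=None)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([8, 1, 10]) torch.Size([8, 1])
```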
To get this working in fastai, I use a batch size of one in DataBunch.create(..., bs=1). The Learner still adds a singleton outermost dimension to x and y, which I remove in the on_batch_begin callback.
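At the tensor level, the callback's job amounts to the following sketch (shapes are illustrative: B=8 samples, 1 channel, 10 features):

```python
import torch

# with bs=1, the framework hands the callback x of shape (1, B, C, F)
# and y of shape (1, B, 1); squeezing dim 0 recovers the real minibatch
x = torch.randn(1, 8, 1, 10)
y = torch.randint(0, 2, (1, 8, 1)).float()

x, y = x.squeeze(0), y.squeeze(0)
print(x.shape, y.shape)  # torch.Size([8, 1, 10]) torch.Size([8, 1])
```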