Weird behavior when iterating over a DataLoader

Code:

print((
    f'inf_df items: {len(inf_df)} | inf_dl items: {inf_dl.n}\n'
    f'batch size: {inf_dl.bs}'))
# inf_df items: 11612 | inf_dl items: 11612
# batch size: 64

c = 0
for b in inf_dl:
    c += b[0].shape[0]
print(c)
# c = 11584 (expected 11612)

Seems like it's dropping the last incomplete batch, but I'm not sure why?
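
For what it's worth, the numbers line up exactly with a dropped partial batch:

11612 // 64    # 181 full batches
181 * 64       # 11584, which matches c above
11612 - 11584  # 28 items left in the dropped partial batch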

How are you making the DataLoader? IIRC drop_last is a parameter. Try setting it to False? (Or something similar)
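
In case it helps, here's a minimal, self-contained illustration of what drop_last does, using a plain PyTorch DataLoader (the item count below just mirrors yours):

from torch.utils.data import DataLoader

items = list(range(11612))
# with drop_last=True the trailing partial batch of 28 items is skipped
print(sum(len(b) for b in DataLoader(items, batch_size=64, drop_last=True)))   # 11584
print(sum(len(b) for b in DataLoader(items, batch_size=64, drop_last=False)))  # 11612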

from operator import attrgetter
from functools import partial
from fastai.text.all import *  # Tokenizer, Numericalize, Datasets, SortedDL, pad_input_chunk, Transform

tfms = [
    attrgetter('text'),
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, mark_fields=include_fld_tok),
    Numericalize(vocab=lm_vocab)
]

test_ds = Datasets(items=inf_df, tfms=[tfms], dl_type=SortedDL)
test_dls = test_ds.dataloaders(bs=64, seq_len=bptt,
                               before_batch=partial(pad_input_chunk, pad_first=backwards))

# reverse token order when running a backwards LM
if backwards: test_dls.tfms.add(Transform(lambda nums: nums.flip(0)))

# use the test_dls.train dataloader for batch inference!
inf_dl = test_dls.train

I think it has something to do with the library dropping the last batch of the training set by default. If so, how do I turn off that behavior?

Yes. By default it does that to the training set only, and the behavior was the same in v1. I'd explore what parameters can be passed into .dataloaders() (I'm not in front of a computer).

(Something like drop_last is what you’re looking for)
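
If .dataloaders() forwards extra keyword arguments to the underlying DataLoaders (I believe it does, but treat drop_last here as an assumption rather than a documented parameter of .dataloaders()), something like this might work:

# sketch, assuming drop_last is forwarded to the train DataLoader
test_dls = test_ds.dataloaders(bs=64, seq_len=bptt, drop_last=False,
                               before_batch=partial(pad_input_chunk, pad_first=backwards))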


Lo and behold … setting shuffle_train=False in the call to .dataloaders(...) does the trick!
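
For completeness, a sketch of the working call. My understanding is that the train DataLoader's drop_last defaults to its shuffle setting, so turning off shuffling for the training set also keeps the last partial batch (worth verifying against your fastai version):

test_dls = test_ds.dataloaders(bs=64, seq_len=bptt, shuffle_train=False,
                               before_batch=partial(pad_input_chunk, pad_first=backwards))
inf_dl = test_dls.train

c = 0
for b in inf_dl:
    c += b[0].shape[0]
print(c)
# 11612, all items, including the last partial batch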
