Weird behavior when iterating over a DataLoader

Code:

print((
    f'inf_df items: {len(inf_df)} | inf_dl items: {inf_dl.n}\n'
    f'batch size: {inf_dl.bs}'))
# inf_df items: 11612 | inf_dl items: 11612
# batch size: 64

c = 0
for b in inf_dl:
    c += b[0].shape[0]
print(c)
# c = 11584 (expected 11612)

Seems like it's dropping the last incomplete batch, but I'm not sure why?
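
For what it's worth, the numbers line up exactly with a dropped partial batch:

11612 // 64    # 181 full batches
181 * 64       # 11584, which matches c above
11612 - 11584  # 28 items left in the dropped partial batch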

How are you making the DataLoader? IIRC drop_last is a parameter. Try setting it to False? (Or something similar)
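
In case it helps, here's a minimal, self-contained illustration of what drop_last does, using a plain PyTorch DataLoader (the item count below just mirrors yours):

from torch.utils.data import DataLoader

items = list(range(11612))
# with drop_last=True the trailing partial batch of 28 items is skipped
print(sum(len(b) for b in DataLoader(items, batch_size=64, drop_last=True)))   # 11584
print(sum(len(b) for b in DataLoader(items, batch_size=64, drop_last=False)))  # 11612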

from operator import attrgetter
from functools import partial
from fastai.text.all import *  # Tokenizer, Numericalize, Datasets, SortedDL, pad_input_chunk, Transform

tfms = [
    attrgetter('text'),
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, mark_fields=include_fld_tok),
    Numericalize(vocab=lm_vocab)
]

test_ds = Datasets(items=inf_df, tfms=[tfms], dl_type=SortedDL)
test_dls = test_ds.dataloaders(bs=64, seq_len=bptt,
                               before_batch=partial(pad_input_chunk, pad_first=backwards))

# reverse token order when running a backwards LM
if backwards: test_dls.tfms.add(Transform(lambda nums: nums.flip(0)))

# use the test_dls.train dataloader for batch inference!
inf_dl = test_dls.train

I think it has something to do with the library dropping the last batch of the training set by default. If so, how do I turn off that behavior?

Yes. By default it does that to the training set only, and the behavior was the same in v1. I'd explore what parameters can be passed into .dataloaders() (I'm not in front of a computer).

(Something like drop_last is what you’re looking for)
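
If .dataloaders() forwards extra keyword arguments to the underlying DataLoaders (I believe it does, but treat drop_last here as an assumption rather than a documented parameter of .dataloaders()), something like this might work:

# sketch, assuming drop_last is forwarded to the train DataLoader
test_dls = test_ds.dataloaders(bs=64, seq_len=bptt, drop_last=False,
                               before_batch=partial(pad_input_chunk, pad_first=backwards))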


Lo and behold … setting shuffle_train=False in the call to .dataloaders(...) does the trick!
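
For completeness, a sketch of the working call. My understanding is that the train DataLoader's drop_last defaults to its shuffle setting, so turning off shuffling for the training set also keeps the last partial batch (worth verifying against your fastai version):

test_dls = test_ds.dataloaders(bs=64, seq_len=bptt, shuffle_train=False,
                               before_batch=partial(pad_input_chunk, pad_first=backwards))
inf_dl = test_dls.train

c = 0
for b in inf_dl:
    c += b[0].shape[0]
print(c)
# 11612, all items, including the last partial batch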
