wgpubs (WG) | February 21, 2020, 1:21am | #1
Code:
print((
f'inf_df items: {len(inf_df)} | test_dl items: {inf_dl.n}\n'
f'batch size: {inf_dl.bs}'))
# inf_df items: 11612 | test_dl items: 11612
# batch size: 64
c = 0
for b in inf_dl:
    c += b[0].shape[0]
print(c)
# c = 11584 (expected 11612)
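The shortfall matches exactly what dropping the final partial batch would produce: 11612 // 64 = 181 full batches of 64. A quick plain-Python sketch of the arithmetic (no fastai needed; the item count and batch size come from the output above):

```python
items = 11612  # len(inf_df) from the printout above
bs = 64

# With drop_last=True, only full batches are yielded:
full_batches = items // bs              # 181
seen_with_drop_last = full_batches * bs

# With drop_last=False, the trailing partial batch is kept too:
remainder = items % bs                  # 28 leftover items
seen_without_drop_last = seen_with_drop_last + remainder

print(seen_with_drop_last)     # 11584, the count observed in the loop
print(seen_without_drop_last)  # 11612, the full dataset
```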
It seems like it's dropping the last incomplete batch, but I'm not sure why.
muellerzr (Zachary Mueller) | February 21, 2020, 1:27am | #2
How are you making the DataLoader? IIRC drop_last is a parameter. Try setting it to False? (Or something similar)
wgpubs (WG) | February 21, 2020, 1:29am | #3
tfms = [
    attrgetter('text'),
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, mark_fields=include_fld_tok),
    Numericalize(vocab=lm_vocab)
]
test_ds = Datasets(items=inf_df, tfms=[tfms], dl_type=SortedDL)
test_dls = test_ds.dataloaders(bs=64, seq_len=bptt,
                               before_batch=partial(pad_input_chunk, pad_first=backwards))
if backwards: test_dls.tfms.add(Transform(lambda nums: nums.flip(0)))
# use the test_dls.train dataloader for batch inference!
inf_dl = test_dls.train
I think it has something to do with the library dropping the last batch of the training set by default. If so, how do I turn that behavior off?
muellerzr (Zachary Mueller) | February 21, 2020, 1:30am | #4
Yes. It does that by default, and only for the training set. This behavior was the same in v1. I'd explore what parameters can be passed into .dataloaders() (not in front of a computer).
Something like drop_last is what you're looking for.
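To make the flag concrete, here is a torch-free sketch of what a loader does with drop_last; the `batches` helper below is hypothetical, illustrating the behavior rather than fastai's actual implementation:

```python
def batches(items, bs, drop_last=False):
    """Yield consecutive batches of size bs; optionally drop a trailing partial batch."""
    for i in range(0, len(items), bs):
        b = items[i:i + bs]
        if drop_last and len(b) < bs:
            break  # skip the incomplete final batch, as a training loader does by default
        yield b

data = list(range(10))
print([len(b) for b in batches(data, 4)])                  # [4, 4, 2]
print([len(b) for b in batches(data, 4, drop_last=True)])  # [4, 4] -- last 2 items dropped
```

Dropping the partial batch matters for training (it keeps batch statistics uniform), but for inference you want every item, which is why the default bites here.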
wgpubs (WG) | February 21, 2020, 1:33am | #5
Lo and behold … setting shuffle_train=False in the call to .dataloaders(...) does the trick!