I am experiencing an issue with prediction in fastai 1.0.52.
I have trained a text.text_classifier_learner. When I use get_preds on the train data, I get a tensor smaller than expected (fewer rows than in the train dataset).
    # from fastai import text
    pred_tens = learn_cls.get_preds(ds_type=text.DatasetType.Train)

    In : pred_tens.shape
    Out: torch.Size([1920, 29])
So the output is of size 1920. However, the expected size is 1952, as may be seen by looking at the data attached to my learner.
    print(learn_cls.data)

    TextClasDataBunch;

    Train: LabelList (1952 items)
    x: TextList
    xxbos xxmaj hey there, some text
    y: MultiCategoryList
    CAT1,CAT2,...
    Path: E:\myproject\data\full_samples\ulmfit;

    Valid: LabelList (487 items)
    x: TextList
    xxbos hello . some text
    y: MultiCategoryList
    CAT1,CAT2,...
    Path: E:\myproject\data\full_samples\ulmfit;

    Test: None
Any idea regarding how I can troubleshoot this issue?
(I do not encounter this issue with the validation dataset. I have also added a test dataset at some point, and did not encounter any issue either)
If it helps, here is more detail on how I constructed the classification databunch:
    from fastai import text

    # ...
    # At some point, I get a databunch data_lm, which contains the vocab used in the
    # encoder I will transfer in my model

    # Tokenize, then numericalize, the raw text column
    processor = [text.TokenizeProcessor(tokenizer=tokenizer),
                 text.NumericalizeProcessor()]

    data_cls = text.TextList.from_df(df=df,
                                     cols="textcol",
                                     processor=processor,
                                     vocab=data_lm.vocab,
                                     path="E:/myproject/data/full_samples/ulmfit")
    data_cls = data_cls.split_by_rand_pct(valid_pct=.2)
    data_cls = data_cls.label_from_df(cols="labelcol")
    data_cls = data_cls.databunch(bs=64, num_workers=0)
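One thing I noticed, in case it is relevant: with bs=64, the 1920 rows I get back are exactly 30 full batches, while 1952 items would need 30 full batches plus a partial batch of 32. Below is a small arithmetic sketch of that observation. It assumes (I have not confirmed this) that the training dataloader drops its last incomplete batch; `rows_seen` is a hypothetical helper written for this check, not a fastai function.

```python
def rows_seen(n_items: int, batch_size: int, drop_last: bool) -> int:
    """Number of rows a loader yields in one pass over n_items.

    Hypothetical helper: models a loader that either keeps or drops
    the final incomplete batch.
    """
    full_batches, remainder = divmod(n_items, batch_size)
    if drop_last or remainder == 0:
        # Only complete batches are yielded
        return full_batches * batch_size
    # The partial batch is kept, so every item is seen
    return n_items


train_items, valid_items, bs = 1952, 487, 64

train_rows_dropping = rows_seen(train_items, bs, drop_last=True)
train_rows_keeping = rows_seen(train_items, bs, drop_last=False)
valid_rows_keeping = rows_seen(valid_items, bs, drop_last=False)

print(train_rows_dropping, train_rows_keeping, valid_rows_keeping)
```

If that assumption holds, it would also be consistent with the validation set (487 items) coming back complete, since a loader that keeps its last batch returns every row.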
Also, something interesting happens when I ask for ordered predictions:
    pred_tens = learn_cls.get_preds(ds_type=text.DatasetType.Train, ordered=True)

    # long error log, which ends with
    ~\AppData\Local\Continuum\anaconda3\envs\caap\lib\site-packages\fastai\text\learner.py in <listcomp>(.0)
         89         sampler = [i for i in self.dl(ds_type).sampler]
         90         reverse_sampler = np.argsort(sampler)
    ---> 91         preds = [p[reverse_sampler] for p in preds]
         92         return(preds)
         93

    IndexError: index 1921 is out of bounds for dimension 0 with size 1920
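To make the failure mode concrete, here is a minimal pure-Python sketch of what I understand the traceback to be doing: it builds an inverse permutation over all 1952 sampler indices, then uses it to index a container that only holds 1920 rows. The names `reorder`, `n_sampled` and `n_preds` are mine, and plain lists stand in for the NumPy array and torch tensor, so the exact error message differs from the one above.

```python
import random

n_sampled = 1952  # indices yielded by the sampler (full train set)
n_preds = 1920    # rows actually present in the predictions

random.seed(0)
# Stand-in for the sampler: a shuffled ordering of all item indices
sampler = random.sample(range(n_sampled), n_sampled)

# Pure-Python equivalent of np.argsort(sampler)
reverse_sampler = sorted(range(n_sampled), key=sampler.__getitem__)

# Stand-in for the prediction tensor, which is 32 rows short
preds = [0.0] * n_preds


def reorder(values, order):
    """Pick values in the given order, as the list comprehension on line 91 does."""
    return [values[i] for i in order]


try:
    reorder(preds, reverse_sampler)
    error = None
except IndexError as exc:
    # Any index >= 1920 in reverse_sampler falls outside preds
    error = exc

print(type(error).__name__, len(reverse_sampler), len(preds))
```

So the IndexError looks like a symptom of the same size mismatch rather than a separate problem.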
Using Python's debugger, I saw that reverse_sampler is, as expected, of shape