Hi there,
I am experiencing an issue with prediction in fastai 1.0.52.
I have trained a text.text_classifier_learner. When I use get_preds on the train data, I get a tensor smaller than expected (fewer rows than in the train dataset).
# from fastai import text
pred_tens = learn_cls.get_preds(ds_type=text.DatasetType.Train)
In [17]: pred_tens[0].shape
Out[17]: torch.Size([1920, 29])
So the output has 1920 rows. However, I expected 1952 rows, which is the number of train items shown when I print the data attached to my learner.
print(learn_cls.data)
TextClasDataBunch;
Train: LabelList (1952 items)
x: TextList
xxbos xxmaj hey there, some text
y: MultiCategoryList
CAT1,CAT2,...
Path: E:\myproject\data\full_samples\ulmfit;
Valid: LabelList (487 items)
x: TextList
xxbos hello . some text
y: MultiCategoryList
CAT1,CAT2,...
Path: E:\myproject\data\full_samples\ulmfit;
Test: None
Any idea regarding how I can troubleshoot this issue?
(I do not encounter this issue with the validation dataset. I have also added a test dataset at some point, and did not encounter any issue either)
(Appendix)
If it helps, here is more detail on how I constructed the classification databunch:
from fastai import text
#...
# At some point, I get a databunch data_lm, which contains the vocab used in the
# encoder I will transfer in my model
processor = [text.TokenizeProcessor(tokenizer=tokenizer),
text.NumericalizeProcessor()]
data_cls = text.TextList.from_df(df=df, cols="textcol", processor=processor,
vocab=data_lm.vocab, path="E:/myproject/data/full_samples/ulmfit")
data_cls = data_cls.split_by_rand_pct(valid_pct=.2)
data_cls = data_cls.label_from_df(cols="labelcol")
data_cls = data_cls.databunch(bs=64, num_workers=0)
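One thing that may be worth checking with this batch size: 1952 is not a multiple of 64, and the number of rows that fit into full batches is exactly the 1920 seen above. The sketch below only does that arithmetic with the sizes from this post. Whether the train dataloader actually drops its last incomplete batch is an assumption here, not something confirmed from the fastai source.

```python
# Sizes taken from the post; the "last incomplete batch is dropped"
# explanation is an assumption, not a confirmed fastai behaviour.
n_train = 1952      # items in the train LabelList
batch_size = 64     # bs passed to .databunch()

# How many full batches fit, and how many items are left over.
full_batches, leftover = divmod(n_train, batch_size)
rows_in_full_batches = full_batches * batch_size

print(full_batches, leftover, rows_in_full_batches)
```

If the leftover count matches the number of missing prediction rows, the batching of the train dataloader is a plausible place to look first.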
Also, something interesting happens when I ask for ordered predictions (ordered=True), i.e.
pred_tens = learn_cls.get_preds(ds_type=text.DatasetType.Train, ordered=True)
# long error log, which ends with
~\AppData\Local\Continuum\anaconda3\envs\caap\lib\site-packages\fastai\text\learner.py in <listcomp>(.0)
89 sampler = [i for i in self.dl(ds_type).sampler]
90 reverse_sampler = np.argsort(sampler)
---> 91 preds = [p[reverse_sampler] for p in preds]
92 return(preds)
93
IndexError: index 1921 is out of bounds for dimension 0 with size 1920
Using Python's debugger, I saw that reverse_sampler has shape (1952,), as expected.
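The mismatch behind the IndexError can be reproduced without fastai. This is a minimal pure-Python sketch, assuming the sampler yields a permutation of all 1952 item indices while only 1920 prediction rows come back. The names n_items, n_preds and reorder are hypothetical, and plain lists stand in for np.argsort and tensor indexing.

```python
import random

n_items, n_preds = 1952, 1920   # sizes from the post

# Stand-in for the train sampler: assumed to be a permutation of all indices.
random.seed(0)
sampler = list(range(n_items))
random.shuffle(sampler)

# Pure-Python equivalent of np.argsort(sampler).
reverse_sampler = sorted(range(n_items), key=sampler.__getitem__)

# Stand-in for the prediction tensor: one row per prediction returned.
preds = [[0.0] * 29 for _ in range(n_preds)]


def reorder(rows, order):
    """Return rows rearranged by order, similar to rows[order] on a tensor."""
    return [rows[i] for i in order]


try:
    reorder(preds, reverse_sampler)
    failed = False
except IndexError:
    # Some index in reverse_sampler is >= len(preds), so the lookup fails.
    failed = True

print(len(reverse_sampler), len(preds), failed)
```

Since reverse_sampler covers every index up to 1951 and preds has only 1920 rows, the reordering has to fail on some index, which is consistent with the traceback above.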