NLP Custom dataset - debugging "given sequence has an invalid size"

Chris_Palmer · November 26, 2017, 6:40am

I have created a custom dataset of Reddit submissions. While building the language model data, one of the files caused the error shown below: “given sequence has an invalid size of dimension 2: 0”. Although I have checked that the text in each of my files is at least 35 characters long, it's possible that a file has a problem such as containing just one word, for example a URL.

I base this idea on finding the following in PyTorch's Tensor.cpp on GitHub…

THPUtils_assert(length > 0, "given sequence has an invalid size of "
          "dimension %" PRId64 ": %" PRId64, (int64_t)sizes.size(), (int64_t)length);

I will check my dataset, but now that I have hit this error I would love to be able to debug it. Can anyone tell me how? Is there a way to debug a traceback in a Jupyter notebook?

And can anyone, perhaps @jeremy, confirm what would cause this error?
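On the debugging question: in a notebook, running the IPython magic `%debug` in a new cell right after the exception opens a post-mortem pdb session on the last traceback, and `%pdb on` makes that automatic. From there `u` and `d` move between frames and `p text` prints a local variable. The sketch below shows the same idea non-interactively by walking an exception's traceback to the innermost frame and returning its locals. `innermost_frame_locals` and `build_loader` are hypothetical names for illustration, not part of fastai or torchtext.

```python
def innermost_frame_locals(exc):
    """Return (function_name, locals) for the frame where exc was raised."""
    tb = exc.__traceback__
    while tb.tb_next is not None:  # walk down to the deepest traceback entry
        tb = tb.tb_next
    frame = tb.tb_frame
    return frame.f_code.co_name, dict(frame.f_locals)


def build_loader(tokens):
    # Hypothetical stand-in for the failing call; not the fastai code.
    if len(tokens) == 0:
        raise RuntimeError("empty token list")
    return tokens


try:
    build_loader([])
except RuntimeError as err:
    name, local_vars = innermost_frame_locals(err)
    print(name, local_vars)
    # Interactive alternative: import pdb; pdb.post_mortem(err.__traceback__)
```

Applied to the trace below, the frames worth inspecting would be the `LanguageModelLoader.__init__` one (locals `ds` and `text`) and the `numericalize` one (local `arr`).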

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-0e98c1d5fc20> in <module>()
      1 FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
----> 2 md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

~/fastai/courses/dl1/fastai/nlp.py in from_text_files(cls, path, field, train, validation, test, bs, bptt, **kwargs)
    241                                     path, text_field=field, train=train, validation=validation, test=test)
    242 
--> 243         return cls(path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs)
    244 
    245 

~/fastai/courses/dl1/fastai/nlp.py in __init__(self, path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs)
    220 
    221         self.trn_dl, self.val_dl, self.test_dl = [ LanguageModelLoader(ds, bs, bptt) 
--> 222                                                     for ds in (self.trn_ds, self.val_ds, self.test_ds) ]
    223 
    224     def get_model(self, opt_fn, emb_sz, n_hid, n_layers, **kwargs):

~/fastai/courses/dl1/fastai/nlp.py in <listcomp>(.0)
    220 
    221         self.trn_dl, self.val_dl, self.test_dl = [ LanguageModelLoader(ds, bs, bptt) 
--> 222                                                     for ds in (self.trn_ds, self.val_ds, self.test_ds) ]
    223 
    224     def get_model(self, opt_fn, emb_sz, n_hid, n_layers, **kwargs):

~/fastai/courses/dl1/fastai/nlp.py in __init__(self, ds, bs, bptt)
    132         text = sum([o.text for o in ds], [])
    133         fld = ds.fields['text']
--> 134         nums = fld.numericalize([text])
    135         self.data = self.batchify(nums)
    136         self.i,self.iter = 0,0

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torchtext/data/field.py in numericalize(self, arr, device, train)
    296                 arr = self.postprocessing(arr, None, train)
    297 
--> 298         arr = self.tensor_type(arr)
    299         if self.sequential and not self.batch_first:
    300             arr.t_()

RuntimeError: given sequence has an invalid size of dimension 2: 0
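One plausible reading of this trace, not confirmed against the library source: the failing call is `fld.numericalize([text])`, where `text` is the concatenation of the tokens of every example in one split. If that concatenation is empty for any of the three splits, the tensor constructor would receive `[[]]`, whose second dimension has length 0, which would match "dimension 2: 0". Under that assumption the thing to look for is a split that yields no tokens at all, rather than a single short file. Here is a minimal sketch for checking this on disk, using a whitespace split as a stand-in for the real tokenizer. `count_tokens_per_file` and `summarize_split` are hypothetical helpers.

```python
from pathlib import Path


def count_tokens_per_file(split_dir):
    """Map each file under split_dir to its whitespace-token count."""
    counts = {}
    for path in sorted(Path(split_dir).rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            counts[str(path)] = len(text.split())
    return counts


def summarize_split(split_dir, min_tokens=2):
    """Return (total_tokens, files_with_fewer_than_min_tokens) for one split."""
    counts = count_tokens_per_file(split_dir)
    short = [p for p, n in counts.items() if n < min_tokens]
    return sum(counts.values()), short
```

Running `summarize_split` on the training and validation directories (however `PATH`, `TRN_PATH` and `VAL_PATH` combine in your setup) gives a total token count and a list of suspicious files for each. A total of 0 would correspond to the empty-split case described above, and the short-file list would catch one-word files such as a bare URL.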