RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 1 and 0 in dimension 1

(Dien Hoa TRUONG) #1

I am trying to create data with PointsItemList for a hand-tracking problem. The code seems correct, since data.show_batch shows exactly what I want:

data = (PointsItemList.from_df(df, path=path, folder='train', cols=['frame'], suffix='.png')
        .random_split_by_pct()
        .label_from_df(cols=['loc'])
        .transform(get_transforms(), tfm_y=True, size=(120,160))
        .databunch().normalize()
       )

However, learn.lr_find() or learn.fit() randomly fails with the error below:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-4dfb24161c57> in <module>()
----> 1 learn.fit_one_cycle(1)

~/fastai/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     21                                         pct_start=pct_start, **kwargs))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23 
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/fastai/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    164         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    165         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 166             callbacks=self.callbacks+callbacks)
    167 
    168     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/fastai/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/fastai/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     80             cb_handler.on_epoch_begin()
     81 
---> 82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
     84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)

~/anaconda3/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
     63         self.update(0)
     64         try:
---> 65             for i,o in enumerate(self._gen):
     66                 yield o
     67                 if self.auto_update: self.update(i+1)

~/fastai/fastai/basic_data.py in __iter__(self)
     68     def __iter__(self):
     69         "Process and returns items from `DataLoader`."
---> 70         for b in self.dl:
     71             #y = b[1][0] if is_listy(b[1]) else b[1] # XXX: Why is this line here?
     72             yield self.proc_batch(b)

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    337         if self.rcvd_idx in self.reorder_dict:
    338             batch = self.reorder_dict.pop(self.rcvd_idx)
--> 339             return self._process_next_batch(batch)
    340 
    341         if self.batches_outstanding == 0:

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    372         self._put_indices()
    373         if isinstance(batch, ExceptionWrapper):
--> 374             raise batch.exc_type(batch.exc_msg)
    375         return batch
    376 

RuntimeError: Traceback (most recent call last):
  File "/home/hoatruong/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 114, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/hoatruong/fastai/fastai/torch_core.py", line 105, in data_collate
    return torch.utils.data.dataloader.default_collate(to_data(batch))
  File "/home/hoatruong/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 198, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/hoatruong/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 198, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/hoatruong/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 175, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 1 and 0 in dimension 1 at /opt/conda/conda-bld/pytorch-nightly_1538905146867/work/aten/src/TH/generic/THTensorMoreMath.cpp:1317

I was thinking that the shapes of the elements in the data were inconsistent, but they are not.

I found that the error might come from the data loader when it grabs a new batch. But it happens quite randomly: the point at which I get the error is not the same each time.

I hope someone can help me clarify this problem. Thank you so much in advance.


(Dien Hoa TRUONG) #2

I found that the torch.Size of the labels is sometimes [1, 2] and sometimes [0, 2]. I don’t think it comes from my data, because when I removed the offending rows the problem appeared again when I created a new data bunch.


#3

Hi there,
The problem comes from the fact that sometimes, your data augmentation will throw the point out of the image. There are three ways of dealing with it:

  • lowering your data augmentation params
  • writing a collate function that will pad your points when they’re empty to make them the right size
  • using ImagePoints with remove_out=False to keep the points even if they’re out (and making the model guess from the part of the picture it can see).
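A collate function for the second option might look like this. This is only a sketch, not fastai’s actual implementation: the name pad_points_collate and the (-1, -1) pad value are my own choices, and your loss function would have to ignore the padded point.

```python
import torch

# Hypothetical pad value marking "every point fell outside the image";
# the loss function must be written to ignore it.
PAD_POINT = torch.full((1, 2), -1.0)

def pad_points_collate(batch):
    """Collate (image, points) pairs, padding empty targets.

    Each target normally has shape [1, 2] (one point, y/x), but after
    aggressive augmentation it can be [0, 2], which breaks torch.stack.
    """
    xs, ys = zip(*batch)
    ys = [y if y.size(0) > 0 else PAD_POINT for y in ys]
    return torch.stack(xs, 0), torch.stack(ys, 0)
```

You could then pass it through to the DataLoader, e.g. via the collate_fn argument when you build your databunch.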

(Dien Hoa TRUONG) #4

Thank you so much for your help. I made it work by lowering the data augmentation params; I will try to implement your other suggestions too.

Happy holidays!!!


#5

I am a beginner fastai user, so pardon my trivial question. I am trying to use unet_learner to create image segmentations of different sewer conditions, as shown in my gist here.

I have the same error, but I am not sure if my case is related; the full error is:

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 375 and 376 in dimension 2 at /opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/generic/THCTensorMath.cu:83

I am confused about which tensor sizes it is referring to: is it between the mask and the original image, or the plotted segmentations? Better yet, how would I go about troubleshooting this on my own?

Hope this makes sense.


(RobG) #6

This is probably because your image sizes are odd, i.e. 375×500. You need to make sure they are even, or better still a multiple of 32. Unfortunately, unet_learner doesn’t throw a useful warning.
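The constraint comes from the U-Net encoder halving the spatial dimensions several times (five halvings for a typical resnet backbone, and 2^5 = 32); odd sizes pick up off-by-one mismatches when the decoder upsamples and concatenates the skip connections. A small helper to round sizes up (my own sketch, not a fastai function):

```python
def round_up(dim, base=32):
    # Round an image dimension up to the nearest multiple of `base`.
    return ((dim + base - 1) // base) * base

# 375x500 becomes 384x512, which survives five halvings cleanly.
size = (round_up(375), round_up(500))
```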


#7

Hi

I changed size=128, but now I am running into another problem: I keep seeing RuntimeError: CUDA error: device-side assert triggered. Restarting the kernel a few times did not help. Could it be a problem with the metrics?


(RobG) #8

Possibly size needs to be a tuple? You’re resizing your 750×1000 images to 375×500. I’d just set it to 384×512 and see if that works. I’m a lazy programmer: change size = src_size//2 to size = src_size*.512.


#9

Yeah, I get the same error when I change to size=128 or size = src_size*.512 as you proposed.

What else could be the reason for the CUDA error? I’m not sure how to troubleshoot this :frowning:


(Evan) #10

I am also getting a similar error. From reading through the forums, I saw some suggestion that it could be due to the mask tensor containing 0 and 255 values instead of 0 and 1 (which is the case in my example): Using fastai for Segmentation, receiving a CUDA device-side assertion error

Although the suggestion to set div=True doesn’t help me much, as I’m not sure where to change it. The advice here didn’t work: ImageMask.data created by open_mask returns all zeros as SegmentationItemList does not have attribute set_attr()
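To confirm whether a mask is affected, you can inspect its raw values before training. This is a sketch using numpy; check_mask_labels is my own helper name, not part of fastai:

```python
import numpy as np

def check_mask_labels(mask, n_classes):
    """Return the label values that would trip the GPU assert.

    The cross-entropy loss asserts 0 <= target < n_classes on the GPU;
    a mask holding 255 with only two classes triggers
    'device-side assert triggered'.
    """
    vals = np.unique(np.asarray(mask))
    return [int(v) for v in vals if v < 0 or v >= n_classes]

# A binary mask saved as {0, 255} instead of {0, 1}:
bad = check_mask_labels(np.array([[0, 255], [0, 0]]), n_classes=2)
```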


#11

The current method is to subclass SegmentationLabelList and override its open function (to return open_mask(fn, div=True)). Then pass your new class via label_cls when you label your data using the data block API.
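Concretely, that could look like the following sketch against the fastai v1 data block API. The names SegLabelListBinary, path_img, get_y_fn and codes are placeholders for your own class, image folder, labelling function and class list; it is not verified end-to-end:

```python
from fastai.vision import *

class SegLabelListBinary(SegmentationLabelList):
    # Open masks with div=True so {0, 255} values become {0, 1}
    def open(self, fn):
        return open_mask(fn, div=True)

# Pass the subclass via label_cls at labelling time:
src = (SegmentationItemList.from_folder(path_img)
       .random_split_by_pct()
       .label_from_func(get_y_fn, classes=codes,
                        label_cls=SegLabelListBinary))
```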


(Evan) #12

Thanks. Sorry to be asking something that is possibly obvious, but where exactly do you pass label_cls? I can’t find any examples. I’ve read through the documentation and I’m not sure if what I’m doing is working: I can get the data to show batches of images, but any time I try the learning rate finder I get the same CUDA assertion error, so I’m assuming I’m still doing something wrong. I’m passing it in the SegmentationItemList, e.g.

src = (SegmentationItemList.from_folder(PATH_PNG, label_cls = SegmentationLabelList2)…


#13

It’s when you label your data that you want to pass label_cls (i.e. in the label_from_* function).


#14

I’m so sorry, I still don’t understand this …

I first define the path to my images, split them and label them using a function.

src = (SegmentationItemList.from_folder(path_img)
       .random_split_by_pct()
       .label_from_func(get_y_fn, classes=codes))

src gives

LabelLists;

Train: LabelList
y: SegmentationLabelList (495 items)
[ImageSegment (1, 3024, 4032), ImageSegment (1, 750, 1000), ImageSegment (1, 780, 1040), ImageSegment (1, 768, 1024), ImageSegment (1, 3024, 4032)]…
Path: /home/jupyter/.fastai/data/Longkang/images
x: SegmentationItemList (495 items)
[Image (3, 3024, 4032), Image (3, 750, 1000), Image (3, 780, 1040), Image (3, 768, 1024), Image (3, 3024, 4032)]…
Path: /home/jupyter/.fastai/data/Longkang/images;

Valid: LabelList
y: SegmentationLabelList (123 items)
[ImageSegment (1, 768, 1024), ImageSegment (1, 3024, 4032), ImageSegment (1, 3024, 4032), ImageSegment (1, 750, 1000), ImageSegment (1, 3024, 4032)]…
Path: /home/jupyter/.fastai/data/Longkang/images
x: SegmentationItemList (123 items)
[Image (3, 768, 1024), Image (3, 3024, 4032), Image (3, 3024, 4032), Image (3, 750, 1000), Image (3, 3024, 4032)]…
Path: /home/jupyter/.fastai/data/Longkang/images;

Test: None

In my datablock API I run

data = (src.transform(get_transforms(), size=size, tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))

data.show_batch works, with the mask overlay, but the learning part fails with the CUDA error.

Isn’t passing classes=codes where I was labelling equivalent to passing label_cls=MySegList? Or do I have to pass label_cls again (somehow?) in the data block API?

A short example would help a lot!


(Thomas) #15

I am exactly there, with the same problem.


(Mohamed Ayman Elshazly) #16

Did you manage to solve it?


#17

Just wanted to mention that this little hint really helped :slight_smile:
