Image regression, several points

Good afternoon. I tried to build a model that predicts the coordinates of items in pictures, as in the lesson 3 regression example with the BIWI head pose dataset.
When I try to create my dataset in a similar way, I get the following error:

It’s not possible to collate samples of your dataset together in a batch.
Shapes of the inputs/targets:
[[torch.Size([3, 160, 160]), torch.Size([3, 160, 160])], [torch.Size([152, 2]), torch.Size([1, 2])]]

Each picture can have a different number of items/targets to detect (here one image has 152 x/y coordinates to predict, versus another image with only one item). Am I trying to do something impossible? ^^ Or is there a specific part of the docs I should look at? (Or should I maybe change the dimensions of my tensor, to something like 152 x 1 x 2?)

Thanks !

How are you creating your databunch? How are you gathering the labels? Could you provide that code please? Those help immensely as I dealt with the same issue a few months back with the Coco dataset :slight_smile:

I am creating data in the following way :

data = (PointsItemList.from_df(df_train, path+"/train_images/")
.transform(tfm_y=True, size=(160,160))

If I put bs = 1 it works. The problem is really about the dimension of the target/output.
Thanks for your help ^^

When you use a bs of 1, do your points all come out correctly? I.e., does show_batch() look okay?

Yes ^^

I have the same issue as you guys. Did you manage to solve it? Or do I need to have the same number of points for each image?

Hi !
Well, actually I was misled by my partial view of the library, and I moved on to something else ^^ (I am trying to use RetinaNet, which is not implemented in the current fast.ai, but there is a GitHub repo for it ^^)

From my experience with image points so far, you need some way to say that you have a fixed number x of possible points. For example, when I tried human pose data, any point that was not present defaulted to (0,0). I’m unsure whether there’s a way to just say there are X points for a given image, because our y would then change shape from sample to sample instead of having an expected size. So perhaps try setting the missing points to (0,0).
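
A minimal sketch of that padding idea, written in plain Python so it does not depend on any fastai internals. The names `pad_points`, `max_points` and the mask are hypothetical and only for illustration; they are not part of the library, and the upper limit of 4 points is an assumption.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pad_points(points: List[Point], max_points: int,
               pad_value: Point = (0.0, 0.0)) -> Tuple[List[Point], List[int]]:
    """Pad a list of points so every sample has exactly max_points entries.

    Returns the padded points and a 0/1 mask marking which entries are real.
    """
    if len(points) > max_points:
        raise ValueError(f"got {len(points)} points, expected at most {max_points}")
    n_pad = max_points - len(points)
    padded = list(points) + [pad_value] * n_pad
    mask = [1] * len(points) + [0] * n_pad
    return padded, mask


# Two samples with different numbers of points (hypothetical values).
sample_a = [(10.0, 20.0), (30.0, 40.0), (50.0, 60.0)]
sample_b = [(15.0, 25.0)]

padded_a, mask_a = pad_points(sample_a, max_points=4)
padded_b, mask_b = pad_points(sample_b, max_points=4)

# Both targets now have the same length, so they could be stacked into one batch.
print(len(padded_a), len(padded_b))
print(mask_b)
```

The mask is optional. It is one possible way to let a custom loss ignore the padded entries, so the model is not pushed towards predicting (0,0) for points that do not exist.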

Hello ! Thank you for the tips, I will look into it !

I had the same problem. I fixed it by making all my input images of the same size.

Why doesn’t anything besides bs=1 work? I do not understand. Could you please explain?
I have literally the same issue, and now it’s fixed. But I don’t understand why.
It is bittersweet.
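
As far as I understand it, the default collate step stacks the samples of a batch into a single tensor, and stacking needs every sample to have the same shape. With bs=1 there is only one sample, so there is nothing to mismatch. A rough sketch of that check in plain Python, using the shapes from the error message at the top of the thread (`can_stack` is a hypothetical helper, not a library function):

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def can_stack(shapes: List[Shape]) -> bool:
    """True if tensors with these shapes could be stacked along a new batch axis."""
    return all(shape == shapes[0] for shape in shapes)


# Shapes taken from the error message in the first post.
image_shapes = [(3, 160, 160), (3, 160, 160)]
target_shapes = [(152, 2), (1, 2)]

print(can_stack(image_shapes))       # images all share one shape
print(can_stack(target_shapes))      # targets differ in their first dimension
print(can_stack(target_shapes[:1]))  # a batch of one sample has nothing to mismatch
```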

I kind of have the same issue; for me it seems to have to do with the transforms. If I have transforms enabled and they result in some of the points dropping out of the image region, I get an error message when trying to run training or the learning rate finder. That’s my theory at least.

Like so:
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 4 and 0 in dimension 1 at /tmp/pip-req-build-p5q91txh/aten/src/TH/generic/THTensor.cpp:689

It’s rather a shame, since the transformations are great. I’ve just started looking into fastai and have not yet figured out a way to hook into this so that I could fall back to the original image and points if the transform results in invalid data.

My data has 4 points for each image, the data is all there, and training works without the transforms enabled. I’m not getting really good results though, even if on the surface it does not look like too hard a problem. I’m using the resnet34 architecture like in the lesson, but I’m struggling to get below 10% accuracy on the predictions.

You’re right - the transforms likely make some points drop out of the image region. Try passing remove_out=False inside .transform() so that such points are kept rather than dropped, which keeps the target the same shape for every image.
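
A toy illustration of why dropping out-of-frame points changes the target shape. This is not fastai’s implementation: `in_frame` and `apply_shift` are hypothetical helpers, the `remove_out` flag here only mimics the idea, and the sketch assumes point coordinates normalised to [-1, 1].

```python
from typing import List, Tuple

Point = Tuple[float, float]


def in_frame(p: Point) -> bool:
    """Assumes coordinates normalised to [-1, 1] on both axes."""
    return -1.0 <= p[0] <= 1.0 and -1.0 <= p[1] <= 1.0


def apply_shift(points: List[Point], dx: float, dy: float,
                remove_out: bool) -> List[Point]:
    """Shift every point; optionally drop the ones that leave the frame."""
    shifted = [(x + dx, y + dy) for x, y in points]
    if remove_out:
        shifted = [p for p in shifted if in_frame(p)]
    return shifted


# Four points per image, as in the post above (hypothetical coordinates).
corners = [(-0.8, -0.8), (-0.8, 0.8), (0.8, -0.8), (0.8, 0.8)]

dropped = apply_shift(corners, dx=0.5, dy=0.0, remove_out=True)
kept = apply_shift(corners, dx=0.5, dy=0.0, remove_out=False)

# With dropping, this sample no longer has 4 points and cannot be
# batched with samples that still do; without dropping it keeps all 4.
print(len(dropped), len(kept))
```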


Hi, I was wondering if anyone knows if it’s possible to have a variable number of outputs without defining an upper limit and padding the unused coordinates?

I have a use case where an image could have zero points, up to some unknown limit.
For any image in any use case the upper limit will be its number of pixels, but such a model is then equivalent to semantic segmentation.
My use case will be far below one point per pixel, but to guess at it I would have to guess the smallest representation of the feature.

I could use semantic segmentation (a prediction per pixel), but on average the number of points will be a small fraction of the pixel count, so it would be nice to work with this subset of data instead of generating larger segmentation masks.
