Understanding code - error Expected more than 1 value per channel when training

I stopped using AWS and moved to my personal laptop, since I am only doing tests for understanding the code.
I struggled with this error for a while: `Expected more than 1 value per channel when training, got input size [1, 1024]`, and tried to figure out what I was doing wrong. I have seen that this error is raised by the forward function, which calls F.batch_norm. I interpret it as meaning that it expects an image, e.g. [1, 3, 224, 224], or [1, 256, 14, 14], where 256 is the number of filters and 14x14 is the spatial size of the input. So it works fine until the input is flattened into a vector:

size(input): [1, 64, 112, 112]
size(input): [1, 64, 56, 56]
.....
size(input): [1, 512, 7, 7]
size(input): [1, 512, 7, 7]
size(input): [1, 1024]
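The failure can be reproduced in isolation (a minimal sketch, not the original training code): in training mode, BatchNorm estimates per-channel variance from the batch, which is impossible with a single sample.

```python
import torch
import torch.nn as nn

# BatchNorm needs more than 1 value per channel in training mode,
# because it estimates the per-channel variance from the batch itself.
bn = nn.BatchNorm1d(1024)
bn.train()

try:
    bn(torch.randn(1, 1024))   # batch size 1 -> cannot compute variance
except ValueError as e:
    print(e)                   # "Expected more than 1 value per channel when training, ..."

out = bn(torch.randn(2, 1024)) # batch size 2 works fine
```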

Trying to figure out the issue, I replicated the same setup on AWS, where I get no error running the exact same code. Any help or orientation tip is extremely appreciated.

On the left is the code from AWS, and on the right is the same code on my PC, with the error.

P.S. Even the model looks the same.

1 Like

Since the same code runs with no problems on AWS, is there a chance you need to update PyTorch?

1 Like

Or could the data be different on the two computers?

1 Like

I’m having the same strange issue. What I’ve figured out so far: it probably has something to do with the dataset size and the batch size (though I still haven’t figured out all the details).

When I train on a sample dataset of ~800 images, the error does not reproduce with batch size 32, but it does reproduce with batch size 1.

When I train on the full dataset of ~100K images, it does reproduce with batch size 32, but (if I recall correctly) not with batch size 64.

So, I would suggest trying a few different batch sizes to mitigate the issue until the root cause is figured out and fixed.

Upd. I’ve just noticed you asked the question more than a month ago. @alessa, did you figure out what the issue was?

3 Likes

import torch
print(torch.__version__)
0.2.0.4

exactly the same data

I had let it go, because it was too abstract and I had no clue how to fix the issue.

I came back to it due to your post, and I realized that whenever I set precompute=True I get the same error both on AWS and on my PC, which says `running mean should contain 3 elements not 4096`.

[update] Jeremy explains the reason for this error in the thread "How do we use our model against a specific image?":

You need to set precompute=False before you do that prediction, since you’re passing in an image, not a precomputed activation.

but when I set precompute=False, I get the error only on my local machine:

ValueError: Expected more than 1 value per channel when training, got input size [1, 4096]

1 Like

It has no relationship with the batch size; I modified it several times and nothing changed.

But I figured out that these lines of code give me the error "ValueError: Expected more than 1 value per channel when training, got input size [1, 4096]":

data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs)
learn = ConvLearner.pretrained(arch, data)

x,y = next(iter(data.val_dl))   # first validation batch
x,y = x[None,0], y[None,0]      # keep a single image -> batch size 1
m = learn.models.model
py = m(Variable(x.cuda())); py  # this forward pass raises the ValueError

but as soon as I call learn.predict() first, I no longer get the error:

data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs)
learn = ConvLearner.pretrained(arch, data)

learn.predict()   # running a prediction pass first avoids the error

x,y = next(iter(data.val_dl))
x,y = x[None,0], y[None,0]
m = learn.models.model
py = m(Variable(x.cuda())); py

Variable containing:
-0.9635 -2.2739 -0.6625
[torch.cuda.FloatTensor of size 1x3 (GPU 0)]

2 Likes

Ok, I’ve finally figured out what causes the issue.

As stated in the PyTorch issue on this error, you will get it each time the size of one of your batches equals 1.
It has something to do with the inner workings of the BatchNorm layer in training mode (inference with a batch size of 1 is possible, according to the person who closed the issue).

So, the error occurs either if your batch size equals 1, or if the size of your dataset modulo the batch size equals 1, which makes the last batch of your data contain a single element. A simple solution is just to remove one data point from your training dataset.
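The dataset-size condition described above can be expressed as a quick check (a sketch; `last_batch_size` and its arguments are illustrative names, not part of any library):

```python
def last_batch_size(n_samples: int, batch_size: int) -> int:
    """Size of the final batch when a dataset is split into fixed-size batches."""
    rem = n_samples % batch_size
    return rem if rem else batch_size

# The problematic case: a stray single-sample batch at the end.
print(last_batch_size(801, 32))  # 1  -> BatchNorm fails in training mode
print(last_batch_size(800, 32))  # 32 -> fine
```

In plain PyTorch, passing `drop_last=True` to `torch.utils.data.DataLoader` discards such a trailing incomplete batch, which avoids the error without removing any data from the dataset itself.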

Hope it helps someone :)

29 Likes

@alessa - I am having the same experience. Running learn.predict() somehow eliminates this mysterious error.

1 Like

@alessa Doing model.eval() before the run solves the problem, since it switches BatchNorm to use its running statistics instead of per-batch statistics.
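A minimal illustration of why this works (a hypothetical toy model, not the fastai learner): in eval() mode, BatchNorm uses its stored running mean and variance, so a single-sample batch is fine.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 1024), nn.BatchNorm1d(1024))
model.eval()                     # BatchNorm now uses running mean/var

out = model(torch.randn(1, 10))  # batch size 1 no longer raises
print(out.shape)                 # torch.Size([1, 1024])
```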

4 Likes

Thank you for the detailed answer! Removing 1 item from the dataset eliminated the error :)

2 Likes

thank you, it works!

Changing the batch size solves it too, e.g. bs = bs - 1.

2 Likes

Yes. However, for hardware-related reasons, most people use a batch size divisible by 8; e.g. a batch size of 31 is inefficient. I am not an expert, but I have never seen an odd batch size used.

1 Like

I’m not sure the formula about the relationship between batch size and dataset size covers every case. I had 3 classes, each with an even number of images, and it was still giving me this error. I removed 1 image from each class and the error went away, so I’m SUPER thankful for this thread! I was getting the same error with both resnet34 and resnet50, and it went away for both, so I don’t think it’s model related. If you’re struggling with this error, just try tossing out 1 training image (per class?) and see if it goes away.

Thanks to all!

As pointed out by
@Tsepaka

batch_size=1

conflicts with batch normalization in training mode. So it seems it can’t be used in training mode unless they have fixed the bug. But for testing, calling

model.eval()

before the evaluation solves the problem.