A small bug in lr_find

I just ran lesson1.ipynb with the batch size set to 2. When I run `lrf = learn.lr_find()`, the tqdm progress bar behaves oddly: the code finishes while tqdm is only at 44%. I haven't read the fastai source code, but maybe the `total` parameter of tqdm is being set incorrectly?

Also, I cannot run `learn.fit` at all when I set the batch size to 1. When I run the following code, it raises an error:

learn.fit(5e-4, 2)

C:\ProgramData\Anaconda2\envs\fastai\lib\site-packages\torch\nn\functional.py in batch_norm(input, running_mean, running_var, weight, bias, training, momentum, eps)
1009 size = list(input.size())
1010 if reduce(mul, size[2:], size[0]) == 1:
-> 1011 raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
1012 f = torch._C._functions.BatchNorm(running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled)
1013 return f(input, weight, bias)

ValueError: Expected more than 1 value per channel when training, got input size [1, 1024]

THX

Hi,
I ran into this one too. It has to do with batch normalization, which can't run on batches of size 1: batchnorm divides by the standard deviation of the batch (plus a small epsilon), and on a batch with a single sample the standard deviation is 0.
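The failure is easy to reproduce with a bare PyTorch layer; this is a minimal sketch using `nn.BatchNorm1d(1024)` to mimic the layer shape in the traceback above:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1024)
bn.train()                # training mode: statistics are computed per batch

x = torch.randn(1, 1024)  # a batch of size 1
try:
    bn(x)
except ValueError as e:
    print(e)              # "Expected more than 1 value per channel when training, ..."

bn.eval()                 # eval mode uses the running statistics instead
y = bn(x)                 # works fine with a single sample
```

Note that in eval mode the running mean and variance are used, so a batch of one is fine at inference time; only training mode needs more than one value per channel.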

You can find more information here:

So setting the batch size to 1 cannot work. There's also a little catch: if your total number of training samples modulo batch_size is 1, then your last batch has size 1 and you will get this same error. You have to take this into account when choosing a batch size (or remove one training sample).
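One way to guard against that leftover size-1 batch, assuming you are building a plain PyTorch `DataLoader` yourself (fastai's own loaders may wrap this differently), is the `drop_last` flag:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 9 samples with batch_size=2 would leave a final batch of size 1
ds = TensorDataset(torch.randn(9, 3))
dl = DataLoader(ds, batch_size=2, drop_last=True)

sizes = [batch[0].shape[0] for batch in dl]
print(sizes)  # [2, 2, 2, 2] -- the size-1 remainder is dropped
```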

Concerning tqdm, I'm not sure what you mean?

Thanks! The corner case you mentioned is interesting; I never hit that situation when using TensorFlow.
As for the tqdm display bug, it happens when I run this code:

arch = resnet34
data = ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
lrf = learn.lr_find()

When the code finishes, tqdm shows:
47%|███████████████████████████████▏ | 5441/11500 [02:56<03:16, 30.88it/s, loss=1.61]

So I think the module may be setting tqdm's `total` incorrectly.
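For what it's worth, a tqdm bar is left at a partial percentage whenever the loop stops before `total` iterations have been consumed. Here is a minimal sketch of that behavior (the numbers are taken from the output above, not from fastai internals):

```python
from tqdm import tqdm

bar = tqdm(total=11500)
for i in range(11500):
    bar.update(1)
    if i == 5440:   # the loop exits early, e.g. because training stopped
        break
bar.close()          # the bar is left around 47%, like the output above
```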