Project: implement yolo v3 backbone and preact resnet

jeremy · April 9, 2018, 12:54am

I was thinking of this one, since it just reads in the config file:

jeremy · April 9, 2018, 1:04am

@sgugger @emilmelnikov I’ve reformatted the yolo v3 config file (without the localization bit) to make it easier to read and see what’s going on. Hopefully this helps you check your implementation against the original:

https://gist.github.com/jph00/f48a68b86e1a9bcc07cac23c20a7c51e

yolov3 config.txt

[convolutional] batch_normalize=1 filters=32 size=3 stride=1 pad=1 activation=leaky

# Downsample
[convolutional] batch_normalize=1 filters=64 size=3 stride=2 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=32 size=1 stride=1 pad=1 activation=leaky
[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=leaky
[shortcut] from=-3 activation=linear

# Downsample
[convolutional] batch_normalize=1 filters=128 size=3 stride=2 pad=1 activation=leaky

This file has been truncated.

sgugger · April 9, 2018, 1:52am

Thanks Jeremy, this is very helpful.

I changed the notebook accordingly and I think it’s the real darknet-53 now. Trained it for a few epochs and it seemed to work.

jeremy · April 9, 2018, 3:50am

OK that’s great @sgugger I’ll try it out on full imagenet tomorrow - please pop it in the models/ dir and submit a PR.

@emilmelnikov if you have a moment maybe you can compare it to your implementation (speed and accuracy and code)? Let us know if you think there’s room to improve!

jeremy · April 9, 2018, 3:52am

Well after studying our existing resnet code it appears the stride-2 conv has already been moved to the spot they suggest…

bushaev · April 9, 2018, 1:49pm

Hi!
I’ve checked out your implementation. Looks very cool.
One thing though: in your ConvBN you have padding as a parameter, but when creating the convolutional layer you always set padding to 1 (which seems like correct behavior, because the yolo3 config also always uses pad=1).

And in DarknetBlock you have:

+        self.conv1 = ConvBN(ch_in, ch_hid, kernel_size=1, stride=1, padding=0)
+        self.conv2 = ConvBN(ch_hid, ch_in, kernel_size=3, stride=1, padding=1)

so you’re trying to set padding to zero in the first conv layer, but the parameter later gets ignored. (Which, again, I think it should be, as the yolo3 config suggests.)

sgugger · April 9, 2018, 2:13pm

You’re absolutely right. I forgot to remove all those padding arguments after hard-coding the padding to one everywhere. It doesn’t change the results, but it’s way cleaner this way!
Thanks for catching this, I’ve updated the notebook with the correction.

bushaev · April 9, 2018, 3:47pm

Hi, again! After looking at the darknet source code I’ve realized that ‘pad’ and ‘padding’ aren’t the same thing.

As you can see from here, when pad is equal to 1, padding is set to (filter_size / 2). So padding=0 would be the correct value for convolutional layers with kernel_size=1. That makes much more sense: why would we need padding for a convolution that only sees one cell at a time and hence doesn’t change the shape of the activations from the previous layer?
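The rule described above can be sketched in a few lines of plain Python. This is an illustration rather than darknet's actual code: the function names are hypothetical, the division is assumed to be integer division, and falling back to an explicit `padding` value when `pad` is 0 is an assumption.

```python
def darknet_padding(size: int, pad: int, padding: int = 0) -> int:
    """Padding in pixels implied by a [convolutional] block.

    pad=1 means "use size // 2"; otherwise the explicit `padding`
    value is used (assumed here to default to 0).
    """
    return size // 2 if pad else padding


def conv_out_size(n: int, size: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - size) // stride + 1


# 3x3 conv with pad=1: one pixel of padding, shape preserved at stride 1.
p3 = darknet_padding(size=3, pad=1)
same_3x3 = conv_out_size(416, size=3, stride=1, padding=p3)

# 1x1 conv with pad=1: no padding at all, shape still preserved.
p1 = darknet_padding(size=1, pad=1)
same_1x1 = conv_out_size(416, size=1, stride=1, padding=p1)

# Forcing padding=1 on a 1x1 conv would grow the feature map instead.
grown_1x1 = conv_out_size(416, size=1, stride=1, padding=1)

print(p3, same_3x3, p1, same_1x1, grown_1x1)
```

With a 416-pixel input, the last case yields 418 rather than 416, which is why a literal padding of 1 on a 1x1 convolution cannot be combined with a shortcut that adds it to an unpadded feature map.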

sgugger · April 9, 2018, 4:10pm

You’re right again, I went too fast when setting that padding to 1 for every layer (which only worked because of another bug anyway: I always forced the kernel_size to 3 in BNConv…)

I’ve updated the notebook and my PR again, thanks a lot for catching all of this!

emilmelnikov · April 9, 2018, 4:58pm

Sorry, was a bit busy, and now I’m having strange problems with CUDA OOM errors. This is the model itself: https://github.com/emilmelnikov/darknet53-pytorch

I’ve also tried to write training code using approximately the same settings as in the yolov3.cfg: SGD with learning rate 1e-3, momentum 0.9, weight decay 5e-4. Apparently, there are also some settings for data augmentation (angle, saturation, exposure, hue), but I haven’t managed to find out exactly how the DarkNet framework applies them.

A couple of interesting notes:

Of course, feel free to steal anything you find useful from the code.

sgugger · April 9, 2018, 5:06pm

Oh, I hadn’t noticed this one.
Thanks for sharing your code, will compare our approaches!

Edit: If I’m not mistaken, there is a bug in your shortcuts: the leaky_relu is applied after summing the input and the output of the second conv.

emilmelnikov · April 9, 2018, 5:40pm

In the original ResNet it is true, but in darknet it seems to be different.

The shortcut layer is defined as follows:

[shortcut]
from=-3
activation=linear

It looks like shortcut_cpu is a linear combination of its inputs.

The linear activation is just an identity function.

sgugger · April 9, 2018, 5:44pm

Yeah, but in the PyTorch implementation they apply an activation after the sum:

elif block['type'] == 'shortcut':
    from_layer = int(block['from'])
    activation = block['activation']
    from_layer = from_layer if from_layer > 0 else from_layer + ind
    x1 = outputs[from_layer]
    x2 = outputs[ind-1]
    x  = x1 + x2
    if activation == 'leaky':
        x = F.leaky_relu(x, 0.1, inplace=True)
    elif activation == 'relu':
        x = F.relu(x, inplace=True)
    outputs[ind] = x

Edit: which means you’re right since with shortcut, we’ve got activation=linear, which isn’t leaky or relu.
Sorry!
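To make the conclusion concrete, here is a minimal pure-Python sketch of the same shortcut logic, using lists of floats in place of tensors. The helper names and the toy values are made up for illustration; only the index arithmetic and the activation dispatch mirror the snippet quoted above.

```python
def leaky_relu(v: float, slope: float = 0.1) -> float:
    """Leaky ReLU on a single value."""
    return v if v > 0 else slope * v


def apply_shortcut(outputs, ind, from_layer, activation):
    """Sum the previous layer's output with an earlier one.

    outputs: list of per-layer outputs (each a list of floats)
    ind: index of the shortcut layer itself
    from_layer: absolute index if positive, otherwise relative to `ind`
    """
    src = from_layer if from_layer > 0 else from_layer + ind
    x = [a + b for a, b in zip(outputs[src], outputs[ind - 1])]
    if activation == "leaky":
        x = [leaky_relu(v) for v in x]
    elif activation == "relu":
        x = [max(v, 0.0) for v in x]
    # Any other value, including "linear", leaves the sum untouched.
    return x


# Toy outputs for layers 0..3; the shortcut is layer 4 with from=-3,
# so it adds layer 1 (the block input) to layer 3 (the second conv).
outputs = [[1.0, -2.0], [0.5, 0.5], [3.0, -1.0], [-2.0, 0.5]]
linear_out = apply_shortcut(outputs, ind=4, from_layer=-3, activation="linear")
leaky_out = apply_shortcut(outputs, ind=4, from_layer=-3, activation="leaky")
print(linear_out, leaky_out)
```

With activation=linear the negative entry of the sum passes through unchanged, whereas a leaky activation would scale it by 0.1.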

jeremy · April 9, 2018, 8:38pm

Just FYI the implementation in pytorch isn’t from the original authors, and in my experience about 98% of attempts to replicate papers in DL are wrong. So take them with a grain of salt, unless they show that they’ve replicated the results from the paper!

jeremy · April 9, 2018, 8:40pm

Just read through the PR and the implementation is very nice and clean - thanks! Will try it out now.

sgugger · April 9, 2018, 8:43pm

Like my first four or five attempts to replicate this darknet!
I think I’ve got it right now, all thanks to emilmelnikov. The last version in the PR seemed to be training properly:

jeremy · April 9, 2018, 9:09pm

FYI I just pushed a minor fix, which is to remove the log_softmax. In PyTorch, models generally don’t include that, since it’s built into the cross_entropy loss function.

I also added a couple of smaller versions of the model for us to try out.
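The log_softmax point can be checked numerically without PyTorch. The sketch below reimplements both functions in plain Python, so the names are stand-ins rather than the library calls. It shows that a cross-entropy loss which already applies log_softmax gives the same value whether or not the model applied it first, which makes the extra layer redundant.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def cross_entropy(logits, target):
    """Cross-entropy for one sample: log_softmax followed by NLL."""
    return -log_softmax(logits)[target]


logits = [2.0, -1.0, 0.5]

# Model outputs raw logits; the loss applies log_softmax internally.
loss_raw = cross_entropy(logits, target=0)

# Model already applied log_softmax; the loss applies it a second time.
loss_double = cross_entropy(log_softmax(logits), target=0)

print(loss_raw, loss_double)
```

The two losses agree up to floating-point error because log_softmax outputs already have a log-sum-exp of zero, so a second application changes nothing.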

jeremy · April 9, 2018, 9:26pm

Bad news: I can’t fit batch size 128 on a 16GB GPU. Any thoughts on how to decrease memory needs?

I’ve created a copy of smaller versions of the model to try. But the RAM use seems very high. Have you guys tried counting the number of parameters and comparing that with the reference implementation to ensure it’s the same?

jeremy · April 9, 2018, 10:24pm

OK I’ve just pushed a version of darknet.py that allows different numbers of groups, and has a variety of ‘mini’ versions that all fit in 16GB when using half precision and batch size of 128. They should also fit in 8GB with single precision with batch size 32.

sgugger · April 9, 2018, 10:31pm

A quick sum over my implementation gives 41,609,928 parameters. Not sure how many there are in the original implementation.
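As a rough cross-check, the count can also be derived by hand from the layer layout. The sketch below rests on several assumptions: the standard Darknet-53 classifier layout (an initial 3x3 conv with 32 filters, then five stride-2 stages with 1, 2, 8, 8 and 4 residual blocks), bias-free convolutions each followed by batch norm with two learnable parameters per channel, and a 1000-class linear head with bias. The function names are made up for illustration.

```python
def conv_bn_params(ch_in: int, ch_out: int, size: int) -> int:
    """Weights of a bias-free conv plus batch norm scale and shift."""
    return ch_in * ch_out * size * size + 2 * ch_out


def darknet53_params(num_blocks=(1, 2, 8, 8, 4), start_ch=32, num_classes=1000):
    """Learnable parameter count under the assumptions stated above."""
    total = conv_bn_params(3, start_ch, 3)          # stem conv on RGB input
    ch = start_ch
    for n in num_blocks:
        total += conv_bn_params(ch, ch * 2, 3)      # stride-2 downsample conv
        ch *= 2
        for _ in range(n):
            total += conv_bn_params(ch, ch // 2, 1)  # 1x1 bottleneck
            total += conv_bn_params(ch // 2, ch, 3)  # 3x3 expansion
    total += ch * num_classes + num_classes          # linear head with bias
    return total


print(darknet53_params())  # 41609928
```

Under those assumptions the total matches the 41,609,928 reported above. For an actual PyTorch module, the usual way to get the same number is `sum(p.numel() for p in model.parameters())`.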