RetinaNet notebook merge idx issue

Hi all,

I am working on the RetinaNet notebook. Here is one line that seems to me to diverge from the focal loss (RetinaNet) paper.
I am not sure whether it is a bug or whether my understanding is wrong.

self.merges = nn.ModuleList([LateralUpsampleMerge(chs, sfs_szs[idx][1], hook)
                             for idx, hook in zip(sfs_idxs[-2:-4:-1], self.sfs[-2:-4:-1])])

sfs_idxs is the list [6, 5, 4, 2], which corresponds to the layers of the ResNet-50 model where the grid size changes.

My understanding is:

layer idx 6 corresponds to C4, which is the 16 x 16 x 1024 layer
layer idx 5 corresponds to C3, which is the 32 x 32 x 512 layer
layer idx 4 corresponds to C2, which is the 64 x 64 x 256 layer
layer idx 2 corresponds to C1, which is the 128 x 128 x 64 layer

Therefore, when we zip the indices and hooks, the slice [-2:-4:-1] actually gives idx [4, 5], which in the upsampling part uses the hook outputs of C2 and C3 (P2 = P3 + C2, P3 = P4 + C3).

I am a bit confused; I think the slicing should be [0:2:1], which gives [6, 5]:
layer idx 6 is C4, so P4 = P5 + C4
layer idx 5 is C3, so P3 = P4 + C3

As we know from the paper, we are capturing the feature map levels P3 through P7.
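
To make the slicing concrete (plain Python, nothing notebook-specific):

    sfs_idxs = [6, 5, 4, 2]      # deepest (C4) to shallowest (C1), as listed above

    print(sfs_idxs[-2:-4:-1])    # [4, 5] -> hooks for C2 and C3 (current notebook behaviour)
    print(sfs_idxs[0:2])         # [6, 5] -> hooks for C4 and C3 (what P4/P3 need per the paper)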

Actually, this really does look like a mistake. I tried to check your assumptions in practice.
You are right that sfs_idxs == [6, 5, 4, 2] and that the slice sfs_idxs[-2:-4:-1] == [4, 5].

I added some debug output to the forward method of LateralUpsampleMerge:

    def forward(self, x):
        # Apply the lateral conv to the stored encoder activation (the C_i hook output)
        conv_lat_hook = self.conv_lat(self.hook.stored)
        print("conv_lat_hook.shape:", conv_lat_hook.shape, "+ x.shape:", x.shape)
        # Upsample the top-down input to the lateral feature's spatial size and merge
        return conv_lat_hook + F.interpolate(x, self.hook.stored.shape[-2:], mode='nearest')

And when I run learn.summary() for a learner with 256x256 images in the data, these are the first lines of the output:

conv_lat_hook.shape: torch.Size([1, 256, 64, 64]) + x.shape: torch.Size([1, 256, 8, 8])
conv_lat_hook.shape: torch.Size([1, 256, 32, 32]) + x.shape: torch.Size([1, 256, 64, 64])

I tried changing the slice from [-2:-4:-1] to [0:2:1] and got this:

conv_lat_hook.shape: torch.Size([1, 256, 16, 16]) + x.shape: torch.Size([1, 256, 8, 8])
conv_lat_hook.shape: torch.Size([1, 256, 32, 32]) + x.shape: torch.Size([1, 256, 16, 16])

That's better. It seems like the author of the code forgot that the list of encoder layers that change the image size is already reversed in sfs_idxs.

Looking forward to hearing from the authors.

Maybe we are wrong and that "mistake" was made on purpose and gives better results.
I can't check that yet; I still can't get the notebook working: Having problems running pascal.ipynb notebook

I have working notebooks for both SSD and RetinaNet, if you want to take a look :)

I forgot to update this post after I finished the RetinaNet notebook.

Anyway, here is the link:

https://github.com/heye0507/dl_related/blob/master/play_ground/Retina_net_dev.ipynb

First, to convert these n values to probabilities, we apply the softmax activation function to them.

Thank you!
I looked at your RetinaNet, but it seems the loss function there is from SSD, not focal loss. Can you explain this, please?

I can’t understand how this is related to the topic. Could you give some more context?

My bad, that was laziness on my part; it is focal loss though. I was actually implementing RetinaNet and focal loss for an interview, and recycled as much code as possible from SSD. It is also true that focal loss is not very different from the SSD loss; the only difference is the get_weight() call.

I should rename it to focal loss instead of adding an option called focal_loss=True.
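
To illustrate what I mean, here is a minimal sketch of that weighting; the class name and exact signature are just illustrative, not my notebook's code:

    import torch
    import torch.nn.functional as F

    class FocalBCELoss(torch.nn.Module):
        "Illustrative focal loss: BCE with an alpha-balanced (1 - p_t)^gamma modulating weight."
        def __init__(self, gamma=2.0, alpha=0.25):
            super().__init__()
            self.gamma, self.alpha = gamma, alpha

        def get_weight(self, pred, targ):
            # targ is assumed to be a float one-hot target with the same shape as pred
            p = torch.sigmoid(pred)
            pt = targ * p + (1 - targ) * (1 - p)                          # prob of the true class
            alpha_t = targ * self.alpha + (1 - targ) * (1 - self.alpha)
            return alpha_t * (1 - pt) ** self.gamma                       # down-weights easy examples

        def forward(self, pred, targ):
            w = self.get_weight(pred, targ)
            return F.binary_cross_entropy_with_logits(pred, targ, weight=w.detach(),
                                                      reduction='sum')

With gamma=0 and alpha=0.5 this reduces to (half of) plain BCE, which is essentially the SSD-style loss.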

And did you do any tweaks to the model, apart from fixing the slicing and refactoring?

I didn't use the latest bounding box handling introduced in the 2019 RetinaNet notebook. I'm still using the 2018 bounding box scale, with the coordinate system changed from 0-1 to -1 to 1 (you can check my scaling; I was rushing the result, so I'm not 100% sure).

I didn’t implement the extra conv for smoothing out the upsampling artifacts.

Lastly, I didn't have time to implement non-max suppression or to output the confidence probability on the plots (a rough sketch of standard NMS is included below).

Those are all the differences I can think of. Oh, and I didn't change the bias initialization, so when training from scratch for the first time the model will need more epochs to adjust the initial weights.

But you can find all of what I just said in the 2019 RetinaNet notebook.
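
For reference, standard greedy NMS looks roughly like the sketch below; this is not code from my notebook, and the threshold and tensor layout are just assumptions:

    import torch

    def nms(boxes, scores, iou_thresh=0.5):
        "Greedy non-max suppression (sketch). boxes: [N, 4] as (x1, y1, x2, y2); scores: [N]."
        keep = []
        order = scores.argsort(descending=True)
        while order.numel() > 0:
            i = order[0]
            keep.append(i.item())
            if order.numel() == 1:
                break
            rest = order[1:]
            # Intersection of the highest-scoring box with the remaining boxes
            xy1 = torch.max(boxes[i, :2], boxes[rest, :2])
            xy2 = torch.min(boxes[i, 2:], boxes[rest, 2:])
            inter = (xy2 - xy1).clamp(min=0).prod(dim=1)
            area_i = (boxes[i, 2:] - boxes[i, :2]).prod()
            area_r = (boxes[rest, 2:] - boxes[rest, :2]).prod(dim=1)
            iou = inter / (area_i + area_r - inter)
            # Drop boxes that overlap the kept box too much; keep the rest for the next round
            order = rest[iou <= iou_thresh]
        return torch.tensor(keep, dtype=torch.long)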

Hope this helps :)

Thanks a lot!

Currently I am still fighting with the pascal.ipynb notebook and have made some progress: Having problems running pascal.ipynb notebook

I finally fixed the pascal.ipynb notebook and did some testing.

The results show that the model with the original slicing [-2:-4:-1] gives worse results than with [0:2] (which seems to be correct according to the RetinaNet paper).
The model with the original slicing also runs much slower.

Here are the losses:

Fit on 128, model frozen:

Slicing       Final train loss   Final valid loss
[-2:-4:-1]    1.469326           1.805385
[0:2]         1.170873           1.387025

Then fit on 128, model unfrozen:

Slicing       Final train loss   Final valid loss
[-2:-4:-1]    1.043817           1.393275
[0:2]         0.850711           1.052559

Then fit on 192, model frozen:

Slicing       Final train loss   Final valid loss
[-2:-4:-1]    1.022913           1.303066
[0:2]         0.861015           1.055111

Then fit on 192, model unfrozen:

Slicing       Final train loss   Final valid loss
[-2:-4:-1]    0.757558           1.097417
[0:2]         0.688136           0.899993

Then fit on 256, model frozen:

Slicing       Final train loss   Final valid loss
[-2:-4:-1]    0.767718           1.063900
[0:2]         0.732141           0.907611

Then fit on 256, model unfrozen:

Slicing       Final train loss   Final valid loss
[-2:-4:-1]    0.619473           0.929040
[0:2]         0.564977           0.825705

The fixed model with slicing [0:2] gives better results every time.

Here are notebooks with all results:

Right now I am creating a PR with fixes to the pascal.ipynb notebook in the course-v3 repository. Then I plan to create a PR for the fastai-dev repository.

P.S. I should mention that I'm currently testing on a GTX 860M, so I have to reduce the batch size a lot (make it 8 times smaller) and wait a long time. In a week I'll be home and will run more tests on an RTX 2070.

The reason I didn't submit a PR when I noticed the problem is that you want to visualize the feature map output. According to the paper, the implementation in the fastai repo is off; however, I don't 100% understand the bounding box implementation.

As you might expect, if the feature maps are wrong, the receptive field sizes will be much smaller than expected at the P3/P4 levels (for example, instead of 32x32 they probably output 64x64 grids), which, according to the paper, could even give a better result: if you ever plot the Pascal images, the missed objects are the very small ones.

Loss is not a good metric here; you will need to calculate the mean average precision (mAP) for both the fastai version and yours, and compare them to tell whether the result has indeed improved.

Also, since the model is computing more cells with smaller receptive fields, it makes sense that your loss is lower.
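
For reference, here is a minimal sketch of per-class average precision from already-matched detections (mAP is just the mean over classes); the function name and inputs are my own illustration, not the notebook's metric code:

    import numpy as np

    def average_precision(scores, is_tp, n_gt):
        "AP for one class, VOC-style: sort by score, build the PR curve, integrate its envelope."
        order = np.argsort(-np.asarray(scores, dtype=float))
        tp = np.asarray(is_tp, dtype=float)[order]
        tp_cum = np.cumsum(tp)
        fp_cum = np.cumsum(1.0 - tp)
        recall = tp_cum / max(n_gt, 1)
        precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
        # Pad, make precision monotonically decreasing, then sum area where recall changes
        rec = np.concatenate(([0.0], recall, [1.0]))
        prec = np.concatenate(([0.0], precision, [0.0]))
        for i in range(len(prec) - 2, -1, -1):
            prec[i] = max(prec[i], prec[i + 1])
        changed = np.where(rec[1:] != rec[:-1])[0]
        return float(np.sum((rec[changed + 1] - rec[changed]) * prec[changed + 1]))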

@radek, if you don't mind, I'm adding you to this. I think you probably went over the RetinaNet implementation. Do you think the merge idx is wrong?

The discussion is that, in the fastai repo, the RetinaNet implementation is merging C2-C3 instead of C3-C4 because of the slicing (see my first post).

Thanks in advance

Sorry, I haven't looked at this part of the codebase in ages.

It's all cool :) I will just wait until Jeremy gets to object detection in the supplementary course materials.

Thanks anyway :)

If the code that follows training in the notebook is correct, then I did calculate mAP already.

You can look at the end of these notebooks:

But I haven't studied or verified the metrics calculation code yet.

BTW, my PR: https://github.com/fastai/course-v3/pull/415

I also noticed that data normalization was left out of the original pascal notebook.

But I see it here.

Added data normalization to my PR. Will run more tests soon and post the results.

Does anyone know the state-of-the-art mAP for the Pascal 2007 dataset? I found some numbers that are a lot higher than my results, but I don't know whether the train/validation datasets were split the same way there.

I thought the Pascal dataset comes with a validation set when you download the data; valid.json is the annotation file for the validation set, I think.

I googled it. It seems to me the SOTA is around 87% mAP.

Also, if you read the RetinaNet paper, I think the input image size is 512. If you want to reach the mAP they got, I think you will need to train with a similar image size (your receptive field will be different, as you can see from growing from 128 to 256).

I used a P100 at the time; you can probably push it further with a V100 and mixed precision.
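
(In fastai v1, switching a learner to mixed precision should just be the one-liner below, assuming you already have a Learner called learn:)

    learn = learn.to_fp16()   # fastai v1 mixed-precision training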

Results are almost the same. Here is the notebook.

@heye0507, did you get state-of-the-art mAP on any well-known dataset with your version of the model and loss function?
