Part 2 Lesson 9 wiki

As I think more about this approach, this flexibility in dealing with varying tensor sizes across mini-batches seems to be one of the upsides of frameworks like PyTorch, which use dynamic computation graphs, compared to frameworks like TensorFlow. Can we achieve similar elegance with TF? Pardon my ignorance.

1 Like

It seems that whether we use one-side-padded convolutions at some point, and what their exact parameters are, is relatively unimportant compared to the big picture.

Forget the architecture, it’s just a thing that is spitting out 16 x (4 + c) activations (~ 1h:20m)

Seems the magic lives in the cost function and despite the arch being fancy it is still ‘just’ a universal function approximator :slight_smile:

I think I was blowing the importance of a minor detail way out of proportion then.

Thanks to @daveluo for unpacking the SSD_MultiHead and reconciling the output shape.

Edit: see corrections and a more detailed explanation below:

9 Likes

Seriously, great job, helped a ton! :sunglasses:

Getting a good understanding of each layer is very important, especially at the places where two models meet (in this case, how the convolutions proceed after the resnet34 base model through to the final layers that match the output activations) and how they are put together. Actually, I think this gives us an opportunity to get into the heads of these model developers.
You are doing a good job by being curious and helping us out.

Thanks for the photo share! It was really helpful, and a great learning exercise for me too to work it through with you all.

A correction to make at the bottom-left of the photo where it says “Output = (64, 189, 4)”:

  • 64 is the batch size, not the number of channels
  • 189 is the number of predictions for each of the 64 images in the batch. This corresponds to the 189 anchor boxes that we defined up top.
  • 4 is the set of bounding-box corner coordinates that is trained for each anchor box (x 189 from the 2nd dimension). This is the 2nd of the 2 outputs in the returned list (specifically, torch.cat([o1l,o2l,o3l], dim=1))

The other output, the 1st in the list, has 21 elements in the 3rd dimension (full shape (64, 189, 21)), representing the one-hot-encoded predictions for the 20 categories + 1 ‘bg’ category. This is torch.cat([o1c,o2c,o3c], dim=1)

from the return step of the forward pass:

class SSD_MultiHead(nn.Module):
    def __init__(self, k, bias):
        super().__init__()
        self.drop = nn.Dropout(drop)
        self.sconv1 = StdConv(512,256, drop=drop)
        self.sconv2 = StdConv(256,256, drop=drop)
        self.sconv3 = StdConv(256,256, drop=drop)
        self.out0 = OutConv(k, 256, bias)
        self.out1 = OutConv(k, 256, bias)
        self.out2 = OutConv(k, 256, bias)
        self.out3 = OutConv(k, 256, bias)

    def forward(self, x):
        x = self.drop(F.relu(x))
        x = self.sconv1(x)            # stride-2 conv -> 4x4 grid
        o1c,o1l = self.out1(x)        # o?c: class activations, o?l: box activations
        x = self.sconv2(x)            # -> 2x2 grid
        o2c,o2l = self.out2(x)
        x = self.sconv3(x)            # -> 1x1 grid
        o3c,o3l = self.out3(x)
        # classes first, boxes second: (bs, 189, 21) and (bs, 189, 4) for k=9
        return [torch.cat([o1c,o2c,o3c], dim=1),
                torch.cat([o1l,o2l,o3l], dim=1)]
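
For anyone sanity-checking the numbers above, here is a minimal sketch of where 189 and the two output shapes come from, assuming the notebook's setup of 4x4, 2x2 and 1x1 grids with k = 9 anchors per cell:

k = 9                              # anchors per grid cell (3 zooms x 3 aspect ratios)
grid_cells = 4*4 + 2*2 + 1*1       # 21 cells across the three output heads
n_anchors = grid_cells * k         # 21 * 9 = 189
n_classes = 20 + 1                 # 20 Pascal VOC categories + background

bs = 64                            # batch size
print((bs, n_anchors, n_classes))  # (64, 189, 21) - classification output
print((bs, n_anchors, 4))          # (64, 189, 4)  - bounding-box output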
7 Likes

Going back to the initialization of the bias of the output convolutional layer that gives us the class predictions (-3 when we have 16 anchors, then -4 when we have a lot of anchors): Jeremy said he used a negative value to make it harder for the network to predict a category (i.e. to make it easier to predict background, which happens a lot of the time).
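
To see why, note that the class activations are squashed by a sigmoid in the BCE-style loss, so with near-zero weights at initialization the predicted probability of each class is roughly sigmoid(bias). A quick sketch of the arithmetic (plain Python, my own illustration):

import math

def sigmoid(x): return 1 / (1 + math.exp(-x))

for b in [0, -3, -4, -6]:
    # at init the weights are small, so the pre-sigmoid activation is ~ the bias
    print(b, round(sigmoid(b), 4))
# 0 -> 0.5, -3 -> 0.0474, -4 -> 0.018, -6 -> 0.0025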

I’ve played around with this in that notebook and I’ve come to several conclusions.

  1. It's indeed super helpful to do this: even with a lot of training, a network where we initialize the bias with zeros doesn't train as well and doesn't reach the same results in terms of loss.

  2. Contrary to what I thought at first, the network won't learn on its own to put a strongly negative value in those biases. In fact, they barely change during training.

  3. The best initialization value I found by trying a lot of them (in the sense that it gives the lowest validation loss after a cycle) is -6 for the first model with only 16 anchors.

Then I tried to find a way to guess the ideal initialization value without trying a lot of them and running a full cycle. My idea was to try every bias value in a given range on the model, evaluate each on the first mini-batch, and compute the loss; I'd then initialize the bias to the value that gave the minimum loss (a rough sketch of this search is below).
In the first model with 16 anchors, this gave me -5.45, not as good as the empirical -6, but close enough.

On the last model with all the anchors and the focal loss, it gave me -3 (close to the -4 Jeremy picked) which then gave similar results after a lot of training (final validation losses of 5.523 and 5.497 respectively).
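
For reference, the search looks roughly like this. It's only a sketch: make_model(bias_init) is a placeholder for however you rebuild the SSD head with its classification-conv biases initialized to bias_init, and ssd_loss plus a single (x, y) mini-batch are assumed to be defined as in the notebook.

import numpy as np

def score_bias(bias_init, x, y):
    model = make_model(bias_init)   # placeholder: fresh model, class-conv biases = bias_init
    pred = model(x)                 # no training, just one forward pass
    loss = ssd_loss(pred, y)
    return loss.item()              # use loss.data[0] on the 0.3-era pytorch from the course

candidates = np.arange(-8, 0, 0.05)
losses = [score_bias(b, x, y) for b in candidates]
best = candidates[int(np.argmin(losses))]
print(best)                         # ~ -5.45 for the 16-anchor model, as mentioned above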

9 Likes

I got this error too. The line

anchors = anchors.cpu(); grid_sizes = grid_sizes.cpu(); anchor_cnr = anchor_cnr.cpu()

happens after the error about weight type and input type.

I found .cpu() in a number of places:

def one_hot_embedding(labels, num_classes):
    return torch.eye(num_classes)[labels.data.cuda()]

for i,o in enumerate(y): y[i] = o.cpu()
learn.model.cpu()

I changed these and no longer get the runtime error about CUDAFloatTensor vs. CPUFloatTensor.

Now I get the error

Performing basic indexing on a tensor and encountered an error indexing dim 0 
with an object of type torch.cuda.LongTensor. The only supported types are integers,
slices, numpy scalars, or if indexing with a torch.LongTensor or torch.
ByteTensor only a single Tensor may be passed.

I did a little looking into it, but I have to go right now. More later.

I’ve found that if you just don’t run the lines that set pytorch Variables to .cpu() (or make sure those lines are commented out) in the original pascal-multi nb, it should all run correctly on GPU. Specifically these 3 cells:

x,y = next(iter(md.val_dl))
# x,y = V(x).cpu(),V(y)
x,y = V(x),V(y)
#for i,o in enumerate(y): y[i] = o.cpu()
learn.model#.cpu()
#anchors = anchors.cpu(); grid_sizes = grid_sizes.cpu(); anchor_cnr = anchor_cnr.cpu()

I found it easiest to restart the original notebook kernel, check that these 3 lines are commented out, and run through to confirm that it works.

By default, I believe variables are placed on CUDA (the GPU) when they're first defined. What's happening is that when you run the lines above (the ones I've commented out), those pytorch Variables get placed on the CPU while other Variables stay on the GPU, which makes them incompatible later on when a function that needs both is called.
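
A minimal illustration of that kind of device mismatch, with made-up tensors rather than the notebook's variables:

import torch

a = torch.randn(3, 4)            # CPU tensor
b = torch.randn(3, 4).cuda()     # GPU tensor (needs a CUDA device)

try:
    c = a + b                    # mixing devices raises an error
except (TypeError, RuntimeError) as e:
    print(e)

c = a.cuda() + b                 # fine once both operands live on the GPU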

6 Likes

This small tweak worked for me as well.

1 Like

The lines that convert existing tensors into cpu versions aren’t meant to be run - they are there to enable testing on the CPU (since if you have errors on the GPU, they’re much harder to debug).

1 Like

Yup exactly. This is done by fastai. You can override this behavior with fastai.core.USE_GPU=False BTW. (You need to run that before you start creating your models or dataloaders).
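
For example (a sketch of that suggestion; the flag has to be set before the dataloaders and models are created):

import fastai.core
fastai.core.USE_GPU = False   # set this first, before building md / learn

# ...then create your ImageClassifierData and learner as usual;
# everything will stay on the CPU.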

4 Likes

How Confident are we?

The confidence threshold hard-coded into show_nmf is 0.25; I got some interesting results by making it a parameter. It seems that for some objects (person) increasing it helps, and for others (dog) it hurts. The SSD paper quotes a threshold of 0.1, but I'm guessing this threshold should somehow depend on the size of the ground-truth object relative to the anchor boxes. Any thoughts?
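
For reference, the kind of tweak I mean is roughly the following. This is only a sketch with a hypothetical filter_detections helper, not the notebook's actual show_nmf:

import torch

def filter_detections(probs, boxes, conf_thresh=0.25):
    # probs: (n,) class probabilities for one class after NMS
    # boxes: (n, 4) predicted box corners
    keep = probs > conf_thresh
    return probs[keep], boxes[keep]

# try e.g. conf_thresh=0.1 (SSD paper) vs 0.25 vs 0.4 and compare classes
probs = torch.tensor([0.05, 0.30, 0.12, 0.60])
boxes = torch.rand(4, 4)
print(filter_detections(probs, boxes, conf_thresh=0.25)[0])   # tensor([0.3000, 0.6000])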

Advice: if you want to go through pascal-multi.ipynb step by step, executing as you go, use Tim David Lee's version, not Jeremy's. TDL's version has more comments/discussions and disambiguations, and it actually runs straight through without any editing to deal with the CPU/CUDA issue.

I was reading about the focal loss that was discussed in class. It handles the class-imbalance problem for single-stage object detectors like YOLO/SSD by weighting more heavily the observations that are difficult to classify. Is it right to think of this as the neural-net version of an ensemble of boosted trees? If so, it would be a beautifully simple tweak that people hadn't thought of before for CNNs, even though everyone was using the idea for tree models. Please correct me if I am wrong.
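
For concreteness, here is a minimal sketch of the focal-loss idea as described in the paper: ordinary binary cross-entropy, down-weighted by a (1 - p_t)^gamma factor so that easy, well-classified examples contribute little. The gamma and alpha values are the paper's defaults, not necessarily what the notebook uses, and it's written against current PyTorch:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # per-element binary cross-entropy on the raw activations
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p*targets + (1 - p)*(1 - targets)              # prob assigned to the true class
    alpha_t = alpha*targets + (1 - alpha)*(1 - targets)
    # easy examples (p_t close to 1) are heavily down-weighted
    return (alpha_t * (1 - p_t)**gamma * bce).sum()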

val_ds2 = ConcatLblDataset(md.val_ds, val_mcs)

In this line, I couldn’t figure out how md.val_ds and val_mcs could line up, since val_mcs is split at random by:

((val_mcs,trn_mcs),) = split_by_idx(val_idxs, mcs)

and md.val_ds comes from ImageClassifierData.from_csv(), which creates a different validation set at random.

Well, it turns out that the validation sets are not exactly random: the split uses a default seed, so if you don't specify the seed, both validation sets are chosen with the same records.

Hope this saves someone else some time.
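
A quick way to see the determinism, assuming the 0.7-era fastai get_cv_idxs (which takes a seed argument with a default value):

from fastai.dataset import get_cv_idxs

# same n and the same default seed -> identical validation indices on every call,
# which is why the two "random" splits line up
idxs_a = get_cv_idxs(1000)
idxs_b = get_cv_idxs(1000)
assert (idxs_a == idxs_b).all()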

1 Like

For

def one_hot_embedding(labels, num_classes):
    return torch.eye(num_classes)[labels.data.cpu()]

why is labels.data put on the CPU?

Because pytorch doesn't like it otherwise; it's the error you quoted earlier:

Performing basic indexing on a tensor and encountered an error indexing dim 0 
with an object of type torch.cuda.LongTensor. The only supported types are integers,
slices, numpy scalars, or if indexing with a torch.LongTensor or torch.
ByteTensor only a single Tensor may be passed.

Pytorch doesn't allow indexing with a cuda tensor, only with integers, slices, numpy scalars, or a torch.LongTensor/ByteTensor.
labels.data is a torch.cuda.LongTensor because it's stored on the GPU during training, so we have to move it back to the CPU to turn it into a torch.LongTensor.
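
As an aside, the one_hot_embedding trick itself is just fancy indexing into an identity matrix: row i of torch.eye(num_classes) is the one-hot vector for class i. A small standalone example with a CPU LongTensor:

import torch

num_classes = 4
labels = torch.LongTensor([0, 2, 3])        # CPU LongTensor, as required here
one_hot = torch.eye(num_classes)[labels]
print(one_hot)
# rows are the one-hot encodings of classes 0, 2 and 3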

…and I have no idea why not. It looks like a bug to me, or at least a missing feature. I see no reason why pytorch shouldn’t support indexing with a cuda tensor.