Having problems running pascal.ipynb notebook

tsar · September 22, 2019, 8:58pm

First of all, thank you very much for the wonderful course, @jeremy!
I have studied 12 lessons so far. The course really helped me understand deep learning at some level, where other videos and articles had not.

I found some notebooks that were not covered in course-v3, for example the object detection notebook pascal.ipynb.

I already found the topic where @muellerzr says that object detection will be a separate course.
But I thought I could learn it by myself (maybe with a little help from the community). I really want to build an object detection project that detects cards from the game “Set”, and I have a few more ideas for deep learning projects in mind.

So I tried to understand pascal.ipynb (I’m not all the way through it yet) and to run it.
It fails at the first lr_find():

 in _unpad(self, bbox_tgt, clas_tgt)
     21         print("clas_tgt: ", clas_tgt)
     22         print("self.pad_idx: ", self.pad_idx)
---> 23         i = torch.min(torch.nonzero(clas_tgt-self.pad_idx))
     24         return tlbr2cthw(bbox_tgt[i:]), clas_tgt[i:]-1+self.pad_idx
     25 

RuntimeError: invalid argument 1: cannot perform reduction function min on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/THC/generic/THCTensorMathReduce.cu:64

As you can see above, I added debug output for clas_tgt and pad_idx.

I found that it crashes when clas_tgt contains only zeros.

clas_tgt:  tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')
self.pad_idx:  0

And it is clear why it fails: torch.nonzero returns an empty tensor, and torch.min can’t reduce a tensor with no elements.

I started wondering how clas_tgt can sometimes contain only zeros. I don’t completely understand the notebook yet, but I guessed that some images without bboxes reach this point. So I checked the databunch and found that there are indeed some images with no bboxes.

Here is my dirty way to check that:

data = get_data(1,128)

i = 0
for smth in data.train_dl.dl:
    #print(smth[1][0].shape)
    num_objs = smth[1][0].shape[1]
    i += 1
    print(i, num_objs)
    if num_objs == 0:
        print(smth[0].shape)
        print(smth[0].squeeze(0).shape)
        show_image(smth[0].squeeze(0))
    assert(num_objs > 0)

I spent some more time and found that all original images have bboxes, but an image can end up with no bboxes in its visible area after this line:
src = src.transform(get_transforms(), size=size, tfm_y=True)

The same thing happens even if I remove the augmentations and only resize: src = src.transform(size=size, tfm_y=True)
For example, one image simply gets cropped to a square and loses its bbox.
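To make the cropping effect concrete, here is a minimal plain-Python sketch. It does not use fastai; the helper name `center_crop_box`, the pixel-based `(left, top, right, bottom)` box convention and the image size are all assumptions made for illustration. It shows how a center crop to a square can leave a box with no visible area at all.

```python
def center_crop_box(box, width, height):
    """Map a (left, top, right, bottom) pixel box into a centered square crop.

    Returns the clipped box in crop coordinates, or None if nothing of the
    box remains inside the crop. Hypothetical helper, not a fastai function.
    """
    side = min(width, height)
    x0 = (width - side) // 2   # left edge of the crop in the original image
    y0 = (height - side) // 2  # top edge of the crop in the original image
    left, top, right, bottom = box
    left, right = max(left - x0, 0), min(right - x0, side)
    top, bottom = max(top - y0, 0), min(bottom - y0, side)
    if left >= right or top >= bottom:
        return None            # the box lies entirely outside the crop
    return (left, top, right, bottom)


# A 600x300 image is cropped to its central 300x300 square (x from 150 to 450).
print(center_crop_box((10, 50, 80, 200), 600, 300))    # None
print(center_crop_box((100, 50, 200, 200), 600, 300))  # (0, 50, 50, 200)
```

An image whose only object sits near the left or right edge would end up with an empty target in this situation, which matches the all-zero clas_tgt seen above.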

I decided to filter out images that lose their bboxes after the transform. I added this line of code after the transform:
src = src.filter_by_func(lambda x, y: len(y[0]) == 0)

And then I ran into a bug in fastai v1: get_data failed on the next line (creating the databunch) with this error: AttributeError: 'ObjectCategoryList' object has no attribute 'pad_idx'
Even print(src) after filtering raised the same error.

After a few hours of debugging, I found that this line of code in data_block.py loses y.pad_idx.

Proof

Monkey patching LabelList.filter_by_func:

def filter_by_func(self, func:Callable):
    filt = array([func(x,y) for x,y in zip(self.x.items, self.y.items)])
    self.x = self.x[~filt]
    print('before: ', 'pad_idx' in vars(self.y))
    self.y = self.y[~filt]
    print('after:  ', 'pad_idx' in vars(self.y))
    return self

LabelList.filter_by_func = filter_by_func

Results:

before:  True
after:   False
before:  True
after:   False
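One plausible explanation, sketched with a toy class rather than fastai's real code (so treat it as an assumption, I have not traced every line of data_block.py): if indexing builds a brand-new list object and only copies over what the constructor knows about, then any attribute attached to the old object afterwards simply does not exist on the new one.

```python
class ToyList:
    """Hypothetical stand-in for an item list; not fastai's real class."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, mask):
        # Boolean-mask indexing builds a new object from the kept items.
        # Only constructor arguments are carried over, nothing set later.
        kept = [item for item, keep in zip(self.items, mask) if keep]
        return ToyList(kept)


y = ToyList(["a", "b", "c"])
y.pad_idx = 0                    # attached after construction
sub = y[[True, False, True]]

print("pad_idx" in vars(y))      # True
print("pad_idx" in vars(sub))    # False: the new object never received it
```

If that is what happens, a proper fix would make the attribute part of what gets copied on indexing, rather than restoring it by hand after each slice.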

I do not know how to fix this properly, so I created a dirty monkey-patch for temporary use:

Temporary fix

Monkey patch, only for object detection dataset.

def filter_by_func(self, func:Callable):
    filt = array([func(x,y) for x,y in zip(self.x.items, self.y.items)])
    self.x = self.x[~filt]
    pad_idx = self.y.pad_idx  # save pad_idx
    self.y = self.y[~filt]
    self.y.pad_idx = pad_idx  # set pad_idx
    return self

LabelList.filter_by_func = filter_by_func

This helped me to run get_data without crashes.

But the filtering didn’t help: it did not change the sizes of the train and validation datasets. It seems that bboxes that end up outside the image after the transforms are removed only while the databunch is being created, i.e. after the filter has already run.

Right now I’m too tired of debugging, so I decided to share my experience, report the bug in fastai v1, and ask for some help.

Thank you!

cap_rogers · September 23, 2019, 4:06pm

Where did you find the notebook?

tsar · September 23, 2019, 4:31pm

Here: permalink.

Current state: I continued studying the notebook and managed to work around the crash. Here is my patched version of RetinaNetFocalLoss._unpad:

    def _unpad(self, bbox_tgt, clas_tgt):
        nonzero = torch.nonzero(clas_tgt-self.pad_idx)
        if nonzero.shape[0] > 0:
            i = torch.min(nonzero)
        else:
            i = clas_tgt.shape[0]
        return tlbr2cthw(bbox_tgt[i:]), clas_tgt[i:]-1+self.pad_idx

Details: this function expects a list of classes that may have some zeros at the start and no zeros after that, for example [0 0 0 15 6 9]. The zeros are there because targets are padded to a common length within a batch. But it’s also possible for the list to contain only zeros, so I added a workaround to handle this case.
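The index logic can be checked without a GPU. This is a plain-Python sketch of the same idea on ordinary lists; `unpad_classes` is a hypothetical name, and the real method works on tensors and also converts the bboxes, which is left out here.

```python
def unpad_classes(clas_tgt, pad_idx=0):
    """Drop leading padding and shift the remaining class ids down by one."""
    # Positions of all entries that are real classes, not padding.
    non_pad = [k for k, c in enumerate(clas_tgt) if c != pad_idx]
    # With no real classes, start the slice at the end so it is empty.
    i = min(non_pad) if non_pad else len(clas_tgt)
    return [c - 1 + pad_idx for c in clas_tgt[i:]]


print(unpad_classes([0, 0, 0, 15, 6, 9]))  # [14, 5, 8]
print(unpad_classes([0, 0, 0]))            # []
```

Using the list length as the fallback index means an all-padding target yields an empty result instead of raising.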

Reading the “Retina net notebook merge idx issue” topic also made me wary about how the model is built.
I plan to try both slicing variants: [0,2,1] and [-2:-4:-1].

heye0507 · September 23, 2019, 4:41pm

If you are saying that some of the images in Pascal have no labels, I think you are probably right.

I had a similar issue when running the first couple of epochs on the baseline model: it complained that it could not get the loss for a non-type.

I didn’t try to clean the dataset, as I mentioned in another post (I need to get this thing working in a week…). I used a try/except block to set the loss for unlabeled images to 0. (No label, so any prediction is right?)

As you can tell, this is not a good way to do it. I’m also planning to find some time to investigate the issue further once I’m done with NLP…

tsar · September 23, 2019, 4:48pm

Actually, there are no images without labeled objects in the Pascal dataset. The problem is that sometimes all of the objects end up out of bounds after the transforms. I described this in the first message of this topic.

heye0507 · September 23, 2019, 6:09pm

Ha, I think you dug deeper than I did.

I encountered the same problem during initial training, but I didn’t think much about it and assumed it was a labeling issue.

That makes total sense (though only a very small fraction of my images actually threw the error). There must be something about the BCE loss that kept things running. I suspect it is because I do not count background in the BCE loss: the loss only looks at the real labels, without a background class, and at inference time I set a threshold and treat anything below it as background. So during actual training I am not adjusting the weights for background in backprop…

But I simply set the loss to zero whenever it threw the error, so I guess there’s more to dig into. Let me know if you figure it out!

Thanks,

hallvagi · October 7, 2019, 1:05pm

I got the exact same error message when trying to run object detection on the COCO dataset. I have been using a fastai v1 implementation by @Bronzi88, mentioned in this thread: Object detection in fast.ai v1

I tried examples/CocoTiny_Retina_Net.ipynb from the repo (https://github.com/ChristianMarzahl/ObjectDetection) and got the error message. I think you have identified the correct issue. Great debugging, btw! Cropping rather than squishing when resizing seems to be the cause. This is also discussed by Jeremy in lesson 8 (v2) at https://youtu.be/Z0ssNAbe81M?t=5454

A fix that worked for me was using the following code for the databunch:

data = (ObjectItemList.from_folder(path)  # path: dataset root (placeholder; the argument was missing in the original post)
        .split_by_rand_pct()
        .label_from_func(get_y_func)
        .transform(tfm_y=True, size=size, resize_method=ResizeMethod.SQUISH)
        .databunch(bs=64, collate_fn=bb_pad_collate))
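Why squishing avoids the problem can be seen with a small plain-Python sketch. The helper `squish_box` and the pixel-based `(left, top, right, bottom)` convention are assumptions for illustration, not fastai's internals: the whole image is rescaled, so every box is rescaled with it and none can fall outside the result.

```python
def squish_box(box, width, height, size):
    """Rescale a pixel box when the full image is resized to size x size.

    Hypothetical helper: nothing is cut away, so the box always survives,
    although its aspect ratio changes along with the image's.
    """
    left, top, right, bottom = box
    sx, sy = size / width, size / height  # independent scale per axis
    return (left * sx, top * sy, right * sx, bottom * sy)


# A box near the left edge of a 600x300 image, resized to 300x300.
print(squish_box((10, 50, 80, 200), width=600, height=300, size=300))
# (5.0, 50.0, 40.0, 200.0)
```

The trade-off is that objects get distorted, which is why some people prefer padding instead.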

Finally, it seems that certain transforms, like too much rotation, can cause the same issue.

Regards
Hallvar

Cdk296 · October 8, 2019, 9:24am

Ahh, thank you! I had been running into this error for 2 days without understanding it. I like this workaround (I still prefer reflection padding, but I will also try squish).

stev3 · March 5, 2020, 5:43pm

I ran into the same issue. I fixed it by changing this line:

i = torch.min(torch.nonzero(clas_tgt-self.pad_idx))

To this:

 i = torch.min(torch.nonzero(clas_tgt - self.pad_idx)) if sum(clas_tgt) > 0 else 0
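Two caveats, as far as I can tell: the `sum(clas_tgt) > 0` test assumes pad_idx is 0, and the fallback index matters. The plain-Python sketch below (a hypothetical helper on ordinary lists, not the real tensor code) compares a fallback of 0 with the list length used in the earlier patch for an all-padding target.

```python
def unpad_with_fallback(clas_tgt, fallback, pad_idx=0):
    """Same slicing as _unpad, with a configurable index for all-padding input."""
    non_pad = [k for k, c in enumerate(clas_tgt) if c != pad_idx]
    i = min(non_pad) if non_pad else fallback
    return [c - 1 + pad_idx for c in clas_tgt[i:]]


all_padding = [0, 0, 0]
print(unpad_with_fallback(all_padding, fallback=0))                 # [-1, -1, -1]
print(unpad_with_fallback(all_padding, fallback=len(all_padding)))  # []
```

With a fallback of 0 the padding rows are kept and become class -1. I have not checked how the rest of the loss treats those entries, so it may be harmless, but it is not the same result as returning an empty target.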

Arateris · April 28, 2020, 8:20am

Thanks!
I had a similar issue here. This one-liner works well!

pankaj_kvhld · June 7, 2020, 4:05pm

Solved it for me too. Thanks a ton!