There's a whole lot of work between now and next Monday!
Not full walkthroughs but pretty good run-throughs: Two Minute Papers
Looks like a new channel aimed at going in-depth. His VAE video is great: ArXiv Insights
Yes, we will see this in the segmentation problem (U-Net, LinkNet, and other models).
Single Shot Detection, or the SSD Multibox work Jeremy discussed, introduces a lot of new concepts, such as the matching problem, non-maximum suppression, and re-thought loss functions.
I found this useful to give me a quick grasp of the terms: https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab
Hope you do too
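To make non-maximum suppression concrete, here's a minimal NumPy sketch of the greedy version: keep the highest-scoring box, drop everything that overlaps it too much, and repeat. This is an illustrative implementation, not the exact one from the SSD paper (box format and threshold are assumptions).

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; format is [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best box, suppress overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        overlaps = iou(boxes[i], boxes[rest])
        order = rest[overlaps <= iou_thresh]  # drop near-duplicates
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives.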
They've only covered a limited number of papers to date. It's a new channel, so hopefully we'll get to see more.
Good thing we have holidays this weekend. Perfect timing!
Going through the code as well as the papers will be extremely challenging.
The camera input goes through SSD in most autonomous cars, but of course they have lidar and other instruments as well, generating auxiliary information.
Is the maximum number of objects detected limited by anything? Like if I had 9 anchor boxes, would I be limited to 9 objects being detected?
If you didn’t scale the anchor boxes or change their dimensions, then yes, just 9.
In the examples Jeremy gave, there were 3 size scalings and 3 aspect ratios (1:2, 1:1, 2:1) for each box, so you can detect a maximum of 81 objects.
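Here's a small sketch of how k = 3 scales × 3 aspect ratios = 9 anchor boxes might be generated around one grid position. The base size and scale values are made-up examples, not the lesson's actual numbers; the ratio math just keeps the area constant per scale while varying width/height.

```python
# Hypothetical anchor generator: 3 scales x 3 ratios = 9 boxes per position.
# `base`, `scales`, and `ratios` are illustrative values, not SSD's defaults.
def make_anchors(cx, cy, base=16, scales=(1.0, 2.0, 3.0), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            # w/h = r while the area stays (base*s)^2 for every ratio
            w = base * s * (r ** 0.5)
            h = base * s / (r ** 0.5)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

With 9 grid positions, calling this at each one gives 9 × 9 = 81 candidate boxes, which is where the "maximum of 81 objects" figure comes from.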
I see your point, but I’m still confused about whether this matters when pushing the gradients. The weights are changed based on how much the loss changes with respect to them, and I’m not certain that one loss being 10 times bigger than the other directly means the weights are 10x more sensitive to that loss.
I’ve had this question for a long time; I probably need to do some math to see it.
No, the only thing that will limit you is the number of classes you are trying to predict.
For the object detection history slide Jeremy showed, this famous blog post provides a lot of the details: https://towardsdatascience.com/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9
I also liked a TED talk by the YOLO researcher: https://www.youtube.com/watch?v=Cgxsv1riJhI
…At that particular position where (k=9) anchor boxes were generated.
Right, I meant more that if there were 9 base square boxes and we created 9 anchor boxes for each one, you could only detect up to 81 objects. There will always be a limit on the number of objects you can detect based on your design choices.
I had a lot of context going into this lesson which really helped. Here’s a list of videos and papers, roughly in order, that I went through during my Computer Vision deep dive:
Also arXiv-vanity converts papers to HTML and makes them a bit easier on the eyes.
Of all of them, the two CS231N lectures, both taught by J. Johnson, were the most important for me. They both present something of a history and logical progression of object detection methods, with the first halves covering R-CNN (region-based ConvNets) and the latter halves covering SSDs. I highly recommend these two for conceptual background for this lesson. Everything here (except Focal Loss) is covered in them.
CS231N S2016 L8 — Localization and Detection
CS231N W2017 L11 — Detection and Segmentation
YOLO CVPR 2016 talk – focuses in more detail on the idea of using grid cells and treating detection as a regression problem.
YOLOv2 talk - there is some good information in this talk, although some drawn explanations are omitted from the video. What I found interesting was the bit on learning anchor boxes from the dataset. There’s also the crossover with NLP at the end.
Focal Loss ICCV17 talk – also recommended
I’m sure these have been linked a lot by now, so just here for reference.
I also looked into the deeplearning.ai videos a bit to try and focus in on some topics:
Coursera Andrew Ng videos:
- Object Detection
- Bounding Box Predictions
- Intersection Over Union
- Non-Max Suppression (NMS)
- Anchor Boxes
- YOLO Algorithm
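As a concrete companion to the "Bounding Box Predictions" and "Anchor Boxes" topics above, here's a sketch of YOLOv2-style box decoding for a single grid cell: the network predicts raw offsets (tx, ty, tw, th), a sigmoid keeps the predicted centre inside its cell, and an exponential scales an anchor's prior width/height. Function and parameter names are my own, for illustration.

```python
import math

# Sketch of YOLOv2-style box decoding for one grid cell (names are mine).
def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h):
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = cell_x + sigmoid(tx)     # centre x, in grid units (stays in the cell)
    by = cell_y + sigmoid(ty)     # centre y, in grid units
    bw = anchor_w * math.exp(tw)  # width scales an anchor prior
    bh = anchor_h * math.exp(th)  # height scales an anchor prior
    return bx, by, bw, bh
```

With all-zero predictions the box sits at the cell's centre with exactly the anchor's dimensions, which is why learning anchor shapes from the dataset (mentioned in the YOLOv2 talk) gives the network a good starting point.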
For me going forward, I haven’t gotten to Feature Pyramid Networks yet, and I’m glad to see we’re going to cover it. For the Mask-RCNN type methods, I had a macabre hunch that Facebook would be very good at picking out faces from crowds… and they have some great research on the topic if you want to check it out:
I may have missed a few things but that’s basically it.
As a beginner I had a very hard time understanding these. J. Johnson gives a very high-level idea with few details, so I would not recommend these to pure beginners.
Instead, I would recommend the deeplearning.ai MOOC (Course 4, Week 3) as an awesome starting point; later on, one can move to the resources mentioned above.
An imbalanced loss function would lead to a situation where you have a relatively low loss value only because the model is doing well on one part of the task but not the other. My guess is that imbalanced losses lead to overfitting on one task and underfitting on the other. It also somewhat resembles input and layer normalization, where we scale values in a similar fashion.
Math-wise, I think it shows up in the weight updates: W = W - \alpha \, (dL/dW). If L = 1000 \, L_a + L_b, then W = W - 1000 \, \alpha \, (dL_a/dW) - \alpha \, (dL_b/dW). Here the update coming from L_a is 1000 times bigger than the one from L_b (assuming L_a and L_b have a similar scale).
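A tiny numeric check of that claim, using toy quadratic losses in a single scalar weight (the functions and values are made up purely for illustration): the gradient of the combined loss is exactly 1000× the L_a gradient plus the L_b gradient, so the L_a term dominates the update.

```python
# Finite-difference check that L = 1000*L_a + L_b gives a 1000x bigger
# update contribution from L_a. Toy losses; values are illustrative only.
def grad(f, w, eps=1e-6):
    return (f(w + eps) - f(w - eps)) / (2 * eps)  # central difference

L_a = lambda w: (w - 1.0) ** 2       # similar-scale toy losses
L_b = lambda w: (w + 1.0) ** 2
L   = lambda w: 1000 * L_a(w) + L_b(w)

w = 0.0
dLa, dLb, dL = grad(L_a, w), grad(L_b, w), grad(L, w)
# dL/dw = 1000 * dL_a/dw + dL_b/dw, so a gradient step mostly optimizes L_a
```

At w = 0 the individual gradients are of similar magnitude (-2 and +2), yet the combined gradient is ≈ -1998, i.e. almost entirely driven by L_a.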
Hi there, there’s a problem: the live stream is 1h59m long, but I suppose the lecture ran longer than that, given that the stream starts in the middle of a cell execution. Is there any way to recover what was “lost” at the beginning?
I’m particularly interested because I wanted to check whether the “missing object” bounding box was discussed. I don’t think it was, but I think it might be worth discussing!