Unsupervised semantical segmentation learning

Hi everyone,
I’ve seen a YouTube video (https://www.youtube.com/watch?v=ydh_AdWZflA) where a robot pick pieces in a bin automaticaly.
The robot is able to detect the pieces and knows how to grip them thanks to a deep Learning algorithm.
We can see on the video that the robot takes a picture of the bin and then uses a CNN (tell me if I’m wrong) to make a semantic segmentation of the image (with to categories : grippable points and non-grippable points).
Then it tries ton grip a part in the “grippable” region, this way it can learn and refine the segmentation.

My question is : what kind of network or what architecture can I use to reproduce this behavior ?
As we do not label the regions before, I guess it’s a kind of unsupervised Learning and I’d like to know how it works !

Thank you very much

This looks a reinforcement learning problem.
They use a CNN to identify the objects I believe. But practice seems to allow the robot learn that cylinders vs edges are better.

You mean that we could improve the semantic segmentation with only trying to grab a piece and getting one success/failure feedback, acting as the reward ?

It is feasible that the more the robot practices the more the network identifies that grabbing certain parts of the object lead to success. I’m not sure it improves segmentation. I think it just focuses attention on part of the object that leads to better success