Use of image inpainting as a pretext task

Hi folks,

I am currently exploring the field of self-supervised learning, and I started with @jeremy’s blog post on the topic to get myself introduced to it.

Now, I am trying to do the following:

  • Train an image inpainting model on the CIFAR10 dataset using Partial Convs.
  • Use the knowledge of the image inpainting model for a downstream task like image classification on the same CIFAR10 dataset.

Here’s what I have done so far:

  • Trained an image inpainting model on CIFAR10. Here’s a blog post about it that explains some of the thought process behind it. Note that this is joint work with one of my college juniors, Ayush.

We followed a U-Net-based architecture there, with an encoder and a decoder. Now, what I’d like to do is take the encoder part of the inpainting model and attach a downstream classification top to it.

Some details about the image inpainting model:

This is how the data is fed to the image inpainting model: the inputs are [Masked_Images, Masks], fed together, and the labels are the original images the inpainting model is trying to reconstruct.
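To make that concrete, a single batch looks roughly like the sketch below (minimal NumPy, not our actual data pipeline; the shapes assume CIFAR10, and the exact mask layout depends on the PartialConv implementation):

```python
import numpy as np

# A batch of CIFAR10 images: (batch, 32, 32, 3), scaled to [0, 1]
images = np.random.rand(8, 32, 32, 3).astype("float32")

# Binary masks: 1 = valid pixel, 0 = hole to inpaint
masks = (np.random.rand(8, 32, 32, 3) > 0.25).astype("float32")

# Masked images: the holes are zeroed out
masked_images = images * masks

# One training batch for the inpainting model:
# inputs = [Masked_Images, Masks], labels = original images
inputs, labels = [masked_images, masks], images
```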

Now, in the case of the downstream task, i.e. classification, I keep the masks and the masked images the same as the original images, and the labels, in this case, are the actual class labels.
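Roughly, the downstream data preparation looks like this (a sketch; the all-ones mask in the comment is a suggested alternative, not what we currently do):

```python
import numpy as np
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32") / 255.0

# For classification there is nothing to inpaint, so the "masked image"
# is just the original image, and the mask is kept the same as well.
masked_images = x_train
masks = x_train.copy()
# (An all-ones mask, i.e. np.ones_like(x_train), would be the usual way
# to tell a partial conv that every pixel is valid.)

# The labels are now the actual class labels
labels = to_categorical(y_train, num_classes=10)
```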

Coming to the downstream model, I am only considering the encoder part of the inpainting model, as mentioned before, and on top of that I am attaching a classification top.
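A rough sketch of that wiring, assuming a Keras functional model; the saved-model path and the layer name "encoder_output" are placeholders for whatever the actual model uses:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# The trained inpainting U-Net; in practice, loading would also need
# custom_objects for the PartialConv layer.
inpainting_model = tf.keras.models.load_model("inpainting_unet.h5")
bottleneck = inpainting_model.get_layer("encoder_output").output

# Encoder: same [masked_image, mask] inputs, bottleneck features as output
encoder = Model(inputs=inpainting_model.inputs, outputs=bottleneck)
encoder.trainable = False  # optionally freeze the pretrained weights at first

# Classification top
x = layers.GlobalAveragePooling2D()(encoder.output)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(10, activation="softmax")(x)

classifier = Model(inputs=encoder.inputs, outputs=outputs)
classifier.compile(optimizer="adam",
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])

# Training then uses the two-input format from above:
# classifier.fit([masked_images, masks], labels, ...)
```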

So far, the performance of the downstream model hasn’t been great. I’d be really grateful for any feedback on this approach :slight_smile:

I’ve one doubt: by Partial Convolution, do you mean that the output depends only on the unmasked inputs?

If so, this might be rather silly, but don’t you think the model itself is supposed to learn what to ignore, instead of being fed those masks manually?

In the current architecture, we are supplying both the masked image and the corresponding mask as the features, and the labels are the original images, as I mentioned in the initial post.

The masks are fed to the encoder because of the way it’s defined. It accepts a masked image along with the mask associated with it.
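And yes, that is exactly what a partial convolution does: the output at each location depends only on the unmasked inputs, renormalized by the number of valid pixels under the kernel, and the mask itself is updated as it passes through each layer (Liu et al., “Image Inpainting for Irregular Holes Using Partial Convolutions”). A naive single-channel sketch of the idea, not our actual layer:

```python
import numpy as np

def partial_conv2d(image, mask, kernel):
    """Naive single-channel partial convolution (stride 1, no padding, no bias).

    Masked pixels are zeroed out before the weighted sum, and the result is
    rescaled by the fraction of valid pixels under the kernel, so the output
    depends only on unmasked inputs. The mask is also updated: an output
    location becomes valid if any input pixel under the kernel was valid.
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    new_mask = np.zeros_like(out)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            img_patch = image[i:i + kh, j:j + kw]
            mask_patch = mask[i:i + kh, j:j + kw]
            valid = mask_patch.sum()
            if valid > 0:
                # Renormalize so the output scale doesn't depend on hole size
                out[i, j] = (img_patch * mask_patch * kernel).sum() * (kh * kw / valid)
                new_mask[i, j] = 1.0
    return out, new_mask
```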