Reducing labeled data needs - CPC 2.0 from DeepMind

DeepMind published a new paper called “Data-Efficient Image Recognition” and introduced CPC 2.0. They achieve a new state of the art on object recognition by transfer learning with a CPC-trained ResNet and, more importantly, set new milestones for training with 2-5x less labeled data:

CPC for vision basically takes an image, cuts it into overlapping patches, and creates a feature vector from each patch. The NN is then trained by asking it to pick the correct feature vector for a patch lower in the image from amongst a series of negative feature vectors taken from other images.
In other words, it helps the network build better representations of the objects in the image.
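To make the patching step concrete, here is a minimal NumPy sketch. The 256-pixel image size, 64-pixel patches, and 32-pixel stride are my assumptions, chosen so the result matches the 7×7 grid that shows up in the paper's pseudo code; `extract_patches` is a hypothetical helper, not from the paper.

```python
import numpy as np

def extract_patches(img, patch=64, stride=32):
    """Cut an HxWxC image into overlapping patches (stride < patch => overlap)."""
    H, W, C = img.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    out = np.empty((rows, cols, patch, patch, C), dtype=img.dtype)
    for r in range(rows):
        for c in range(cols):
            out[r, c] = img[r*stride:r*stride+patch, c*stride:c*stride+patch]
    return out

img = np.zeros((256, 256, 3), dtype=np.float32)
grid = extract_patches(img)
print(grid.shape)  # (7, 7, 64, 64, 3) -- the 7x7 grid from the pseudo code
```

Each of the 7×7 patches then gets encoded into its own feature vector before the contrastive prediction step.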

I wrote a summary article with more info here:

And full paper is here:

The authors indicate CPC 2.0 will be open sourced soon, so I’m hoping we can look at integrating it into FastAI 2.0 :slight_smile:

Best regards,


This is quite good. Thanks for sharing!

Can you please start a discussion here again when the official code is released?

1 Like

re: discussion when code is released - definitely :slight_smile:

In appendix A.2 on page 14 of the publication they outline the setup in pseudo code. It looks quite compact, but you have to be careful to keep track of the tensor dimensions.

The setup reminds me of language model pretraining for image data.

(A nice detailed summary of other self-supervised representation learning approaches was also posted recently. The first image in the article, taken from a talk by LeCun, is a great visual explanation.)


I was looking through the pseudo code in detail:

batch_dim = B
batch of images [B×7×7×4096]

pixelCNN = context network
latents [B×7×7×4096]
cres [B×7×7×4096]

Downsampling in the pixelCNN:
[B×7×7×4096] → [B×L×L×256] → [B×7×7×4096] (L = spatial size of the downsampled grid, not calculated for this example)

However, I am asking myself why the pixelCNN goes through the for loop 5 times, adding c to cres each time?
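My current reading is that the loop stacks five residual blocks: each pass refines c and adds it onto the running sum cres, growing the context's receptive field while keeping the [B×7×7×4096] shape. A toy NumPy sketch of just the accumulation (this is my interpretation, and `block()` is a stand-in for the real masked-conv + ReLU stack):

```python
import numpy as np

def block(x):
    # stand-in for the masked convolution + ReLU stack inside the loop
    return 0.1 * x

def pixel_cnn_accumulate(latents, n_blocks=5):
    # cres accumulates each block's output, mirroring `cres = cres + c`
    cres, c = latents, latents
    for _ in range(n_blocks):
        c = block(c)
        cres = cres + c
    return cres  # shape is preserved: [B, 7, 7, C]

x = np.ones((2, 7, 7, 8))
out = pixel_cnn_accumulate(x)
print(out.shape)  # (2, 7, 7, 8)
```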

CPC loss
col_dim = 7
row_dim = 7
target_dim = 64
targets [B×7×7×64] → [(B×7×7)×64]
col_dim_i = 7 - i - 1
preds_i [B×7×7×64] → [(B×7×7)×64]
logits [(B×7×7)×64] @ [64×(B×7×7)] → [(B×7×7)×(B×7×7)]

However, I am still struggling with the labels part below in the code, i.e., b, col, labels, and loss calculation. Maybe somebody else is also trying to make sense out of it and wants to discuss it?
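For what it's worth, here is how I currently understand the logits/labels part, as a NumPy sketch. I am assuming preds and targets are flattened in the same order, so the correct label for prediction k is simply k (my reading is that the b and col arithmetic in the paper's pseudo code reconstructs exactly those flat indices); every other row, including patches from other images in the batch, acts as a negative:

```python
import numpy as np

rng = np.random.default_rng(0)
B, rows, cols, d = 2, 7, 7, 64

# [B, 7, 7, 64] -> [(B*7*7), 64], as in the pseudo code
targets = rng.standard_normal((B, rows, cols, d)).reshape(-1, d)
preds = rng.standard_normal((B, rows, cols, d)).reshape(-1, d)

logits = preds @ targets.T            # [(B*7*7), (B*7*7)]
labels = np.arange(B * rows * cols)   # true target index for each prediction

# cross-entropy over all flattened patches: the diagonal holds the positives
m = logits.max(axis=1, keepdims=True)  # for numerical stability
log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
loss = -log_probs[np.arange(len(labels)), labels].mean()
print(logits.shape)  # (98, 98)
```

Happy to be corrected if the label arithmetic means something else.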

(PS: cross-posted on the reddit publication thread.) An article going in the direction of this topic:


Thanks for posting this @MicPie - now I’m very interested to check out the `fine_tune` option in FastAI2.

There’s another paper out on training with weakly labeled data first and then meeting or beating SOTA with less than 10% of the labeled data. I’ll try to post that paper shortly (need to find it again).

1 Like

@LessW2020 from what I can see it’s basically unfreezing mid-training:

I looked at the source a bit more after writing that. The frozen phase runs at 2x the chosen base learning rate and, with pct_start=0.99, spends almost the whole phase on the one-cycle ramp-up before unfreezing.


oh you’re right, thanks for pointing this out - I had visions of something much more intricate, but it’s basically compressing a couple of standard steps.

def fine_tune(self:Learner, epochs, base_lr=1e-3, freeze_epochs=1, lr_mult=100,
              pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `freeze` for `freeze_epochs` then with `unfreeze` from `epochs` using discriminative LR"
    self.fit_one_cycle(freeze_epochs, slice(base_lr*2), pct_start=0.99, **kwargs)
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)
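To see the control flow without pulling in fastai, here is a toy stub that mimics the two calls above. The `freeze`/`unfreeze` steps are implied by the docstring rather than shown in the snippet, and this stub only records what would be run; it is a sketch, not fastai's actual implementation.

```python
class ToyLearner:
    """Records the schedule fine_tune would run; no actual training."""
    def __init__(self):
        self.calls = []

    def freeze(self): self.calls.append("freeze")
    def unfreeze(self): self.calls.append("unfreeze")

    def fit_one_cycle(self, epochs, lrs, pct_start=0.25, div=25.0):
        self.calls.append((epochs, lrs, pct_start))

    def fine_tune(self, epochs, base_lr=1e-3, freeze_epochs=1, lr_mult=100,
                  pct_start=0.3, div=5.0):
        self.freeze()
        # frozen phase: 2x base LR, almost all ramp-up (pct_start=0.99)
        self.fit_one_cycle(freeze_epochs, base_lr * 2, pct_start=0.99)
        self.unfreeze()
        # unfrozen phase: discriminative LRs from base_lr/lr_mult up to base_lr
        self.fit_one_cycle(epochs, (base_lr / lr_mult, base_lr),
                           pct_start=pct_start, div=div)

learn = ToyLearner()
learn.fine_tune(5)
print(learn.calls[0], learn.calls[2])  # freeze unfreeze
```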

Was it this one?

(GitHub repo)

It is hard to keep up with the output of publications from Google & Co.
(We need another thread for papers like that to share & discuss. #toolongreadinglist)! :wink:


A very nice PDF slide deck on Self-Supervised Learning with a lot of nice figures:


FixMatch, simpler and yet more powerful than (Re)MixMatch:


Thanks @MicPie for posting this paper. FixMatch looks much more straightforward to implement while being more powerful. Great to see!

Interesting repository that could be a good reference for CPC in fastai2:

PyTorch implementation of Data-Efficient Image Recognition with Contrastive Predictive Coding


A very nice blog post explaining contrastive self-supervised learning:


Hi MicPie, hope you're having a wonderful day!
Wow, a very enlightening and informative post. I am convinced we need to change IP as a metaphor for the human brain.

Cheers mrfabulous1 :smiley: :smiley:


This is the next level:
No labels, just text annotations are needed to get very good embeddings.


I was just reading about that today. I wonder if it has some applications in the text/tabular realm too (curious to see if anyone starts playing with it).

The DAIR paper reading meet-up is covering this paper this Saturday, which is worth joining if you have the time!