Reducing labeled data needs - CPC 2.0 from DeepMind

(Less ) #1

DeepMind published a new paper called “Data Efficient Image Recognition” and introduced CPC 2.0. They reach a new state of the art on object recognition by transfer learning from a CPC-trained ResNet and, more importantly, set new milestones for training with 2-5x less labeled data.

CPC for vision basically takes an image, cuts it into overlapping patches, and creates a feature vector from each patch. The NN is then trained by asking it to pick the correct feature vector from the bottom of the image out of a series of negative feature vectors from other images.
In other words, it pushes the network to build better representations of the objects in the image.
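
As a rough illustration of the "pick the right feature vector among negatives" idea, here is a minimal NumPy sketch. It is not the paper's code: the function name `contrastive_loss`, the dot-product scoring, and the toy vectors are all assumptions made for this example.

```python
import numpy as np

def contrastive_loss(prediction, positive, negatives):
    """Cross-entropy over dot-product scores; candidate 0 is the true target.

    Hypothetical helper for illustration, not taken from the CPC paper.
    """
    candidates = np.vstack([positive[None, :], negatives])  # [1 + N, D]
    scores = candidates @ prediction                         # [1 + N]
    scores = scores - scores.max()                           # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[0]

# Toy feature vectors: one true target and two negatives.
positive = np.array([1.0, 0.0, 0.0, 0.0])
negatives = np.array([[0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])

good_prediction = np.array([5.0, 0.0, 0.0, 0.0])  # points at the true target
bad_prediction = np.array([0.0, 5.0, 0.0, 0.0])   # points at a negative

loss_good = contrastive_loss(good_prediction, positive, negatives)
loss_bad = contrastive_loss(bad_prediction, positive, negatives)
print(loss_good, loss_bad)
```

The loss is small when the prediction scores the true patch higher than the negatives and large when it prefers a negative, which is the signal that shapes the representations.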

I wrote a summary article with more info here:

And the full paper is here:

The authors indicate CPC 2.0 will be open sourced soon, so hoping we can look at integrating it into FastAI 2.0 :slight_smile:

Best regards,


(nirant) #2

This is quite good. Thanks for sharing!

Can you please start a discussion here again when the official code is released?

(Less ) #3

re: discussion when code is released - definitely :slight_smile:


(Michael) #4

In appendix A.2 on page 14 of the publication, they outline the setup in pseudocode. It looks quite compact, but you have to be careful to keep track of the tensor dimensions.

The setup reminds me of language model pretraining for image data.

(A nice, detailed summary of other self-supervised representation learning approaches was also posted recently. The first image in the article, taken from a talk by LeCun, is a great visual explanation.)

(Michael) #5

I was looking through the pseudo code in detail:

batch_dim = B
batch of images [B×7×7×4096]

pixelCNN = context network
latents [B×7×7×4096]
cres [B×7×7×4096]

Downsampling in the pixelCNN:
[B×7×7×4096] → [B×L×L×256] → [B×7×7×4096] (L = pixel size, not calculated for the example)

However, I am wondering why the pixelCNN goes through the for loop 5 times and adds c to cres each time.
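
One possible reading, offered as an assumption rather than something confirmed in this thread: adding c to cres looks like a residual (skip) connection, so the loop would stack five residual blocks that each squeeze the channels down and expand them back. The sketch below only mimics that pattern with 1x1 convolutions written as matmuls. The helper names, the random weights, and the shrunken sizes (8 channels instead of 4096, 4 instead of 256) are made up for the example, and the spatial convolutions of a real pixelCNN are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, weight):
    """A 1x1 convolution is a matmul over the channel (last) axis."""
    return x @ weight

def residual_stack(latents, hidden_dim=4, num_blocks=5):
    """Hypothetical stand-in for the loop: bottleneck, expand, add back."""
    cres = latents
    channels = cres.shape[-1]
    for _ in range(num_blocks):
        w_down = rng.normal(scale=0.1, size=(channels, hidden_dim))
        w_up = rng.normal(scale=0.1, size=(hidden_dim, channels))
        c = np.maximum(conv1x1(cres, w_down), 0.0)  # channels -> hidden_dim, ReLU
        c = conv1x1(c, w_up)                        # hidden_dim -> channels
        cres = cres + c                             # residual update
    return cres

latents = rng.normal(size=(2, 7, 7, 8))  # [B, 7, 7, D] with a small D
context = residual_stack(latents)
print(context.shape)  # (2, 7, 7, 8)
```

Because every block maps back to the input channel count, the output keeps the same shape as the input, which matches the [B×7×7×4096] → [B×7×7×4096] bookkeeping above.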

CPC loss
col_dim = 7
row_dim = 7
target_dim = 64
targets [B×7×7×64] → [(B×7×7)×64]
col_dim_i = 7 - i - 1
preds_i [B×7×7×64] → [(B×7×7)×64]
logits [(B×7×7)×64] @ [64×(B×7×7)] → [(B×7×7)×(B×7×7)]
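
The shape bookkeeping for the loss can be checked with a small NumPy sketch. Random arrays stand in for the real targets and predictions, B = 2 is an arbitrary choice, and the shapes follow the notes above rather than the paper's exact code.

```python
import numpy as np

B, rows, cols, target_dim = 2, 7, 7, 64
rng = np.random.default_rng(0)

# Random stand-ins for the projected latents (targets) and predictions.
targets = rng.normal(size=(B, rows, cols, target_dim))
preds = rng.normal(size=(B, rows, cols, target_dim))

# Flatten batch and spatial positions into one axis: [(B*7*7), 64].
targets_flat = targets.reshape(-1, target_dim)
preds_flat = preds.reshape(-1, target_dim)

# Every prediction is scored against every target: [(B*7*7), (B*7*7)].
logits = preds_flat @ targets_flat.T

print(targets_flat.shape)  # (98, 64)
print(logits.shape)        # (98, 98)
```

Each row of `logits` holds the dot products of one prediction with all B×7×7 candidate targets, so a cross-entropy over that row asks for a single correct column.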

However, I am still struggling with the labels part below in the code, i.e., b, col, labels, and loss calculation. Maybe somebody else is also trying to make sense out of it and wants to discuss it?

(PS: cross post on reddit publication thread)


(Michael) #6

An article in the direction of this topic:


(Less ) #7

Thanks for posting this @MicPie - now I’m very interested to check out the ‘fine tune’ option in FastAI2.

There’s another paper out on using weakly labeled data first and then less than 10% of the labeled data to meet or beat SOTA. I’ll try to post that paper shortly (need to find it again).

(Zachary Mueller) #8

@LessW2020 from what I can see it’s basically unfreezing mid training:

Looked at the source a bit more after I said that. The frozen phase runs at 2x the chosen base learning rate, and with pct_start=0.99 that first one-cycle run is almost entirely learning-rate warm-up before the model is unfrozen.


(Less ) #9

oh you’re right, thanks for pointing this out - I had visions of something much more intricate but it’s basically compressing a couple of standard steps.

def fine_tune(self:Learner, epochs, base_lr=1e-3, freeze_epochs=1, lr_mult=100,
              pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `freeze` for `freeze_epochs` then with `unfreeze` from `epochs` using discriminative LR"
    self.freeze()    # freeze everything except the last parameter group
    self.fit_one_cycle(freeze_epochs, slice(base_lr*2), pct_start=0.99, **kwargs)
    self.unfreeze()  # then train all layers with discriminative learning rates
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)
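
To make the numbers concrete, here is a tiny plain-Python sketch of the learning rates those two `fit_one_cycle` calls receive with the default arguments. `fine_tune_lrs` is a hypothetical helper written for this post, not part of fastai, and it only mirrors the arithmetic in the two `slice(...)` expressions.

```python
def fine_tune_lrs(base_lr=1e-3, lr_mult=100):
    """Hypothetical helper: the learning rates implied by the two slices."""
    frozen_lr = base_lr * 2                         # slice(base_lr*2)
    unfrozen_lr_range = (base_lr / lr_mult, base_lr)  # slice(base_lr/lr_mult, base_lr)
    return frozen_lr, unfrozen_lr_range

frozen_lr, (lowest_lr, highest_lr) = fine_tune_lrs()
print(frozen_lr, lowest_lr, highest_lr)
```

With the defaults, the frozen phase uses about 2e-3, and the unfrozen phase spreads learning rates from about 1e-5 up to 1e-3 across the parameter groups.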

(Michael) #10

Was it this one?

(GitHub repo)

It is hard to keep up with the output of publications from Google & Co.
(We need another thread for papers like that to share & discuss. #toolongreadinglist)! :wink: