Big Transfer (BiT) paper from Google Brain

(Morgan McGuire) #1

Highlighting the Big Transfer (BiT) paper from Google that @jeremy pointed out on Twitter. It achieves SOTA on a wide variety of downstream computer vision tasks with fairly standard fine-tuning.

BiT-Large, pre-trained on the JFT-300M dataset, uses a fairly simple pre-training process and architecture. Interestingly (for me anyway), they ditched BatchNorm in favour of a combination of GroupNorm + Weight Standardisation in order to train with sufficiently large batches.
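Roughly, Weight Standardisation normalises each conv filter's weights to zero mean and unit variance before the convolution, and GroupNorm then normalises activations per group of channels instead of per batch. A minimal PyTorch sketch of the combination (the `StdConv2d` name, group count and layer sizes here are my own placeholders, not the paper's released code):

```python
import torch.nn as nn
import torch.nn.functional as F

class StdConv2d(nn.Conv2d):
    """Conv2d with Weight Standardisation: weights are normalised to
    zero mean / unit variance per output filter before the convolution."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# BatchNorm-free block: standardised conv followed by GroupNorm.
block = nn.Sequential(
    StdConv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.GroupNorm(num_groups=32, num_channels=128),
    nn.ReLU(inplace=True),
)
```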

Fine-tuning uses SGD, Mixup, fixed resolution scaling rules, random square crops and image flips (where appropriate).
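As a rough illustration of that fine-tuning recipe (not the paper's exact hyperparameter rule; the resolution, learning rate and class count below are placeholders, with a plain torchvision ResNet standing in for the BiT weights):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Augmentations in the spirit described above: resize, random square crop
# and horizontal flip (sizes are illustrative, not the paper's resolution rule).
train_tfms = transforms.Compose([
    transforms.Resize(160),
    transforms.RandomCrop(128),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Stand-in pretrained backbone; swap the final layer for the downstream classes.
num_classes = 10
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Plain SGD with momentum for fine-tuning; Mixup would be applied per batch
# inside the training loop (omitted here).
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
```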

They will be releasing code + weights + examples soon

Aside from ditching BatchNorm, nothing seems to be too revolutionary (or maybe I’m wrong); glad to see these fastai basics are still killing it :smiley:

They mention that additional hyperparameter tuning could lead to even better results. With @LessW2020’s optimizer work and @Diganta’s Mish activation function, the community here could probably push these results further, I’m guessing :wink:

Tweet: https://twitter.com/giffmana/status/1214240746095730688?s=20

Paper: https://arxiv.org/pdf/1912.11370.pdf

8 Likes

(Asimo) #2

They also use no Dropout or Weight Decay…
But I agree with you, this boundary can be pushed further with new optimizers, training schedules, gradual unfreezing, gradual image size increases, etc. =)
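For the gradual image size increase idea, a minimal torchvision sketch of the usual progressive-resizing pattern (the sizes and the 1.25x resize factor are arbitrary placeholders):

```python
from torchvision import transforms

def tfms_for(size: int):
    """Augmentation pipeline at a given crop size, for progressive resizing."""
    return transforms.Compose([
        transforms.Resize(int(size * 1.25)),
        transforms.RandomCrop(size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

# Train a few epochs at each size with the same model and optimizer,
# rebuilding the DataLoader with the new transforms at each stage.
stage_tfms = {size: tfms_for(size) for size in (128, 224, 320)}
```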

I think we as a fastai community can perform systematic experiments and make a nice repo of best practices on how to do transfer learning and achieve state-of-the-art results =)

2 Likes

(Morgan McGuire) #3

Ah true true, thanks for highlighting

Sounds like a paper opportunity, similar to the Bag of Tricks paper!

1 Like

(Asimo) #4

I wonder what is the most systematic way to go about this?

  1. Maybe design all the transfer learning experiments and choose a few datasets (classification).

  2. Create executable scripts

  3. And outsource it to people.

Any more ideas?

0 Likes

(Morgan McGuire) #5

All sounds reasonable to me.

Might be an idea to limit it to vision classification at first? How do we break out architecture tweaks from training techniques (e.g. training schedule)?

1 Like

(Asimo) #6

vision sounds good =)

Here is the list of things we could try, just substituting the last nn.Linear:

-training with SGD (baseline)
-Adam, One Cycle
-Mish + Adam, One Cycle
-Mish + RAdam (cos anneal)

Adding the fastai tail/head (AdaptiveConcatPool2d, Flatten, bn, drop, act, nn.Linear, etc.)

Trying to unfreeze everything, with the following:

-training with SGD (baseline)
-Adam, One Cycle
-Mish + Adam, One Cycle
-Mish + RAdam (cos anneal)

5 epochs frozen + 15 unfrozen:

-training with SGD (baseline)
-Adam, One Cycle
-Mish + Adam, One Cycle
-Mish + RAdam (cos anneal)

5 epochs frozen + 15 unfrozen (but with different learning rates):

-training with SGD (baseline)
-Adam, One Cycle
-Mish + Adam, One Cycle
-Mish + RAdam (cos anneal)

Don’t forget Mixup and label smoothing.

Anything else to add that I am missing? We should list all possible combinations and from there try to shorten this list (a rough sketch of the freeze/unfreeze variant is below).
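A rough plain-PyTorch outline of that variant, i.e. substituting the last nn.Linear, 5 epochs with the backbone frozen, then 15 epochs unfrozen with discriminative learning rates; the class count, learning rates and backbone are placeholders, and the actual training loops / One Cycle schedules are omitted:

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in pretrained backbone with the last nn.Linear swapped out.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 = downstream classes

def set_backbone_trainable(trainable: bool):
    """Freeze or unfreeze everything except the new head."""
    for name, p in model.named_parameters():
        if not name.startswith("fc."):
            p.requires_grad = trainable

# Phase 1: 5 epochs with the backbone frozen, training only the new head.
set_backbone_trainable(False)
head_opt = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)

# Phase 2: 15 epochs fully unfrozen, with discriminative learning rates
# (smaller lr for the pretrained backbone, larger for the new head).
set_backbone_trainable(True)
full_opt = torch.optim.SGD([
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc.")],
     "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
], momentum=0.9)
```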

1 Like

(Michael) #7

Also testing a tail/head with a GeM pooling layer?

The idea is based on a comment by Dmytro Mishkin:

“ConcatPool is simple [avg, max] pool. It is the 2nd best option after GeM pooling in my experience.”

And of course the optimizers from @LessW2020:
Ranger & newer optimizers
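Since GeM came up, here is a minimal PyTorch sketch of a generalised-mean (GeM) pooling layer with a learnable exponent (p=3 and eps are commonly used defaults, not values from this thread):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalised-mean pooling: average-pool of x**p, then the p-th root.
    p=1 recovers average pooling; large p approaches max pooling."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # learnable exponent
        self.eps = eps

    def forward(self, x):
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, output_size=1)
        return x.pow(1.0 / self.p)

# Drop-in replacement for a global average-pooling head.
pool = nn.Sequential(GeM(), nn.Flatten())
```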

1 Like

#8

Hi, I’m new to CV but I think this would be really interesting. Please message me; I am happy to help any way I can.

0 Likes

(Asimo) #9

I am working my way through the new version of fastai. After I am done, I will create a repo with a table of possible transfer learning experiments =)

1 Like

(Diganta Misra) #10

I’ll try to train BiT from scratch on ImageNet with Mish + AdamW for starters.
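For anyone following along, a hedged sketch of what the Mish + AdamW combination could look like in plain PyTorch; Mish is written out directly as x * tanh(softplus(x)) rather than imported from @Diganta’s repo, and the ResNet backbone and hyperparameters are stand-ins rather than the actual BiT setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def swap_relu_for_mish(module: nn.Module):
    """Recursively replace every ReLU in the model with Mish."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, Mish())
        else:
            swap_relu_for_mish(child)

model = models.resnet50(pretrained=False)  # stand-in, not the actual BiT model
swap_relu_for_mish(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```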

3 Likes

(Morgan McGuire) #11

Awesome, I was out for a few days there. Your list above sounds like a good place to start, thanks @DrHB! Yep, it’s a good idea; I would like to get myself up to speed on fastai2 too before kicking off new work.

2 Likes