I’m trying to reproduce the results obtained with FastAI in a plain PyTorch script and cannot achieve the same numbers. FastAI performs significantly better, so I suspect there is some trick it uses, and I’d like to learn that trick.
What I’ve learned so far:
FastAI cuts off the default “head” (classifier) of the pretrained model and replaces it with a custom “head”. It also freezes the “body” and trains only the custom head (sketched after this list).
It uses a cosine schedule for the learning rate and momentum, splitting training into a warmup phase (25% of the steps, learning rate rising from base_lr / 25 to base_lr, momentum falling from 0.95 to 0.85) and an annealing phase (learning rate falling from base_lr to base_lr / 1e5, momentum rising from 0.85 back to 0.95). See the optimizer sketch after this list.
It disables weight decay for bias and normalization parameters (also covered in the optimizer sketch below).
It does not use the default image preprocessor (which crops the image) and instead resizes by the longer side and pads to a square (sketched below).
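For reference, here is roughly how I mirror the head replacement and body freezing in the plain PyTorch script. This is a minimal sketch assuming a torchvision ResNet-34 and a hypothetical num_classes; as far as I can tell fastai’s actual head is richer (concat pooling, BatchNorm, dropout), so treat this as an approximation:

```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical: set to your dataset's class count

# Pretrained model; its final `fc` layer is the default "head"
model = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)

# Freeze the whole body
for param in model.parameters():
    param.requires_grad = False

# Replace the default head with a custom one (simplified compared to fastai's)
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 512),
    nn.ReLU(inplace=True),
    nn.Dropout(0.25),
    nn.Linear(512, num_classes),
)
# The new head's parameters are created after the freeze loop,
# so they keep requires_grad=True and are the only ones trained.
```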
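And here is how I approximate the schedule together with the weight-decay exclusion, continuing from the sketch above. It is a sketch built on torch.optim.lr_scheduler.OneCycleLR, which has the same warmup/anneal shape; the base_lr, weight decay value and step counts are hypothetical placeholders, and the exact fastai curves may differ slightly:

```python
import torch

base_lr = 1e-3                     # hypothetical base learning rate
epochs, steps_per_epoch = 5, 100   # hypothetical step counts

# Exclude biases and normalization parameters from weight decay
decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if param.ndim == 1:  # biases and BatchNorm/LayerNorm weights are 1-D
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.Adam(
    [
        {"params": decay, "weight_decay": 1e-2},   # hypothetical wd value
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=base_lr,
)

# Warm up for 25% of the steps from base_lr/25 to base_lr, then cosine-anneal
# down to base_lr/1e5; momentum (Adam's beta1) goes 0.95 -> 0.85 -> 0.95.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=base_lr,
    total_steps=epochs * steps_per_epoch,
    pct_start=0.25,
    div_factor=25.0,               # initial_lr = max_lr / 25
    final_div_factor=1e5 / 25.0,   # final_lr = initial_lr / 4000 = max_lr / 1e5
    base_momentum=0.85,
    max_momentum=0.95,
    anneal_strategy="cos",
)
# Call scheduler.step() after every optimizer.step().
```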
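The resize-with-padding step I rebuild in torchvision roughly like this (a sketch with a hypothetical ResizeWithPadding helper: it scales the longer side to the target size and pads the rest instead of cropping):

```python
from torchvision import transforms
from torchvision.transforms import functional as TF

class ResizeWithPadding:
    """Resize so the longer side equals `size`, then pad to a square."""
    def __init__(self, size, fill=0, padding_mode="constant"):
        self.size = size
        self.fill = fill
        self.padding_mode = padding_mode

    def __call__(self, img):
        w, h = img.size                       # PIL image: (width, height)
        scale = self.size / max(w, h)
        new_w, new_h = round(w * scale), round(h * scale)
        img = TF.resize(img, (new_h, new_w))
        pad_left = (self.size - new_w) // 2
        pad_top = (self.size - new_h) // 2
        pad_right = self.size - new_w - pad_left
        pad_bottom = self.size - new_h - pad_top
        return TF.pad(img, (pad_left, pad_top, pad_right, pad_bottom),
                      fill=self.fill, padding_mode=self.padding_mode)

transform = transforms.Compose([
    ResizeWithPadding(224),   # hypothetical target size
    transforms.ToTensor(),
])
```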
I’ve reproduced the same techniques in the plain PyTorch script, and still the performance of FastAI is much better.
My script:
Epoch Loss Accuracy
0 1.3788 0.5423
1 1.3375 0.5689
2 1.3235 0.5468
So the FastAI script climbs the metrics from the very first epoch, while my script stagnates and even regresses. Of course, I’m using the same data, the same batch size, no augmentations in either script, etc.
What trick am I missing? Is it a special optimizer? I’ve checked, and it seems to use plain Adam.
I’m completely puzzled and would appreciate it if someone could shed some light on this difference:
To reproduce, unpack competition_.zip into the “data/” folder, run uv sync (uv), and then python fast_ai.py for the FastAI script and python plain_pytorch.py for the plain PyTorch version.
My best guess so far is that FastAI also unfreezes one or two layers in the pretrained model’s encoder (even though, based on an examination of the sources, the whole body is frozen). Does anyone know for sure how much of the pretrained “body” is frozen?
So it seems FastAI always unfreezes the normalization parameters! In this case that means the BatchNorm layers in the pretrained model. And this makes perfect sense: BatchNorm accumulates statistics of the data flowing through it and normalizes activations to zero mean / unit variance. For different input data these statistics will be different, so they should be re-learned on the new data.
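In the plain PyTorch script this translates to something like the following sketch: freeze the body as before, then re-enable gradients for every BatchNorm layer so its affine parameters are trained while the running statistics adapt to the new data.

```python
import torch.nn as nn

# The body was frozen earlier; now re-enable training for every
# normalization layer so its affine parameters are learned.
for module in model.modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        for param in module.parameters():
            param.requires_grad = True

# Note: the running mean/variance update whenever the module is in train()
# mode, regardless of requires_grad, so keep these layers in train mode.
```

The newly unfrozen parameters also have to be included in the optimizer (in the no-weight-decay group), otherwise they never actually get updated.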
This is a big revelation for me; I was blindly freezing the whole pretrained model previously. Now running an experiment.
Actually, I’m working through the Fast.ai course and am on lesson 18, which covers optimizers. The transfer learning bits come next, so I was thinking of trying to debug this to understand the differences, but there you go, you’ve probably figured it out. I’m bookmarking your post for future reference.
With the update from above the numbers improved, but they were still lagging behind. The final trick was to use the padding mode “fill with zeros” instead of “reflection”.
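For anyone mapping this back to the torchvision-style transform: with the hypothetical ResizeWithPadding helper from the sketch above, the difference is just the fill / padding_mode arguments.

```python
# Zero padding ("fill with zeros") - the variant that finally closed the gap for me
resize_zero = ResizeWithPadding(224, fill=0, padding_mode="constant")

# Reflection padding, for comparison
resize_reflect = ResizeWithPadding(224, padding_mode="reflect")
```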