FastAI results not reproducible (image classification)

I’m trying to reproduce the results I get with FastAI in a plain PyTorch script and cannot achieve the same numbers. FastAI performs significantly better, so I suspect there’s some trick it uses, and I’d like to learn that trick.

What I’ve learned so far:

  1. FastAI cuts the default “head” (classifier) off the pretrained model and replaces it with a custom “head”. It also freezes the “body” and trains only the custom head.
  2. It uses a cosine schedule for learning rate and momentum, splitting training into a warmup phase (the first 25% of steps: learning rate rises from base_lr / 25 to base_lr, momentum falls from 0.95 to 0.85) and a regular phase (learning rate anneals from base_lr down to base_lr / 1e5, momentum returns from 0.85 to 0.95).
  3. It disables weight decay for bias and normalization parameters.
  4. It does not use the default image preprocessor (which performs a crop) and instead resizes the image by its larger side, padding the rest.

I’ve reproduced the same techniques in my plain PyTorch script, and still FastAI’s performance is much better.

Typical numbers are:

FastAI:

epoch  loss      accuracy
0      1.132512  0.612162
1      0.926101  0.693694
2      0.895175  0.727928

My script:

epoch  loss      accuracy
0      1.3788    0.5423
1      1.3375    0.5689
2      1.3235    0.5468

So the FastAI script climbs the metrics from the very first epoch, while my script stagnates and even regresses. Of course I’m using the same data, the same batch size, and no augmentations in either script.

What trick am I missing? Is it a special optimizer? I’ve checked, and it seems to use plain Adam.

I’m completely puzzled; I would appreciate it if someone could shed some light on this difference.

To reproduce: unpack competition_.zip into the “data/” folder, run uv sync, then python fast_ai.py for the FastAI script and python plain_pytorch.py for the plain PyTorch version.


Requested access to the dataset on Google Drive.

Updated the permissions.

My best guess so far is that FastAI also unfreezes one or two layers in the pretrained model’s encoder (even though, based on my examination of the sources, the whole body is frozen). Does anyone know for sure how much of the pretrained “body” is frozen?


So it seems FastAI always keeps the normalization parameters unfrozen! In this case that means the BatchNorm layers in the pretrained model. And this makes perfect sense: BatchNorm accumulates running statistics of the “data flow” and normalizes activations toward 0 mean / 1 variance. For different input data these statistics will differ, so they should be re-learned on the new data.

This is a big revelation for me; I was blindly freezing the whole pretrained model before. Now running an experiment.


Actually, I’m working through the Fast.ai course and am on the 18th lesson, which covers optimizers. The transfer learning bits come next, so I was thinking of trying to debug this to understand the differences, but there you go, you’ve probably figured it out. I’m bookmarking your post for future reference.

Good luck with the course!

With the update from above the numbers improved, but they were still lagging behind. The remaining trick was to use padding mode “fill with zeros” instead of “reflection”.

Finally the metrics are comparable:

FastAI:

epoch     train_loss  valid_loss  accuracy
0         1.138676    1.132512    0.612162
1         0.746413    0.926101    0.693694
2         0.542063    0.895175    0.727928
3         0.427984    0.823914    0.739189

Plain PyTorch:

epoch     train_loss  valid_loss  accuracy
0         1.0682      1.0205      0.6613
1         0.5619      0.9222      0.6959
2         0.4311      0.8333      0.7333
3         0.3411      0.9344      0.7117

Updated the GitHub repo in case anyone is interested.


Hey @fzngagan, I’m on that lesson too and noticed similar inconsistencies. Crazy how small tweaks change results so much. Thanks for sharing!


Btw, it’s @canonic_epicure who figured it out.