Finetuning Transformers thread

Continuation of a discussion started on the Discord #chitchat channel here.

To summarize: the question was why the gradual unfreezing approach doesn’t help when finetuning pretrained Transformer-based language models. This topic aims to address that question and, more broadly, to compile some best practices for the task.
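For anyone new to the term: gradual unfreezing, as introduced with ULMFiT, trains only the last layer group first and then unfreezes one more group per epoch, working back towards the embeddings. Below is a minimal sketch of that schedule in plain Python. The function name `trainable_groups` and the choice of four layer groups are hypothetical, picked only for illustration; how a particular Transformer is split into groups depends on the model and the library.

```python
def trainable_groups(n_groups, epoch):
    """Return the indices of layer groups that are trainable at `epoch` (0-based).

    Group 0 is closest to the input (embeddings) and group n_groups - 1 is the
    head. One extra group is unfrozen per epoch, starting from the head.
    """
    if n_groups < 1 or epoch < 0:
        raise ValueError("n_groups must be >= 1 and epoch must be >= 0")
    n_unfrozen = min(epoch + 1, n_groups)
    return list(range(n_groups - n_unfrozen, n_groups))


if __name__ == "__main__":
    # Hypothetical split into 4 groups, e.g. embeddings, lower layers,
    # upper layers, classification head.
    for epoch in range(5):
        print(f"epoch {epoch}: trainable groups {trainable_groups(4, epoch)}")
```

In a real training loop, the groups not returned for the current epoch would have their parameters frozen (no gradient updates) before that epoch starts.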

Here is a link to a dedicated Weights & Biases project kindly set up by @morgan. Training results for GLUE benchmark tasks can be logged there to facilitate further analysis.

I’ve set up a starter notebook for training on GLUE tasks. Training on a different task or model should be as easy as changing the task or model_name string, respectively. Even if you don’t feel you know much about the subject, you can still contribute to this investigation and learn some NLP along the way!
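To give a feel for what switching the task involves, here is a rough sketch of the per-task details that usually change. It is not taken from the starter notebook: `GlueTask`, `GLUE_TASKS`, and `experiment_config` are hypothetical names. The text column names follow the GLUE datasets as distributed through the Hugging Face `datasets` library, and the metrics listed are the ones customarily reported for each task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlueTask:
    text_fields: tuple  # names of the input text columns
    num_labels: int     # 1 means regression (STS-B)
    metric: str         # metric customarily reported for the task


# Hypothetical lookup table; not the notebook's actual code.
GLUE_TASKS = {
    "cola": GlueTask(("sentence",), 2, "matthews_correlation"),
    "sst2": GlueTask(("sentence",), 2, "accuracy"),
    "mrpc": GlueTask(("sentence1", "sentence2"), 2, "f1/accuracy"),
    "qqp": GlueTask(("question1", "question2"), 2, "f1/accuracy"),
    "stsb": GlueTask(("sentence1", "sentence2"), 1, "pearson/spearman"),
    "mnli": GlueTask(("premise", "hypothesis"), 3, "accuracy"),
    "qnli": GlueTask(("question", "sentence"), 2, "accuracy"),
    "rte": GlueTask(("sentence1", "sentence2"), 2, "accuracy"),
    "wnli": GlueTask(("sentence1", "sentence2"), 2, "accuracy"),
}


def experiment_config(model_name, task):
    """Bundle the two user-facing strings with the task-specific settings."""
    if task not in GLUE_TASKS:
        raise ValueError(f"unknown GLUE task: {task!r}")
    spec = GLUE_TASKS[task]
    return {
        "model_name": model_name,
        "task": task,
        "text_fields": spec.text_fields,
        "num_labels": spec.num_labels,
        "is_regression": spec.num_labels == 1,
        "metric": spec.metric,
    }


if __name__ == "__main__":
    # "roberta-base" is only an example checkpoint name.
    print(experiment_config("roberta-base", "mrpc"))
```

Keeping these details in one table means the rest of the training code only needs the two strings, which is what makes swapping tasks or models cheap.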


Comparison of suggested LRs for different methods, as proposed by @muellerzr (source here):

Quick report: the suggested LRs are not very consistent between calls. If I haven’t messed something up, the overall observations are: valley and minimum seem reasonable, slide recommendations are too high, and steep is the most inconsistent between runs. It would also make sense to do a similar comparison for other tasks and to consider how the results relate to dataset size, etc.

A notebook for rerunning it.
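If you rerun the comparison and want to put a number on "inconsistent between runs", one option is to look at the spread of the suggestions in log10 space, since learning rates vary over orders of magnitude. The sketch below does only that arithmetic. `log10_spread` and `rank_by_consistency` are hypothetical helper names, and the values under `__main__` are made-up placeholders, not results from the report.

```python
import math
import statistics


def log10_spread(suggestions):
    """Return (geometric mean, std of log10 values) for positive LR suggestions."""
    logs = [math.log10(lr) for lr in suggestions]
    center = 10 ** statistics.fmean(logs)
    spread = statistics.pstdev(logs)  # 1.0 means about one order of magnitude
    return center, spread


def rank_by_consistency(runs):
    """Sort methods from most to least consistent (smallest log10 spread first)."""
    stats = {name: log10_spread(lrs) for name, lrs in runs.items()}
    return sorted(stats.items(), key=lambda item: item[1][1])


if __name__ == "__main__":
    # Placeholder numbers for illustration only; replace with the LRs
    # collected from repeated LR finder calls for each suggestion method.
    runs = {
        "method_a": [1e-3, 2e-3, 1.5e-3],
        "method_b": [1e-5, 1e-2, 3e-4],
    }
    for name, (center, spread) in rank_by_consistency(runs):
        print(f"{name}: geometric mean {center:.2e}, log10 std {spread:.2f}")
```

Using the geometric mean rather than the arithmetic mean keeps a single very large suggestion from dominating the summary.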


To continue on the subject, I wrote a blog post that adds some details on the GLUE tasks and introduces one of the task-specific tricks commonly used when reporting GLUE results: Finetuning Transformers on GLUE benchmark | thoughtsamples.
I hope this makes getting started with LM finetuning experiments even smoother for those interested.