There’s a new SOTA architecture for NLP out as of yesterday from Google/Toyota: ALBERT, a new ‘Lite BERT’ with a massive reduction in parameter count (18x…).
It features two changes to achieve that kind of reduction (factorized embedding parameters and cross-layer parameter sharing), along with an improved training process.
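To get a feel for why those two changes shrink the model so much, here is a rough back-of-the-envelope sketch in plain Python. It is only an illustration, not ALBERT's actual code (none has been released): the helper names are made up, the sizes (30,000-token vocabulary, hidden size 768, embedding size 128, 12 layers) are assumed for the example, and the per-layer count of roughly 12·H² ignores biases and layer norm.

```python
# Rough parameter-count arithmetic for the two ideas; not the ALBERT implementation.
# All sizes below are assumed, illustrative values.

def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Embedding parameters: one V x H table, or factorized as V x E plus E x H."""
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size


def layer_params(hidden_size):
    """Approximate weights in one transformer layer (biases and layer norm ignored).

    Attention uses 4 H x H matrices (Q, K, V, output); the feed-forward block
    uses H x 4H and 4H x H, so the total is about 12 * H^2.
    """
    return 4 * hidden_size ** 2 + 2 * hidden_size * (4 * hidden_size)


def encoder_params(hidden_size, num_layers, share_across_layers=False):
    """Encoder parameters: with sharing, every layer reuses a single set of weights."""
    per_layer = layer_params(hidden_size)
    return per_layer if share_across_layers else per_layer * num_layers


V, H, E, LAYERS = 30_000, 768, 128, 12  # assumed sizes for illustration

full_embed = embedding_params(V, H)
factored_embed = embedding_params(V, H, E)
unshared = encoder_params(H, LAYERS)
shared = encoder_params(H, LAYERS, share_across_layers=True)

print(f"embeddings: {full_embed:,} -> {factored_embed:,}")
print(f"encoder:    {unshared:,} -> {shared:,}")
```

With these assumed sizes the factorized embedding is several times smaller than the full table, and sharing divides the encoder weights by the number of layers. Note that sharing cuts stored parameters only; each layer still runs in the forward pass, so compute per token is unchanged.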
Unfortunately no code has been released yet but hopefully soon.
I wrote an article on ALBERT here:
and the ArXiv paper is here:
After the trend of larger and larger models (including the massive Megatron, trained on 512 GPUs), it’s exciting to see new architecture improvements deliver higher accuracy with far fewer parameters.
Hopefully the source is released soon and we can port it into the FastAI framework.
Great job @LessW2020. Thank you for writing and sharing well-written articles: clear and concise. I really like both your Medium articles and the topics that you initiated, like the ones about Mish, Ranger, Progressive Sprinkles, and many others.