Meet ALBERT: new Lite Bert, SOTA NLP and 18x fewer params

Hi all,
There’s a new SOTA architecture for NLP out as of yesterday from Google/Toyota - ALBERT, a new ‘Lite Bert’ with a massive reduction in parameter count (18x…).
It features two changes to achieve that kind of reduction (factorized embedding parameterization and cross-layer parameter sharing) along with an improved training process.
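For a rough sense of where the savings come from, here's a back-of-the-envelope parameter count. The sizes below are illustrative (loosely ALBERT-xxlarge-like), and the 12·H² per-layer figure is a common rough estimate for a transformer layer, not an exact count:

```python
# Back-of-the-envelope parameter counts for ALBERT's two tricks.
# Sizes are illustrative (roughly ALBERT-xxlarge-like), not exact.

V = 30_000  # vocabulary size (BERT-style WordPiece vocab)
H = 4_096   # hidden size
E = 128     # small embedding size used by the factorization
L = 12      # number of transformer layers

# 1) Factorized embedding parameterization:
#    one big V x H matrix becomes V x E followed by E x H (E << H).
bert_embed = V * H
albert_embed = V * E + E * H
print(f"embeddings: {bert_embed:,} -> {albert_embed:,} "
      f"({bert_embed / albert_embed:.1f}x smaller)")

# 2) Cross-layer parameter sharing:
#    one set of layer weights reused L times instead of L separate sets.
#    ~12 * H^2 is a rough estimate of one layer's attention + FFN params.
per_layer = 12 * H * H
no_sharing = L * per_layer
with_sharing = per_layer
print(f"layers: {no_sharing:,} -> {with_sharing:,} ({L}x smaller)")
```

Even this crude arithmetic shows the embedding factorization alone shrinking the embedding table by well over an order of magnitude once H gets large.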
Unfortunately no code has been released yet but hopefully soon.
I wrote an article on ALBERT here:

and the ArXiv paper is here:

After the trend of larger and larger models (including the massive Megatron, trained on 512 GPUs) it’s exciting to see new architecture improvements deliver higher accuracy with far fewer parameters.

Hopefully the source is released soon and we can port it into the FastAI framework.
Best regards,


Great job @LessW2020. Thank you for writing and sharing well-written articles: clear and concise. I really like both your Medium articles and the topics that you initiated, like the ones about Mish, Ranger, Progressive Sprinkles, and many others.

Have a great weekend


Thanks @farid - greatly appreciate the support and feedback!
Have a great weekend!


You’re welcome.

Looking forward to learning more about ALBERT in this forum :slightly_smiling_face:


For those interested in learning more about BERT, check out these 2 good resources that have nice explanations:

Video: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Blog post: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)

For those who are not familiar with the "Attention Is All You Need" article, 2 good references (from the same authors) here below:
Video: Attention Is All You Need
Blog post: The Illustrated Transformer