Focal loss for a language model

Has anyone had success using focal loss to train a language model? With a large vocab, it seems like it could help with predicting the less-frequent words in the language, since it down-weights the tokens the model already predicts confidently. If you have tried this out, please let me know. I am playing with it now, and will post results here (positive or negative) so that they can be tracked.
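
For anyone following along, here is a minimal sketch of the per-token loss I have in mind, assuming the usual focal loss form FL = -(1 - p_t)^gamma * log(p_t), where p_t is the softmax probability of the correct next word. The function names and the gamma=2.0 default are just my illustrative choices, not taken from any particular library, and a real model would compute this in a tensor framework rather than plain Python.

```python
import math

def focal_loss(logits, target, gamma=2.0):
    """Focal loss for a single token: -(1 - p_t)**gamma * log(p_t).

    logits: list of unnormalized scores, one per vocab entry.
    target: index of the correct next word.
    gamma:  focusing parameter (assumed default; gamma=0 gives cross-entropy).
    """
    # Numerically stable log-softmax for the target entry.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_pt = logits[target] - log_z
    pt = math.exp(log_pt)
    # The (1 - p_t)**gamma factor shrinks the loss on confident predictions.
    return -((1.0 - pt) ** gamma) * log_pt

def cross_entropy(logits, target):
    """Plain cross-entropy, i.e. focal loss with gamma = 0."""
    return focal_loss(logits, target, gamma=0.0)

if __name__ == "__main__":
    # Toy 4-word vocab: index 0 plays a frequent word the model is sure
    # about, index 3 a rare word it gets mostly wrong.
    confident = [5.0, 0.0, 0.0, 0.0]
    unsure = [2.0, 0.0, 0.0, 0.5]
    for name, logits, tgt in [("easy", confident, 0), ("hard", unsure, 3)]:
        print(name, cross_entropy(logits, tgt), focal_loss(logits, tgt))
```

The idea is that the easy token's loss gets scaled down far more than the hard token's, so the rare, poorly predicted words should make up a larger share of the gradient. Whether that actually helps perplexity is exactly what I am trying to find out.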