I’m reading through the paper (which imho isn’t as clear as the “Attention Is All You Need” one) and I’m a bit confused about what use case(s) Transformer XL was designed to solve (besides language modeling, if any).
So …
Is Transformer XL solely about Language Modeling?
On what tasks would the standard Transformer architecture perform better than Transformer XL, and vice versa?
I’m afraid no one knows the answers to those questions yet.
Transformer XL is better at language modeling, so it should also be better on downstream tasks, but so far it has been somewhat ignored in recent articles.
Yeah, that was my take on it after getting through the paper.
How is the fastai implementation working out? I remember reading through some posts back in the day saying that ULMFiT was still beating it and that the results to date weren’t all that great.
Hmmm … yeah, that’s interesting, given the premise that a better LM should produce a better classifier, since it has a better grasp of the patterns in a given document.
Are you all training it the same way you train ULMFiT?
Is the performance bad all around, or does it get worse after you unfreeze?
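Just so we’re talking about the same thing, here’s roughly the ULMFiT-style recipe I have in mind, as a minimal sketch assuming the fastai v1 text API (where `language_model_learner` / `text_classifier_learner` take an arch argument, so `TransformerXL` can be swapped in for `AWD_LSTM`). The paths, DataBunch files and hyperparameters are just placeholders, not what you actually used:

```python
from fastai.text import *

path = Path('data')  # placeholder project folder

# Hypothetical DataBunches built elsewhere with the default fastai tokenizer:
# data_lm for language modeling, data_clas for the downstream classifier.
data_lm = load_data(path, 'data_lm.pkl')
data_clas = load_data(path, 'data_clas.pkl', bs=32)

# 1. Fine-tune a Transformer-XL language model (pretrained=False here,
#    since I'm not sure pretrained TransformerXL weights ship with fastai).
learn_lm = language_model_learner(data_lm, TransformerXL, drop_mult=0.3, pretrained=False)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(5, 1e-3)
learn_lm.save_encoder('txl_enc')

# 2. Reuse the encoder for classification with ULMFiT-style gradual unfreezing.
learn_clas = text_classifier_learner(data_clas, TransformerXL, drop_mult=0.5, pretrained=False)
learn_clas.load_encoder('txl_enc')
learn_clas.fit_one_cycle(1, 2e-2)                          # head only
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(1e-2 / 2.6**4, 1e-2))
learn_clas.unfreeze()                                      # the "after you unfreeze" step
learn_clas.fit_one_cycle(2, slice(1e-3 / 2.6**4, 1e-3))
```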
Have you all tested it with something like SentencePiece vs. the default fastai tokenizer?
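By SentencePiece I mean subword tokenization along these lines — a rough sketch using the `sentencepiece` library directly (the file names, vocab size, and example sentence are just placeholders, and I’m leaving out the wiring into the fastai data pipeline):

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on the raw training text
# ('train.txt', the model prefix, and vocab size are placeholders).
spm.SentencePieceTrainer.Train(
    '--input=train.txt --model_prefix=sp --vocab_size=8000 --model_type=unigram'
)

# Tokenize with the learned subword vocabulary.
sp = spm.SentencePieceProcessor()
sp.Load('sp.model')

pieces = sp.EncodeAsPieces('The movie was surprisingly good.')
ids = sp.EncodeAsIds('The movie was surprisingly good.')
print(pieces)  # e.g. ['▁The', '▁movie', '▁was', '▁surprising', 'ly', '▁good', '.']
print(ids)
```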