Training TransformerXL

I haven’t tried. Do you have experience with it?

@sgugger tried but didn’t get great results yet.

I am researching perplexity/loss versus sentence length for language models based on AWD-LSTM and TransformerXL. I may have an idea for improving the accuracy. It is so simple that I might make a fool of myself, so let me verify it on the IMDb data and report back in a couple of days.

Meanwhile, @sgugger, could you share your experience with hyperparameters for the transfer learning? It would also help to know how long you trained the TransformerXL and on which GPU model.

I also don’t have great results yet. I trained a transformerXL language model with a 60K vocab on wikitext103 to validation loss 2.799 (perplexity 16.4); it generates interesting text. The language model didn’t finetune well to imdb on my first try (loss 3.7), and the subsequent classifier gave 93.2% accuracy, way below the accuracy of the AWD LSTM. The training of the unfrozen classifier model felt unstable (I had to go to small lr and the loss was jumping between 0.2 and 0.35). I probably still need to figure out something simple.
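For anyone checking the numbers above: perplexity is the exponential of the loss, assuming the loss is the mean per-token cross-entropy in nats (which is my assumption about how it was reported here). A minimal sketch, with `perplexity` as a hypothetical helper name:

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(loss)

# The wikitext103 validation loss quoted in the post above.
wikitext_ppl = perplexity(2.799)
print(f"loss 2.799 -> perplexity {wikitext_ppl:.1f}")

# The same conversion applies to the loss after the first IMDb fine-tuning try.
print(f"loss 3.7 -> perplexity {perplexity(3.7):.1f}")
```

The first value rounds to the 16.4 reported in the post.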


Oh, as I see Jeremy’s lesson I realize that maybe I could try gradient accumulation to deal with the stability of the optimizer. The model was big and I ended up using very small batch sizes, so maybe that could help a bit…
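To make the gradient accumulation idea concrete, here is a toy sketch in plain Python (a one-parameter linear model, not the fastai or PyTorch API). It only illustrates the principle: summing gradients over several small batches and then taking one optimizer step gives the same update as one large batch. All function names are made up for the illustration.

```python
def grad_sum(w, batch):
    """Sum over the batch of d/dw (w*x - y)**2, i.e. the un-averaged gradient."""
    return sum(2 * (w * x - y) * x for x, y in batch)

def sgd_step(w, batch, lr):
    """One plain SGD step on the mean squared error of a single large batch."""
    return w - lr * grad_sum(w, batch) / len(batch)

def sgd_step_accumulated(w, micro_batches, lr):
    """Accumulate gradients over several small batches, then step once."""
    total, n = 0.0, 0
    for mb in micro_batches:
        total += grad_sum(w, mb)  # keep adding gradients, no weight update yet
        n += len(mb)
    return w - lr * total / n     # single update with the averaged gradient

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_big = sgd_step(0.0, data, lr=0.01)
w_acc = sgd_step_accumulated(0.0, [data[:2], data[2:]], lr=0.01)
print(w_big, w_acc)  # the two updates agree up to floating-point rounding
```

In a real training loop the same effect usually comes from running the backward pass on several small batches before a single optimizer step, so the memory cost stays that of the small batch.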

Yes, fastai.text fails in the last step due to lack of GPU memory. So strange. I suspect that the memory issue comes from the SortishSampler and pad_collate.

@Kaspar, can you sample the model and maybe show an output or two? What kind of datasets are you training it on? Were you able to replicate the results of the original paper?

Well, that's actually not my focus now. I only have a 1080 Ti, so training TransformerXL to full convergence will take at least a week with the current implementation in fastai (i.e. the GPU is never above 2% but the CPU is at 16%).

My focus is analysing TransformerXL’s and AWD-LSTM’s perplexity/accuracy vs. sentence length for training the language model and using it for IMDb sentiment analysis.


@Kaspar, with the latest fastai I get full GPU utilization (a 2080 Ti or better takes less than a day for 10 epochs). I use 150 bptt, 300 mem_len, and a 60K spaCy vocabulary, as per Sylvain’s suggestion here.

Here is a silly sample following a silly prompt:


= Paul Maria von Weber =

Paul Maria von Weber was a famous concert piano player and composer . Following the enthusiastic reception of the audience after a performance in Carnegie Hall , Paul and family settled in New York City where they lived the rest of their lives . Paul 's childhood travels through Europe influenced the music style and food preferences that came to be known as the von Weber style .

= = Early life = =


Paul Ludwig von Weber was born in the neighborhood of Cologne on February 28 , 1894 , the son of Otto von Weber , a physician and chair of the later branch of the Another Circles family , and his wife Sophie . After his father is dead while in Cologne , Paul and his grandmother moved frequently , under the influence of Ludwig der Weber , but fell in love with Larry von Weber , who had studied piano at the same time . Later , in 1906 , the couple moved to Hamburg when Paul and von Weber were involved in a coup d ’ état .
Paul 's father , Sophie Inflammation Paul ( she was born on March 27 , 1897 ) , was a columbia professor in medicine at the former University of Cologne , and later was an assistant professor in medical at the Gibbs Institute of Medical Fibrous Medicine of the University of Dresden . Paul and von Weber had two younger brothers : Ludwig and Clancy . Over a decade later , Gottlieb Everett von Weber ( 1865 – 1918 ) and Eduard Christoph Philip von Weber ( 1858 – 1920 ) moved from Hamburg to Cologne as they moved to Munich .
Paul o ’ Donnell was first a surgeon in wedge - shaped tube over a geiger counter and rose to the position of the local chief who entertained Paul and Prince Ludwig by calling him " Brother Wilhelm " . Paul came across this group , from a Bolton social club , and became a member of the committee of the Berlin Municipal Hospital ( 1919 – 1920 ) . Paul de Boer of Kassel became a member of the committee ; he went to Pursued Medical Studies and Elective Studies at the University of Vienna and the University of Berlin at Munich . a more serious child acquaintance of Paul , Exacerbated , it is reported that Paul followed his father 's advice to Franz , who opened his own business as a physician as it was he , to whom he made demands . Paul quickly became involved in the struggle for medical investment in Munich . In October 1919 , he married Peter of Championed Medical Education in a large ceremony at the Passion Cathedral . In August 1920 , Paul rented a cottage in Munich but soon took ill and died of edema . a historians And Daniels commented on the tragic events being loved and abroad , stating , " The wooden tree of death seems to have tossed an end to these life - threatening – and idyllic — atrocities . "

= = Historiography and influence = =

Paul considered the work the " most 1924 seminal work of art in European history . " At a time when he was kicked out of the army in 1926 , the works of Lewis R. Max , which transitioned Paul into a master works , both helped to create the new form of perfect art . The book earned him 1941 Nobel Prize in Philosophy .
The historian Martin Cherry writes that solos often struck most notable individuals within the center circle , arguably endeared themselves to writers and composers of the day . Pins , fusing the solos of his followers and those from other first - dimension subjects , would emit a visual energy , and otherwise were especially important later . Overlapping strands of music can be found in works he occasionally created . Paul also wrote There Is a Bet , a collection of birthday cards that offered the musicians time to answer questions . His works are fragments of what they would have weighed in .
In 1942 Paul co - wrote a number of works , including those of an Austrian art based on the Austrian Romantic masters . In the lyrics to his Seventh Symphony , he wrote , " If i have


Impressive text. The link points to a QRNN. Is that what you are running?

This text was from my best trained TransformerXL model. I used the link only to get a tokenized version of the library.


Super, I will try your parameter settings. What do you use for bs (batch size)?

bs was 64. Per Sylvain’s advice, I had 0.1 in the following config parameters: output_p, embed_p, ff_p, resid_p; and for learner: clip 0.1, alpha 0, and beta 2. Training was one cycle of 10 epochs with div_factor=5, and then two additional, smaller cycles of 4 epochs each. It’s probably worth trying a single longer cycle.

I can share the exact snippet offline later tonight if the above info is not enough.
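Until the exact snippet is posted, the settings mentioned in this thread can be collected in one place. This is only a plain-Python summary: the grouping into `data`, `config`, `learner` and `schedule` is my own and is not a fastai structure, and no fastai calls are shown because the original code was not posted.

```python
# Settings as reported in this thread; the grouping is hypothetical.
reported_settings = {
    "data": {"bs": 64, "bptt": 150, "mem_len": 300, "vocab_size": 60_000},
    "config": {"output_p": 0.1, "embed_p": 0.1, "ff_p": 0.1, "resid_p": 0.1},
    "learner": {"clip": 0.1, "alpha": 0, "beta": 2},
    # One cycle of 10 epochs with div_factor=5, then two smaller 4-epoch cycles.
    "schedule": [
        {"epochs": 10, "div_factor": 5},
        {"epochs": 4},
        {"epochs": 4},
    ],
}

total_epochs = sum(cycle["epochs"] for cycle in reported_settings["schedule"])
print(f"total epochs across all cycles: {total_epochs}")
```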

Thanks. I tried your settings for mem_len and bptt but got a GPU out-of-memory error.

So instead I scale down by using a configuration with:

  • d_inner = 1024 instead of 2100
  • d_model = 128 instead of 410. My vocab is only 4K, so a smaller embedding layer should be OK
  • I also reduce the number of tokens to 1e8

This reduces the time for an epoch by a factor of 4 and will hopefully make it easier to experiment with IMDb.
Do you have a link to Sylvain’s advice?
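A rough back-of-the-envelope check of how much the scaled-down configuration shrinks the model. This sketch counts weights only (no biases, no attention layers), treats "4K" as exactly 4000 tokens, and compares against the 60K vocab mentioned earlier in the thread; the helper names are hypothetical.

```python
def embedding_weights(vocab_size: int, d_model: int) -> int:
    """Weights in the token embedding matrix (vocab_size x d_model)."""
    return vocab_size * d_model

def feedforward_weights(d_model: int, d_inner: int) -> int:
    """Weights in one position-wise feed-forward block (two linear layers)."""
    return 2 * d_model * d_inner

small_emb = embedding_weights(4_000, 128)    # scaled-down setup
large_emb = embedding_weights(60_000, 410)   # 60K vocab with d_model = 410
small_ff = feedforward_weights(128, 1024)
large_ff = feedforward_weights(410, 2100)

print(f"embedding:    {small_emb:,} vs {large_emb:,} weights")
print(f"feed-forward: {small_ff:,} vs {large_ff:,} weights per layer")
```

Parameter count is not the same thing as time per epoch or activation memory, so treat this only as a sanity check on the relative model sizes.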

With regards to memory, I think I also had a very fine model with mem_len 150 and bs 32. I don’t have a link to Sylvain’s advice (I met him at our NYC meetup and he sent a private message); I will double check from home late tonight and update this thread in case there was anything else.


I am getting around to trying this myself for a different NLP problem. Your note above:

Seems to have the momentum span backward. Is that intentional? Usually the first number is bigger than the second, and the momentum goes the opposite way from the LR ramp. Not sure if that could be a source of trouble with training for you, but wanted to point it out since I noticed it just now.


Can someone explain what the pct_start hyperparameter does?

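pct_start tells you where in the training cycle the LR peaks; the momentum moves the opposite way and reaches its lowest value at the same point.
The default is 0.3, i.e. the peak comes 30% of the way through the cycle.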
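Here is a small plain-Python sketch of a one-cycle schedule to show what pct_start controls. It is not the fastai implementation: the cosine shape, the `final_div` value and the function names are my assumptions, and `div_factor=5` is simply the value mentioned earlier in this thread.

```python
import math

def cos_interp(start, end, frac):
    """Cosine interpolation from start (frac=0) to end (frac=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * frac))

def one_cycle(step, total_steps, max_lr=1e-3, div_factor=5.0,
              final_div=1e4, pct_start=0.3, moms=(0.95, 0.85)):
    """Return (lr, momentum) for a given step of a single cycle."""
    warm_steps = int(round(total_steps * pct_start))
    if step < warm_steps:
        # Phase 1: LR ramps up to max_lr while momentum goes down.
        frac = step / warm_steps
        lr = cos_interp(max_lr / div_factor, max_lr, frac)
        mom = cos_interp(moms[0], moms[1], frac)
    else:
        # Phase 2: LR anneals towards a tiny value while momentum goes back up.
        frac = (step - warm_steps) / (total_steps - warm_steps)
        lr = cos_interp(max_lr, max_lr / (div_factor * final_div), frac)
        mom = cos_interp(moms[1], moms[0], frac)
    return lr, mom

for s in (0, 15, 30, 60, 99):
    lr, mom = one_cycle(s, total_steps=100)
    print(f"step {s:3d}: lr={lr:.2e}  mom={mom:.3f}")
```

With `pct_start=0.3` and 100 steps, the LR is highest and the momentum lowest at step 30. This is also why the momentum pair is usually written with the larger number first.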