Training TransformerXL

Here is the grid search for pct_start, using:

  • learn.fit_one_cycle(cyc_len=epochs, max_lr=5e-4, moms=(0.75, 0.85), wd=0, pct_start=g.pct_start)
  • bptt=70 (I had to reduce it from 94 due to a change in fastai)
  • number of training tokens = 2.5e6


If we look at the loss after the first epoch then the difference is more pronounced:
This clearly shows that a high “pct_start” hurts training.

So far, pct_start=0.02, found in the previous post, remains the best value.
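To make the effect of pct_start concrete, here is a minimal sketch of a two-phase one-cycle learning-rate schedule in plain Python. It is not the fastai source: the cosine annealing, div_factor=25 and the final learning rate of max_lr/(div_factor*1e4) are my assumptions about how fit_one_cycle behaves, and one_cycle_lr is a hypothetical helper name.

```python
import math

def annealing_cos(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    cos_out = math.cos(math.pi * pct) + 1
    return end + (start - end) / 2 * cos_out

def one_cycle_lr(step, total_steps, max_lr=5e-4, pct_start=0.02, div_factor=25.0):
    """Learning rate at a 0-based step of a warm-up + decay schedule.

    Assumed shape: warm up from max_lr/div_factor to max_lr during the first
    pct_start fraction of the steps, then decay to a very small final value.
    """
    warmup_steps = max(1, int(total_steps * pct_start))
    low_lr = max_lr / div_factor
    if step < warmup_steps:
        return annealing_cos(low_lr, max_lr, step / warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    final_lr = low_lr / 1e4  # assumed final value, not taken from the thread
    return annealing_cos(max_lr, final_lr, (step - warmup_steps) / decay_steps)

if __name__ == "__main__":
    # Compare how long the model spends below the peak learning rate.
    for pct in (0.02, 0.3):
        lrs = [one_cycle_lr(s, 1000, pct_start=pct) for s in range(1000)]
        print(pct, "peak reached at step", lrs.index(max(lrs)))
```

With a small pct_start almost the whole run is spent on the decaying part of the cycle, which may be why short warm-ups did better in the grid search above.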

Here is the weight decay search; the minimum loss is at wd=1e-5:

Here is the search for max_lr with:

  • number of training tokens = 2.5e8 and validation tokens = 0.5e8
  • drop_mult = 0.05
  • learn.fit_one_cycle(cyc_len=1, max_lr=lr, moms=(0.75, 0.85), wd=1e-5, pct_start=0.02)

This confirms that 5e-4 is the best value for max_lr. The decrease in loss fades out when approaching max_lr = 5e-4. We saw earlier that max_lr > 6e-4 tends to be unstable.

Here is the search for moms with:

  • learn.fit_one_cycle(cyc_len=epochs, max_lr=5e-4, moms=(g.moms-0.05, g.moms+0.05), wd=1e-5, pct_start=0.02)


Here is the search for moms_range with:

  • learn.fit_one_cycle(cyc_len=epochs, max_lr=5e-4, moms=(0.79-moms_span, 0.79 + moms_span), wd=1e-5, pct_start=0.02)


Neither the search for moms nor the search for its range shows a clear minimum, so we keep moms = (0.79-0.05, 0.79+0.05). More precise settings would require running the search for more epochs.
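For anyone repeating the two momentum searches, this is a small sketch of how the (low, high) pairs passed as moms could be generated. The grid values in centers and spans are hypothetical (the thread does not list the ones actually used), and moms_pair is an illustrative helper, not part of fastai.

```python
def moms_pair(center, span):
    """Return the (low, high) momentum bounds around a centre value.

    Rounding avoids float noise such as 0.7400000000000001.
    """
    return (round(center - span, 4), round(center + span, 4))

# Hypothetical grids, chosen only for illustration.
centers = [0.75, 0.77, 0.79, 0.81, 0.83]
spans = [0.01, 0.03, 0.05, 0.07]

# First search: vary the centre, keep the span fixed at 0.05.
center_grid = [moms_pair(c, 0.05) for c in centers]

# Second search: keep the centre at 0.79, vary the span.
span_grid = [moms_pair(0.79, s) for s in spans]

if __name__ == "__main__":
    print(center_grid)
    print(span_grid)
```

Each pair would then be passed as moms=... to learn.fit_one_cycle, with the other arguments kept as in the bullets above.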

Hi Kaspar, could you possibly share your pipeline for preprocessing wikipedia (or wikitext103) using sentencepiece? Thanks!

I will do so in a week’s time. I am currently doing a grid search for hyperparameters for training awd-lstm.

Hi again. Happy to help with the search if you can provide the exact prep process. It’s the little details on the exact datasets and on how to tokenize that would otherwise make our numbers incomparable.

the repo is here:

The important files are :

  • ULMFiT-Sentencepiece-Wiki-Training.ipynb

I am only using transformerXL and awd-lstm.

I have done some cleaning and am currently testing whether that broke anything. I will let you know.

I believe that the script will work for all western languages. For other languages you will have to adapt the regular expressions at the beginning of, and the rules in, wikijson2TrainingData.

The cleaning of the wiki_text can be improved by taking a more systematic approach to finding outlier sentences (i.e. sentences with too many control characters).

I keep parentheses “( some text or number)” but I suspect that the language model would be better if parentheses and their content were removed. Some sentences have so many parentheses that they are difficult to read, even for humans.
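The two cleaning ideas above could be sketched roughly like this. None of it is taken from the repo: strip_parentheses, symbol_ratio and is_outlier are hypothetical helpers, the 0.3 threshold is an arbitrary assumption, and I use "share of non-alphanumeric characters" as a stand-in for "too many control characters".

```python
import re

# Matches one innermost "( ... )" group plus any whitespace in front of it.
_PAREN = re.compile(r"\s*\([^()]*\)")

def strip_parentheses(text):
    """Remove parenthesised spans, including nested ones.

    Each pass removes the innermost groups; repeat until nothing changes.
    """
    previous = None
    while previous != text:
        previous = text
        text = _PAREN.sub("", text)
    return text

def symbol_ratio(sentence):
    """Fraction of non-space characters that are neither letters nor digits."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)

def is_outlier(sentence, max_ratio=0.3):
    """Flag sentences dominated by symbols (the threshold is an assumption)."""
    return symbol_ratio(sentence) > max_ratio

if __name__ == "__main__":
    print(strip_parentheses("Paul (born 1894 (in Cologne)) was a composer."))
    print(is_outlier("{{|}}||--"), is_outlier("A plain sentence."))
```

A more systematic approach would look at the distribution of symbol_ratio over the whole corpus before picking a threshold.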


I mostly use Windows, but the script works better on Linux.

Thank you @Kaspar for the great insight into the model’s hyperparameters. I’m using them to train with a Spanish corpus, and it’s working pretty well! 29.99 perplexity with 60k vocab. Have you played with mem_len as well?

No, but it’s tempting to reduce it in order to have a more lightweight model.

Has anyone had any success fine-tuning TransformerXL for classification on IMDb as yet?

I haven’t tried. Do you have experience with it?

@sgugger tried but didn’t get great results yet.

I am researching the perplexity/loss versus sentence length for language models based on awd-lstm and transformerXL. I may have an idea on how to improve the accuracy. It is so simple that I might make a fool of myself, so let me verify it on the IMDb data and return in a couple of days.

Meanwhile, @sgugger, if you could share your experience with hyperparameters for the transfer learning, that would help. It would also help to know how long you trained the transformerXL and on which GPU model.

I also don’t have great results yet. I trained a transformerXL language model with a 60K vocab on wikitext103 to validation loss 2.799 (perplexity 16.4); it generates interesting text. The language model didn’t finetune well to imdb on my first try (loss 3.7), and the subsequent classifier gave 93.2% accuracy, way below the accuracy of the AWD LSTM. The training of the unfrozen classifier model felt unstable (I had to go to small lr and the loss was jumping between 0.2 and 0.35). I probably still need to figure out something simple.
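For readers comparing the numbers in this thread: loss and perplexity are related by perplexity = exp(loss), assuming the loss is the mean cross-entropy in nats (natural log), which is consistent with the 2.799 / 16.4 pair quoted above. A tiny sketch, with hypothetical helper names:

```python
import math

def perplexity(loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)

def loss_from_perplexity(ppl):
    """Inverse conversion: mean cross-entropy (nats) for a given perplexity."""
    return math.log(ppl)

if __name__ == "__main__":
    print(round(perplexity(2.799), 1))            # wikitext103 language model above
    print(round(loss_from_perplexity(29.99), 3))  # Spanish model mentioned earlier
```

Keep in mind that perplexities are only comparable when the tokenization and vocabulary are the same.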

Oh, as I see Jeremy’s lesson I realize that maybe I could try gradient accumulation to deal with the stability of the optimizer. The model was big and I ended up using very small batch sizes, so maybe that could help a bit…
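As a reminder of what gradient accumulation does, here is a toy sketch in plain Python (no fastai or PyTorch): the gradients of several small batches are averaged before a single optimizer step, which reproduces the gradient of one large batch. The linear model, the data and the helper names are all made up for illustration.

```python
def grad_mse(w, b, batch):
    """Gradient of the mean squared error of y ~ w*x + b over one batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def accumulated_grad(w, b, data, micro_bs):
    """Accumulate micro-batch gradients, each weighted by its share of the data."""
    gw_acc, gb_acc = 0.0, 0.0
    for i in range(0, len(data), micro_bs):
        micro = data[i:i + micro_bs]
        gw, gb = grad_mse(w, b, micro)
        weight = len(micro) / len(data)
        gw_acc += gw * weight
        gb_acc += gb * weight
    return gw_acc, gb_acc

# Toy data on the line y = 3x + 1.
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]

full = grad_mse(0.0, 0.0, data)                       # one batch of 8
accum = accumulated_grad(0.0, 0.0, data, micro_bs=2)  # four micro-batches of 2

if __name__ == "__main__":
    print(full, accum)
```

In PyTorch the same idea is usually written as calling backward() on loss / n_accum for n_accum small batches and only then calling optimizer.step() and optimizer.zero_grad(). Layers that depend on batch statistics would still see the small batch.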

Yes, fastai.text fails in the last step due to lack of GPU memory. So strange. I suspect that the memory issue comes from the SortishSampler and pad_collate.

@Kaspar can you sample the model and maybe show an output or two? and what kind of datasets are you training it on? were you maybe able to replicate the original paper entirely with the results?

Well, that is actually not my focus now. I only have a 1080 Ti, so training transformerXL to full convergence will take at least a week with the current implementation in fastai (i.e. the GPU is never above 2% but the CPU is at 16%).

My focus is analysing transformerXL’s and awd_lstm’s perplexity/accuracy vs sentence length when training the language model and using it for IMDb sentiment analysis.


@Kaspar, with the latest fastai I get full GPU utilization (a 2080 Ti or better takes less than a day for 10 epochs). I use 150 bptt, 300 mem_len, and a 60K spaCy vocabulary, as per Sylvain’s suggestion here.

Here is a silly sample following a silly prompt:


= Paul Maria von Weber =

Paul Maria von Weber was a famous concert piano player and composer . Following the enthusiastic reception of the audience after a performance in Carnegie Hall , Paul and family settled in New York City where they lived the rest of their lives . Paul 's childhood travels through Europe influenced the music style and food preferences that came to be known as the von Weber style .

= = Early life = =


Paul Ludwig von Weber was born in the neighborhood of Cologne on February 28 , 1894 , the son of Otto von Weber , a physician and chair of the later branch of the Another Circles family , and his wife Sophie . After his father is dead while in Cologne , Paul and his grandmother moved frequently , under the influence of Ludwig der Weber , but fell in love with Larry von Weber , who had studied piano at the same time . Later , in 1906 , the couple moved to Hamburg when Paul and von Weber were involved in a coup d ’ état .
Paul 's father , Sophie Inflammation Paul ( she was born on March 27 , 1897 ) , was a columbia professor in medicine at the former University of Cologne , and later was an assistant professor in medical at the Gibbs Institute of Medical Fibrous Medicine of the University of Dresden . Paul and von Weber had two younger brothers : Ludwig and Clancy . Over a decade later , Gottlieb Everett von Weber ( 1865 – 1918 ) and Eduard Christoph Philip von Weber ( 1858 – 1920 ) moved from Hamburg to Cologne as they moved to Munich .
Paul o ’ Donnell was first a surgeon in wedge - shaped tube over a geiger counter and rose to the position of the local chief who entertained Paul and Prince Ludwig by calling him " Brother Wilhelm " . Paul came across this group , from a Bolton social club , and became a member of the committee of the Berlin Municipal Hospital ( 1919 – 1920 ) . Paul de Boer of Kassel became a member of the committee ; he went to Pursued Medical Studies and Elective Studies at the University of Vienna and the University of Berlin at Munich . a more serious child acquaintance of Paul , Exacerbated , it is reported that Paul followed his father 's advice to Franz , who opened his own business as a physician as it was he , to whom he made demands . Paul quickly became involved in the struggle for medical investment in Munich . In October 1919 , he married Peter of Championed Medical Education in a large ceremony at the Passion Cathedral . In August 1920 , Paul rented a cottage in Munich but soon took ill and died of edema . a historians And Daniels commented on the tragic events being loved and abroad , stating , " The wooden tree of death seems to have tossed an end to these life - threatening – and idyllic — atrocities . "

= = Historiography and influence = =

Paul considered the work the " most 1924 seminal work of art in European history . " At a time when he was kicked out of the army in 1926 , the works of Lewis R. Max , which transitioned Paul into a master works , both helped to create the new form of perfect art . The book earned him 1941 Nobel Prize in Philosophy .
The historian Martin Cherry writes that solos often struck most notable individuals within the center circle , arguably endeared themselves to writers and composers of the day . Pins , fusing the solos of his followers and those from other first - dimension subjects , would emit a visual energy , and otherwise were especially important later . Overlapping strands of music can be found in works he occasionally created . Paul also wrote There Is a Bet , a collection of birthday cards that offered the musicians time to answer questions . His works are fragments of what they would have weighed in .
In 1942 Paul co - wrote a number of works , including those of an Austrian art based on the Austrian Romantic masters . In the lyrics to his Seventh Symphony , he wrote , " If i have