Language Model Zoo 🦍

Thanks for your AWS instance (c5.18xlarge) tips.

Do you mind sharing the corpus size you were working with? Did you use SentencePiece/Spacy for tokenization?

cc: @shoof

I was curious about this too; a quick search shows some options here:

The following works for me when connecting to a remote machine via VNC:

watch -n 1 nvidia-smi

2 Likes

I compared loading weights from the pre-trained LM vs. no weights before training the LM on the target corpus. To my surprise, there wasn't any difference in loss (both started at 10). I re-ran the lecture notes to compare: without weights the loss started at 10, and with weights it started at 5.x.

I noticed that wgts['0.encoder.weight'].shape from wiki103 was [238462, 400], whereas I had only [32000, 400] due to the vocab size limit from SentencePiece. The target corpus vocab from the lecture was capped at 60k, and mine was at 33k. Would a smaller vocab in the pre-trained weights make the pre-trained LM less helpful? I'm only guessing, because I can't seem to find any other reason.

Just to clarify (I think you already know this): you can't use my pre-trained wiki103 weights, since that's a completely different (non-sentencepiece, English) vocab. That's why you're pretraining with your own corpus instead.

In my experiments 32000 is plenty for sentencepiece. The fact that your starting weights aren't helping doesn't surprise me - that's exactly what you'd expect if you had the bug I mentioned. Perhaps you somehow changed your sentencepiece vocab after training the original LM?

3 Likes

Hi Christine,
This is really awesome!

Question: In your function(s) for generating predictions, did you have to add in any conditions to prevent repetition / low diversity in the predictions? I've been training various word-level language models on text from different domains, mostly following along with Lesson 6 and Lesson 10 materials... but even for models that seem reasonably well-trained in terms of loss / accuracy, I keep generating text that repeats similar phrases over and over again.

Haven't had this issue so much with char-level models, but definitely with most of my word-level ones.

Thanks in advance for any guidance you may have,
Chloe

Hi - Yes, I had this exact problem also! The language model gives a probability for every word. The lesson notebook example always picks the most likely word, but I found the results were nicer if I kept the most likely 2-3 words and then sometimes picked randomly out of those.

So, say the sentence is "My favorite food is...", then the model might be offering ["pizza", "cake", "pasta"...] etc. as guesses for the next word. Rather than always going for prediction #1, I sometimes randomly pick out of the top 3 guesses.

So as I looped through, I'd usually pick the most likely word, but then for 1 out of every 4 or 5 words, I'd pick randomly instead. There are other ways to set up the selection though - numpy can do random selection by probability, so for every word you could make it 60% likely to choose the best guess, 30% likely to choose the 2nd guess, and 10% likely to choose the 3rd. Or make it choose based on the probabilities the language model spits out.
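For example, something like this (just a sketch - the words and weights are made up for illustration):

    import numpy as np

    # top 3 guesses from the language model for the next word
    next_words = ["pizza", "cake", "pasta"]

    # 60% chance of the best guess, 30% the 2nd, 10% the 3rd
    next_word = np.random.choice(next_words, p=[0.6, 0.3, 0.1])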

Hope this helps! I don't have the notebook in front of me right now, but let me know if more detail would be helpful & I can track it down.

8 Likes

FYI the easiest way to do something like what @mcleavey describes is torch.multinomial along with the probabilities from the model.
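For example (a rough sketch, assuming logits holds the model's raw scores for the next word):

    import torch
    import torch.nn.functional as F

    probs = F.softmax(logits, dim=-1)                   # convert the model's scores to probabilities
    next_word_idx = torch.multinomial(probs, 1).item()  # sample one word index by probability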

5 Likes

Thanks Christine!! Super helpful, and finally I'm getting some high-quality gibberish :).

And thanks Jeremy as well - before I saw your tip about torch.multinomial, I ended up modifying the sample_model function in one of the Part 1 notebooks. So now, instead of picking the highest-probability word other than <unk> every time, it randomly selects from the top-k predictions every 3 words (all credit to @mcleavey for her direction!!):

    for i in range(l):
        n = res[-1].topk(k)[1]                     # indices of the top-k predicted words
        if i % 3 == 0:
            # every 3rd word: pick one of the top k at random
            ix = np.random.choice(k, 2, replace=False)
            # if the first random pick is <unk> (token id 0), use the second instead
            n = n[ix[1]] if n.data[ix[0]] == 0 else n[ix[0]]
        else:
            # otherwise take the top prediction, falling back to the 2nd if it's <unk>
            n = n[1] if n.data[0] == 0 else n[0]

Here's a sample from a model trained on 7 MB of Plato (I manually added in line breaks...):

Input: "Socrates was sentenced to death..."

Output:
"ā€¦and he was not a good man , nor a man a bad man who was a bad man . And now , if you are right , you may be right in saying that I am not right in saying so , but that I am not mistaken . And I am certain that I am not mistaken ā€¦
I will tell me , my friends , what you mean by the name of a man , and what is the name of virtue , and what is the meaning of the word ā€™ being , ā€™ and how the name of the name is to be attributed to the one ? ā€¦
THEAETETUS : Yes ; that is my meaning .
STRANGER : And the name of the name is that which is not the name of the name of the name .
THEAETETUS : I should say not .
STRANGER . And if a person were to say that he was a king , and a son a father , and that the son of a father , and a brother , and a son of a father , the son of a father , who was a father , would have been a father , and would be a son of a mother , and of a father or a mother ?
THEAETETUS : No , indeed .
SOCRATES : And if the son of a father is a mother , or of a father or a mother ? THEAETETUS : Certainly not .
SOCRATES : And he who is of the same mind may be supposed to have no name in his name ?
HERMOGENES : Certainly , Socrates .
SOCRATES : And the same may appear to be the case

7 Likes

I looked into the source code that @binga and @ppleskov have. When calling the get_all method, I encountered this warning:

    Warning: no model found for 'en'
    Only loading the 'en' tokenizer.

Looks like it is related to a spaCy model-linking issue. Anyone else had the same problem? How did you resolve it?

Hey all,

@jeremy pointed me to this thread yesterday. Totally wasn't aware of the awesome work you're doing! I'm quite busy until May 22 (EMNLP deadline) but would love to help out where I can afterwards.

8 Likes

Welcome Sebastian!! It's really awesome to have you on the team.

1 Like

Hey Nahid, I ran into that problem once before. I just reinstalled it and it worked.
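If a plain reinstall doesn't do it, the usual fix for that warning (assuming a spaCy 2.x install) is to download the English model:

python -m spacy download en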

1 Like

I think @binga used the spaCy EN tokenizer for pre-segmented non-English text, which could have some potential issues, and Jeremy recommended using SentencePiece for tokenization as an alternative (I think it may be even better than the spaCy tokenizer for non-Roman languages). I used it for Chinese and it segments and tokenizes in one shot. Worth a try, although you'd need to build its segmentation model first (I still need to write something about it once my results look good).
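For reference, building and using a SentencePiece model looks roughly like this (just a sketch - the file names and the 32k vocab size are placeholders):

    import sentencepiece as spm

    # train a subword/segmentation model directly on the raw, unsegmented corpus
    spm.SentencePieceTrainer.Train(
        '--input=corpus.txt --model_prefix=sp --vocab_size=32000')

    # load it and tokenize: segmentation + tokenization in one shot
    sp = spm.SentencePieceProcessor()
    sp.Load('sp.model')
    tokens = sp.EncodeAsPieces('some raw text from the corpus')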

2 Likes

Something that would be really helpful is for folks doing this to find a useful classification dataset in your language and add to the wiki post at the top: a link to the dataset(s), a link to any paper(s) that show a benchmark classification result for that dataset, and the classification accuracy reported in the paper for that dataset.

That way, we can try to build (semi)-automated tools that include building and testing a classifier.

If you can't find any NLP classification papers for a corpus in your language, then of course you won't be able to show a benchmark - in that case, it would still be great to provide a link to a labeled dataset in your language (e.g. newspaper articles by topic, social media posts by category, or restaurant reviews with # stars, etc.).

4 Likes

Could you please elaborate on what you mean here: "I'm also gaining a few points by allowing a longer time to relax at the end: use_clr_beta = (10,33,0.95,0.85)"?

Thanks

The 1cycle policy has three phases:

  1. the learning rate grows as the momentum shrinks
  2. the learning rate gets back to the minimum as the momentum rises again
  3. we annihilate the learning rate (decaying it linearly to 1/100th of its value)

The time to relax at the end that I mention is the third phase. In Leslie Smith's examples, for instance super-convergence on CIFAR-10, we use 10 to 15% of our budget in epochs for this third phase, but I found it better to use more when training an LM: the 33 in use_clr_beta means 33% (or one third) for the third phase, which leaves roughly one third of the budget for each phase.
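To make the three phases concrete, here is a rough sketch of the learning-rate schedule (not the actual fastai code; the numbers just mirror use_clr_beta = (10, 33, 0.95, 0.85), and the iteration count is made up):

    import numpy as np

    lr, div, pct = 0.01, 10, 33                  # peak LR, div factor, % of cycle for phase 3
    n = 1000                                     # total iterations in the cycle
    n3 = int(n * pct / 100)                      # iterations for phase 3
    n1 = n2 = (n - n3) // 2                      # phases 1 and 2 split the rest

    phase1 = np.linspace(lr / div, lr, n1)       # LR up   (momentum 0.95 -> 0.85)
    phase2 = np.linspace(lr, lr / div, n2)       # LR down (momentum 0.85 -> 0.95)
    phase3 = np.linspace(lr / div, lr / div / 100, n3)   # annihilation: down to 1/100th
    schedule = np.concatenate([phase1, phase2, phase3])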

6 Likes

Thanks for an insightful explanation. I want to check my understanding: I recall reading previously that using 13.86% (or thereabouts) of the total number of epochs was recommended. Was that based on vision models, with 33% recommended for LMs?

The 13.68% was just to replicate an experiment from Leslie Smith: he had done a cycle with 82 epochs for phases 1 + 2 and 13 epochs for phase 3, so the third phase was 13/95 = 13.68% of the budget.
In general, in his paper he used 10 to 15%, but as I said, that's worth experimenting with, since sometimes more helps.

4 Likes

How do I edit the wiki to include Brazilian Portuguese?

Click the 'edit' button on the bottom right of the post.