Language Model Zoo 🦍

I tried pdfminer, but prefer the simpler scripting I have set up. See: extract.sh

3 Likes

Use the sentencepiece python module to create a list of ids for each of your docs. fastai.text works with lists of ids (i.e. we're simply using sentencepiece to 'numericalize' for us).

No, you're not losing information; it's still there, just split into multiple tokens.
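
A minimal sketch of that workflow (the file names, vocab size, and the docs variable here are just placeholders):

import sentencepiece as spm

# Train a sentencepiece model on a plain-text file with one document/sentence per line
spm.SentencePieceTrainer.Train('--input=all_docs.txt --model_prefix=m --vocab_size=30000')

# Load the trained model and numericalize each document into a list of ids
sp = spm.SentencePieceProcessor()
sp.Load('m.model')

docs = ["first document text", "second document text"]   # placeholder corpus
ids_per_doc = [sp.EncodeAsIds(d) for d in docs]           # lists of ids for fastai.text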

1 Like

Just to share a few first results on the wiki103 set. French does indeed seem to be an easier language for an LM!
I could still use learning rates up to 8, and for now I've done a cycle of length 2 and a cycle of length 4. For the cycle of length 4 I used half the dropouts you found best, Jeremy, and was still underfitting. The results are:

  • cycle of length 2: 3.97 validation loss (51.4 in perplexity)
  • cycle of length 4: 3.69 validation loss (40 in perplexity)

Note that it took barely three and a half hours to train that second model on a Paperspace P6000, so that's not too bad.

6 Likes

And to complete the loop, a cycle of length 10 gets an even better validation loss of 3.52 (33.78 perplexity), again with the dropouts halved.
I'm still slightly underfitting at the end of training (training loss of 3.59), so the dropouts could probably be reduced even more.

2 Likes

FYI I think you should try wikitext-2, using the full vocab (wikitext-2 already has all freq<3 words replaced with UNK, so it shouldn't be too big). You might find you have something close to SoTA. If so, you could then implement the pointer-cache thing (just copy it from AWD-LSTM; they've already implemented it there) and you should find that you get SoTA. That would be pretty cool!

1 Like

Oh also, look out for gradient clipping. I have it enabled in my sample code, but you may want to reduce it or remove it in these experiments, since it may get in the way of exploiting super-convergence.
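
For reference, this is roughly how the clip setting looks with the fastai v0.7 learner (the 0.3 value is just the usual one from the sample notebooks):

learner.clip = 0.3    # gradient clipping, as enabled in the sample code
# ...or relax/disable it while experimenting with super-convergence:
learner.clip = None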

1 Like

Oh this is smaller, perfect for experimenting!
I'll try a bit of all of this.

This seems like great work. Can you please share how you collected and combined the data?

Thanks! I used two Python libraries, requests and BeautifulSoup, to loop through each page and extract the text and the link to the next page. requests handles the HTML requests, and then BeautifulSoup gives you a way to extract parts of the page by tag. I can post a sample Jupyter notebook for that if you like. I don't know of a way to do this generally: I had to set up a loop for each overall site, one for Merck, one for MSF, etc. (Just to note: when I do something like this, I'm always extremely careful to make sure I haven't created any infinite loops or anything else that would overwhelm the site I'm requesting from.)
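
Roughly, each per-site loop looked something like this sketch (the URL, CSS classes, and file names are placeholders; every site needs its own):

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/manual/chapter-1"   # placeholder start page
pages = []

while url:
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Pull out the main text of the page (the tag/class is site-specific)
    body = soup.find("div", class_="content")
    if body:
        pages.append(body.get_text(separator="\n"))

    # Follow the "next page" link, if there is one (also site-specific)
    nxt = soup.find("a", class_="next")
    url = urljoin(url, nxt["href"]) if nxt else None

    time.sleep(1)   # be polite: don't overwhelm the site

with open("corpus.txt", "w") as f:
    f.write("\n\n".join(pages))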

At that point, I just saved the files as text files (pulling some chunks over to a validation set) and ran them similarly to other language models. One thing to note is that the language model is quite sensitive to the proportion of text, i.e. if I have 90% Merck Manual and 10% patient case studies, the generated text will sound much more like a Merck manual.

Some thoughts on what I'd like to try next:

  • Adding a GAN (though I'm wondering how well the error signal can get back to the LSTM, since it'll have to pass through the language-generation step. Maybe a CNN generator is better?).
  • Trying to train the LSTM language model to predict the next 4 words rather than 1 (possibly a crazy thought, but I'm wondering if I can make the model more accurate by making its training job more demanding).
  • Finding a corpus of medical questions and answers (possibly USMLE? Ideally something with shorter/easier/more uniform questions).
  • Using the model to do text -> keywords (i.e. give it a written radiology report and ask it to output a few labels).
  • Another crazy thought: a CycleGAN that goes "medicalese" -> "medical language normal people could understand" -> "medicalese".

Lots of other crazy ideas (particularly trying to think whether I can include images), but these are some starting points.

Let me know if anyone wants to try (any of the above, or any better ideas you might have!)

8 Likes

That's amazing. I would like to try the same; please share the notebook for the web scraping. You should build an app after you finish modelling. It would be great for society.

Awesome work @mcleavey! @RanjeetSingh, for BeautifulSoup it seems this tutorial may help: https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe. For my needs I also used requests, and I used http://lxml.de/ for scraping. A very short Google search will produce many tutorials.

4 Likes

Thanks! I'll put up a link soon for the web scraping. Just to clarify: I was never expecting that this medical text generator itself would be able to generate real medical info (it often sounds good, but is obviously not medically accurate/reliable). I'm hoping, though, to use this as a jumping-off point for other projects, and yes, I'm hoping that something will turn out to be useful for society 🙂

2 Likes

On African Languages and Sentence Piece

(Since Africa has 2000+ languages, when I use "African" I probably mean "Bantu" languages, and by Bantu languages I'm really thinking of "Nguni" languages, which are pretty much the languages of Southern Africa, in particular languages with "clicks".)

Since there aren't many NLP resources for many African languages, I've been using sentencepiece and a lot of web scraping for my chosen language of Xhosa. If I can achieve great results, I'd like to publish at least a blog post on dealing with "minority" languages which don't currently have language models, or great language models.

Integrating Sentencepiece into FastAI

This has so far been hit and miss for me. I'm able to tokenize the words (essentially what spaCy does). Sentencepiece produces two files after training on a corpus of text: a "model" file and a "vocab" file. The model file is used by sentencepiece to tokenize words into a list of IDs, and the "vocab" file seems to be a list of tokens and their float weights.

I'm not sure how to take my "vocab" file (list of tokens and associated weights) and treat it/pass it into fastai.text as the expected "h5" file. For example, fastai.text uses PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'. I'm unsure if I can instead plug in my vocab file with minimal changes to it. (I intend to look at the fastai.text library to see how it uses weights, and maybe that might provide insight.) However, if anyone using sentencepiece has gone through this process already (cc @jeremy), I'd certainly appreciate your direction.
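
In case it helps, the "vocab" file appears to be just tab-separated (piece, score) pairs, so it can at least be read into a plain token list (the m.vocab name is an assumption; sentencepiece writes it next to the model file):

itos = []
with open('m.vocab', encoding='utf-8') as f:
    for line in f:
        piece, score = line.rstrip('\n').split('\t')
        itos.append(piece)

vs = len(itos)   # vocab size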

Memory with Sentencepiece and FastAI

Using sentencepiece to train on a document of about 500,000 words was fast and needed little memory: less than a minute on my laptop (I know this isn't specific enough). However, when I tried to train my model with learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1) on the same document, I got out-of-memory errors on my laptop and also on an AWS p2.xlarge.

I tried cutting my batch size (bs) in half, which didn't help. I'm not sure whether I should keep the same embedding size, number of hidden units, and number of layers, i.e. em_sz,nh,nl = 400,1150,3, and whether that affects memory. At any rate, I feel I'm way beyond my scope of expertise here.

At any rate, if I get this to work, I want to start looking into speech-to-text. I'd like to crowdsource audio recordings of sentences from my corpus (which I scraped from Wikipedia, Twitter, the Xhosa Bible, and online Xhosa books) from hundreds of Xhosa speakers… I have a friend from Senegal whose mother only speaks Wolof and therefore can't use voice commands on her phone. I think tech can be localized/diversified even further if more people could use it easily in languages they are comfortable and familiar with.

10 Likes

I think the weights in the vocab file are different from the ones in a language model. The fwd_wt103.h5 weights are from the pretrained LM, so we can use it as a backbone with a custom head for any new text (as long as we match the vocab). I haven't read the details on sentencepiece, but I have a feeling that you cannot just use it like we did with the language model. What I did to train the LM was to use only the tokens.

My initial try was to use sentencepiece to segment the text (in Chinese), then use the spaCy English tokenizer to tokenize it. The training seemed to work, since after segmentation the Chinese text has almost the same structure as English, where phrases are separated by spaces. However, it's definitely not kosher, as the English tokenizer converts a lot of things particular to English and ignores features in other languages. I'm fixing it by using the sentencepiece tokens directly now. I know @binga seems to have also used the "en" tokenizer from the lecture notes for Telugu initially, but I don't know how the classification results turned out.

Using sentencepiece to train on a document of about 500,000 words was fast and needed little memory: less than a minute on my laptop (I know this isn't specific enough). However, when I tried to train my model with learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1) on the same document, I got out-of-memory errors on my laptop and also on an AWS p2.xlarge.

How many tokens do you have in total (after segmentation)? I had to reduce the tokens from 400M to 100M to make it work with 32GB RAM. I suspect it could be a similar problem for you.
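
A quick way to check, assuming your numericalized training/validation sets are lists of id lists (called trn_lm and val_lm here):

total_tokens = sum(len(doc) for doc in trn_lm) + sum(len(doc) for doc in val_lm)
print(f'{total_tokens:,} tokens after segmentation')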

3 Likes

@sabzo, please please please write a blog post on how you did this when you are done.

Actually, everybody doing a non-Indo-European language please write a blog post on how you did this!

(I haven't started the Esperanto yet, but I will, I will, really I will…)

2 Likes

The output of a tokenizer (which is what we are using sentencepiece as) is simply:

  • A numericalized version of the corpus
  • A vocab (in our code we call this itos, since that's what torchtext calls it)

The h5 file is entirely unrelated to this - the h5 file is a trained language model. It is trained from the numericalized corpus. So you need to take your sentencepiece ids, and train an LM using them.

(@binga @shoof) You shouldn't generally be using spaCy at all if you're using sentencepiece.
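
If you want the vocab straight from sentencepiece, one way (just a sketch; the m.model file name is an assumption) is to build itos from the trained model itself:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('m.model')

itos = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]   # id -> token, like torchtext's itos
stoi = {s: i for i, s in enumerate(itos)}                    # token -> id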

4 Likes

I am working on a model for Finnish, and have just started training a sentencepiece version. I haven't seen results yet, so I don't know how well this will work… but at least the training got started.

This is what I did:

(1) First train the sentencepiece model [you have done that already, and you probably have a file called m.model somewhere]

(2) Run the following in the notebook

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/home/.../m.model")

After that you can encode text into numbers:

sp.EncodeAsIds("some text in your local language")

(3) You will need to get your sentencepiece tokenization into trn_lm using the encoder above… so something like:

trn_sentencepiece = []
for i in range(0, len(df_trn)):
    trn_sentencepiece.append(sp.EncodeAsIds(df_trn.iloc[i]['text']))
    if i % 10000 == 0:
        print(i)
    
trn_lm = np.array(trn_sentencepiece)

(4) …then the same for the validation set…
(5) and then train the model like in the notebook (I guess you should use the vocab size (vs) you asked sentencepiece to generate)

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH_LM, 1, vs, trn_dl, val_dl, bs = bs, bptt = bptt)
etc
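
In case it helps, here is roughly how that "etc" continues for me, following the usual fastai v0.7 language-model notebook (these hyperparameters are the standard ones from that notebook, not tuned for Finnish):

from functools import partial
import numpy as np
import torch.optim as optim

em_sz, nh, nl = 400, 1150, 3                            # embedding size, hidden units, layers
wd = 1e-7
lr = 1e-3
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.7    # dropouts, scaled as in the notebook

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.clip = 0.3
learner.fit(lr, 1, wds=wd, use_clr=(32, 2), cycle_len=1)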
15 Likes

This looks perfect - and almost certainly necessary to get good results in Finnish.

1 Like

Just released a language model for Bangla, trained on the Wikipedia corpus. Performance can certainly be improved, as I have barely scratched the surface.

4 Likes

See if @sgugger's 1cycle params help…

1 Like