For isiXhosa I’m using tweets from South African news websites; this might work well for you too. For PDF to text, maybe this will help: http://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Using%20Python%20to%20Convert%20PDFs%20to%20Text%20Files.php ? I’m also curious: have you looked into Congo and Kenya as well, to get a regional dataset of Swahili?
I was able to use sentencepiece to create a tokenizer for isiXhosa using a corpus of ~600,000 words from various Xhosa books, Xhosa wikipedia and Xhosa tweets. However, I’m not sure how to plug the sentencepiece files into fast.ai training. Sentencepiece outputs a “.model” and a “.vocab” file. @shoof I’d be interested in how you incorporated sentencepiece into FastAi. @jeremy Would you have any suggestions I can try?
Also, Xhosa is a very agglutinative language; as a result I have around 200,000 unique words before applying the sentencepiece tokenizer. For my vocab I’m using a limit of 32,000 (I know Jeremy recommended 60,000). However, I wonder whether I’m losing information, since my count of unique words is so much larger?
Thanks and I’d appreciate some help moving forward.
I tried pdfminer, but prefer the simpler scripting I have set up. See: extract.sh
Use the sentencepiece python module to create a list of ids for each of your docs. fastai.text works with lists of ids (i.e. we’re simply using sentencepiece to ‘numericalize’ for us).
No, you’re not losing information - it’s still there, just split into multiple tokens.
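To make the "still there, just split" point concrete, here is a toy sketch. It uses a greedy longest-match splitter as a stand-in, which is not sentencepiece's actual algorithm, and the piece inventory is hypothetical rather than taken from any trained model. The only thing it is meant to show is that joining the pieces gives back the original word.

```python
def segment(word, pieces):
    """Greedy longest-match split of `word` into known pieces (toy stand-in)."""
    out, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first; fall back to a single character.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces or j == i + 1:
                out.append(word[i:j])
                i = j
                break
    return out

# Hypothetical piece inventory, for illustration only.
pieces = {"ndi", "ya", "ku", "thanda", "bona"}

tokens = segment("ndiyakuthanda", pieces)
print(tokens)            # ['ndi', 'ya', 'ku', 'thanda']
print("".join(tokens))   # ndiyakuthanda
```

A word with no known pieces simply falls back to single characters, so nothing is dropped; the sequence just gets longer.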
Just to share a few first results on the wiki103 set. French seems indeed an easier language for a LM!
I could still use learning rates up to 8, and for now I’ve done a cycle of length 2 and a cycle of length 4. For the cycle of length 4 I used half the dropouts you found best, Jeremy, and was still underfitting. The results are:
- cycle of length 2: 3.97 validation loss (51.4 in perplexity)
- cycle of length 4: 3.69 validation loss (40 in perplexity)
Note that it took barely three and a half hours to train that second model on a Paperspace P6000, so that’s not too bad.
And to complete the loop, a cycle of length 10 gets an even better validation loss at 3.52 (33.78 perplexity), again with the dropouts halved.
I’m still slightly underfitting at the end of training (3.59 of training loss) so they could probably be reduced even more.
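For anyone wanting to convert between the two numbers quoted above: a minimal sketch, assuming the validation loss is mean cross-entropy in nats, so that perplexity is just exp(loss). Under that assumption 3.69 and 3.52 reproduce the reported 40 and 33.78, while 3.97 comes out closer to 53 than 51.4, so one of that pair may have been rounded or mistyped.

```python
import math

# Validation losses reported above, keyed by cycle length.
val_loss = {2: 3.97, 4: 3.69, 10: 3.52}

def perplexity(loss):
    # Assumes the loss is mean cross-entropy measured in nats.
    return math.exp(loss)

for cycle_len, loss in val_loss.items():
    print(f"cycle_len={cycle_len}: loss={loss:.2f} -> perplexity={perplexity(loss):.2f}")
```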
FYI I think you should try wikitext-2, using the full vocab (wikitext-2 already has all freq<3 words replaced with UNK, so it shouldn’t be too big). You might find you have something close to SoTA. If so, you could then implement the pointer-cache thing (just copy it from AWD-LSTM - they’ve already implemented it there) and you should find that you get SoTA. That would be pretty cool!
Oh also, look out for gradient clipping. I have it enabled in my sample code, but you may want to reduce it or remove it in these experiments, since it may get in the way of exploiting super-convergence.
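As a reminder of what clipping does to an update, here is a minimal pure-Python sketch of clipping by global L2 norm. The function name and the threshold of 0.3 are illustrative assumptions, not values from the sample code. The point is that any gradient larger than the threshold is scaled down to it, which is why an aggressive setting could blunt the very large steps these experiments rely on.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]  # L2 norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.3)
print(norm_before, clipped)
```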
Using use_clr_beta and new plotting tools
Oh this is smaller, perfect for experimenting!
I’ll try a bit of all of this.
This seems great work. Can you please share about how you collected and combined the data?
Thanks! I used two python libraries: requests and BeautifulSoup to loop through each page and extract the text & link to the next page. requests handles the html requests, and then BeautifulSoup gives you a way to extract parts of the page by tag. I can post a sample jupyter notebook for that if you like. I don’t know of a way to do this generally – I had to set up a loop for each overall site – one for Merck, one for MSF, etc. (Just to note - when I do something like this, I’m always extremely careful to make sure I haven’t created any infinite loops or anything else that would overwhelm the site I’m requesting from.)
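Until the notebook is up, here is a rough sketch of the loop described above. To keep it self-contained it swaps in the standard library's html.parser instead of BeautifulSoup and uses a fake in-memory site instead of requests; the `class="next"` link, the `PageParser` and `crawl` names, and the `max_pages` cap are all assumptions for illustration. The `seen` set and the page cap are the guards against infinite loops.

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collects text inside <p> tags and the href of an <a class="next"> link."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.text = []
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self.in_p = True
        elif tag == "a" and attrs.get("class") == "next":
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.text.append(data.strip())

def crawl(start_url, fetch, max_pages=100):
    """Follow 'next' links from start_url, returning the text of each page."""
    seen, pages, url = set(), [], start_url
    # Stop on a missing link, a repeated URL, or the page cap.
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(fetch(url))
        pages.append(" ".join(parser.text))
        url = parser.next_url
    return pages

# Fake two-page site whose second page links back to the first.
site = {
    "/a": '<p>First page.</p><a class="next" href="/b">next</a>',
    "/b": '<p>Second page.</p><a class="next" href="/a">next</a>',
}
pages = crawl("/a", fetch=site.__getitem__)
print(pages)  # ['First page.', 'Second page.']
```

Against a real site, `fetch` could be something like `lambda url: requests.get(url).text`, ideally with a `time.sleep` between requests so the server is not overwhelmed.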
At that point, I just saved files as text files (pulling some chunks over to a validation set), and ran them similarly to other language models. One thing is that the language model is quite sensitive to the proportion of text – ie: if I have 90% Merck manual, and 10% patient case studies, the generated text will sound much more like a Merck manual.
Some thoughts on what I’d like to try next: Adding a GAN (though I’m wondering how well the error signal can get back to the LSTM, since it’ll be through the language generation step. Maybe a CNN generator is better??). Training the LSTM language model to predict the next 4 words rather than 1 (possibly a crazy thought, but I’m wondering if I can make the model more accurate by making its training job more demanding). Finding a corpus of medical question/answer pairs (possibly USMLE? ideally something with shorter/easier/more uniform questions). Using the model to do text->keywords (ie, give it a written radiology report and ask it to output a few labels). Another crazy thought, but wondering if I could do a CycleGAN and go “medicalese” -> “medical language normal people could understand” -> “medicalese”. Lots of other crazy ideas (particularly trying to think if I can include images), but these are some starting points.
Let me know if anyone wants to try (any of the above, or any better ideas you might have!)
That’s amazing. I would like to try the same, so please share the notebook for the web scraping. You should build an app after you finish modelling. It would be great for society.
Awesome work @mcleavey! @RanjeetSingh for BeautifulSoup it seems this tutorial may help https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe. For my needs I also used requests, along with http://lxml.de/ for scraping. A very short google search will produce many tutorials.
Thanks! I’ll put up a link soon for the web scraping. Just to clarify - I was never expecting that this medical text generator itself would be able to generate real medical info (it often sounds good, but is obviously not medically accurate/reliable). I’m hoping though to use this as a jumping off point for other projects & yes I’m hoping that something will turn out to be useful for society
On African Languages and Sentence Piece
– Since Africa has 2000+ languages, when I use “African” I probably mean “Bantu” languages, and by Bantu languages I’m really thinking of “Nguni” languages, which are pretty much the languages of Southern Africa, in particular languages with “clicks” –
Since there aren’t many NLP resources for many African languages, I’ve been using Sentencepiece and a lot of web scraping for my chosen language of Xhosa. If I can achieve great results I’d like to publish at the least a blog post on dealing with “minority” languages which don’t currently have language models, or great language models.
Integrating Sentencepiece into FastAI
So far this has been hit and miss for me. I’m able to tokenize the words (essentially what Spacy is doing). Sentencepiece provides two files after training on a corpus of text: a “model” file and a “vocab” file. The model file is used by sentencepiece to tokenize words into a list of IDs, and the “vocab” file seems to be a list of tokens and their float weights.
I’m not sure how to take my “vocab” file (list of tokens and associated weights) and pass it into FastAi.text as the expected “h5” file. For example, FastAi.text is using PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'. I’m unsure if I can instead plug in my vocab file with minimal changes to it. (I intend to look at the FastAi.text library to see how it uses weights, and maybe that will provide some insight.) However, if anyone using sentencepiece has gone through this process already (cc @jeremy) I’d certainly appreciate your direction.
Memory with Sentencepiece and FastAI
Training sentencepiece on a document of ~500,000 words used little memory and was fast: less than a minute on my laptop (I know this isn’t specific enough). However, when I tried to train my model with learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1) on a similar document, I got out of memory errors on my laptop and also on an AWS p2.xlarge.
I tried halving my batch size (bs), which didn’t help. I’m not sure if I should keep the same embedding size, number of layers etc. as is (em_sz,nh,nl = 400,1150,3), or whether that affects memory. At any rate I feel I’m way beyond my scope of expertise here.
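On whether em_sz,nh,nl = 400,1150,3 drives memory, a back-of-the-envelope parameter count may help. This is a sketch under assumptions: an AWD-LSTM-style layout where the last LSTM layer projects back to em_sz and the decoder shares the embedding weights, PyTorch's LSTM parameter layout, and the 32,000-piece vocab mentioned earlier in the thread. The helper names are made up.

```python
def lstm_params(n_in, n_hid):
    # One LSTM layer: 4 gates, each with input weights, hidden weights
    # and two bias vectors (the layout PyTorch's nn.LSTM uses).
    return 4 * n_hid * (n_in + n_hid) + 8 * n_hid

def lm_params(vocab_sz, em_sz, nh, nl):
    # Assumed layout: embedding -> nl LSTM layers, the last projecting back
    # to em_sz; decoder weights tied to the embedding (only its bias is extra).
    total = vocab_sz * em_sz + vocab_sz
    for layer in range(nl):
        n_in = em_sz if layer == 0 else nh
        n_out = em_sz if layer == nl - 1 else nh
        total += lstm_params(n_in, n_out)
    return total

n = lm_params(vocab_sz=32000, em_sz=400, nh=1150, nl=3)
print(n, "parameters,", round(n * 4 / 2**20), "MiB as float32")
```

Under these assumptions the weights come to roughly 33 million parameters, which is small next to GPU or laptop RAM even after adding gradients and optimizer state. That suggests, without proving it, that the out-of-memory errors come from activations or from how the corpus is held in memory rather than from the layer sizes themselves.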
At any rate, if I get this to work, I want to start looking into speech-to-text. I’d like to crowdsource audio recordings of sentences from my corpus (which I scraped from Wikipedia, Twitter, the Xhosa Bible and online Xhosa books) from hundreds of Xhosa speakers… I have a friend from Senegal whose mother only speaks Wolof and therefore can’t use voice commands on her phone. I think tech can be localized/diversified further if more people could use it easily in languages they are comfortable and familiar with.
I think the weights in vocab are different from the ones in a language model. The fwd_wt103.h5 weights are from the pretrained LM, so we can use it as a backbone with a custom head for any new text (as long as we match the vocab). I haven’t read the details on sentencepiece but I have a feeling that you cannot just use it like we did with the language model. What I did to train the LM was to use only the tokens.
My initial try was to use sentencepiece to segment the text (in Chinese), then use the spaCy English tokenizer to tokenize it. The training seemed to work, since after segmentation the Chinese text has almost the same structure as English, with phrases separated by spaces. However, it’s definitely not kosher, as the English tokenizer converts a lot of things particular to English and ignores features of other languages. I’m fixing it by using the sentencepiece tokens directly now. I know @binga seems to have also used the “en” tokenizer from the lecture notes for Telugu initially, but I don’t know how the classification results turned out.
How many tokens do you have in total (after segmentation)? I had to reduce the tokens from 400M to 100M to make it work with 32GB RAM. I suspect it could be a similar problem for you.
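A quick way to sanity-check the token count against available RAM is to estimate how much a flat array of ids needs. The sketch below is only arithmetic, using the standard library's array module for element sizes; the `corpus_bytes` helper is hypothetical, and the uint16 option assumes every id is below 65,536 (true for a 32,000-piece vocab).

```python
from array import array

def corpus_bytes(n_tokens, typecode):
    # Bytes needed to hold n_tokens ids in one flat typed array.
    return n_tokens * array(typecode).itemsize

for n_tokens in (100_000_000, 400_000_000):
    for name, code in (("int64", "q"), ("uint16", "H")):
        gib = corpus_bytes(n_tokens, code) / 2**30
        print(f"{n_tokens:>11,} tokens as {name}: {gib:.2f} GiB")
```

Even 400M ids fit in a few GiB as a flat array, so if memory runs out well before that, the likely culprits are intermediate Python lists of ints (each element is a full Python object) and temporary copies made while concatenating. That is a guess from the arithmetic, not something measured here.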
@sabzo, please please please write a blog post on how you did this when you are done.
Actually, everybody doing a non-Indo-European language please write a blog post on how you did this!
(I haven’t started the Esperanto yet, but I will, I will, really I will…)
The output of a tokenizer (which is what we are using sentencepiece as) is simply:
- A numericalized version of the corpus
- A vocab (in our code we call this itos, since that’s what torchtext calls it)
The h5 file is entirely unrelated to this - the h5 file is a trained language model. It is trained from the numericalized corpus. So you need to take your sentencepiece ids, and train an LM using them.
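To illustrate the vocab half of that output, here is a small sketch that reads a sentencepiece-style “.vocab” listing into itos and stoi. It assumes the usual layout of one `piece<TAB>score` pair per line with ids following line order; the sample content and the `read_vocab` name are made up. The scores are only used by the segmenter itself and play no part in the language model weights.

```python
import io

def read_vocab(lines):
    """Parse 'piece<TAB>score' lines into itos (id -> piece) and stoi (piece -> id)."""
    itos, scores = [], []
    for line in lines:
        piece, score = line.rstrip("\n").split("\t")
        itos.append(piece)
        scores.append(float(score))
    stoi = {piece: i for i, piece in enumerate(itos)}
    return itos, stoi, scores

# Made-up sample in the same layout; a real file would be read with
# something like open("m.vocab", encoding="utf-8") instead (file name assumed).
sample = io.StringIO("<unk>\t0\n<s>\t0\n</s>\t0\n\u2581ndi\t-3.2\nya\t-3.9\n")
itos, stoi, scores = read_vocab(sample)
print(len(itos), stoi["ya"])  # 5 4
```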
I am working with a model for Finnish, and just started the training of a sentencepiece version. Haven’t seen results yet, so don’t know how well this will work… but at least the training got started.
This is what I did:
(1) First train the sentencepiece model [you have done that already, and you probably have a file called m.model somewhere]
(2) Run the following in the notebook
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/home/.../m.model")
After that you can encode text into numbers:
sp.EncodeAsIds("some text in your local language")
(3) you will need to get your sentencepiece tokenization into trn_lm using the encoder above… so something like
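import numpy as np

# encode every training document into a list of sentencepiece ids
trn_sentencepiece = []
for i in range(0, len(df_trn)):
    trn_sentencepiece.append(sp.EncodeAsIds(df_trn.iloc[i]['text']))
    if i % 10000 == 0: print(i)  # progress indicator
# documents have different lengths, so store them as an object array
trn_lm = np.array(trn_sentencepiece, dtype=object)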
(4) …then the same for validation set…
(5) and then train the model like in the notebook (I guess you should use the vocab size (vs) you asked sentencepiece to generate)
trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH_LM, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)
etc
This looks perfect - and almost certainly necessary to get good results in Finnish.