This seems like great work. Could you please share how you collected and combined the data?
Thanks! I used two Python libraries, requests and BeautifulSoup, to loop through each page and extract the text and the link to the next page. requests handles the HTTP requests, and BeautifulSoup gives you a way to extract parts of the page by tag. I can post a sample Jupyter notebook for that if you like. I don’t know of a way to do this generally; I had to set up a loop for each overall site: one for Merck, one for MSF, etc. (Just to note: when I do something like this, I’m always extremely careful to make sure I haven’t created any infinite loops or anything else that would overwhelm the site I’m requesting from.)
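For anyone who wants a starting point before the notebook is up, here is a minimal sketch of that kind of page-by-page loop. It is not the code from the post: to keep it runnable without network access or extra installs it swaps in the standard library's html.parser for BeautifulSoup and takes the page fetcher as an argument. The names crawl and PageParser, the rel="next" link convention, and the max_pages and delay parameters are all assumptions for illustration; real sites will need their own selectors.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects text nodes and the href of the first <a rel="next"> link."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "next" and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url, fetch, max_pages=50, delay=1.0):
    """Follow "next" links from start_url and return a list of (url, text).

    fetch is any callable mapping a URL to an HTML string, for example
    lambda url: requests.get(url, timeout=10).text
    """
    seen = set()          # guards against link cycles (infinite loops)
    pages = []
    url = start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(fetch(url))
        pages.append((url, " ".join(parser.text_parts)))
        url = urljoin(url, parser.next_href) if parser.next_href else None
        if url:
            time.sleep(delay)  # pause between requests to go easy on the site
    return pages
```

The seen set and the max_pages cap are the two guards against looping forever; the delay keeps the request rate low.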
At that point, I just saved the pages as text files (pulling some chunks over to a validation set), and trained on them the same way as for other language models. One thing is that the language model is quite sensitive to the proportion of text, i.e. if I have 90% Merck manual and 10% patient case studies, the generated text will sound much more like a Merck manual.
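Since the mix matters that much, it can help to measure it before training. A rough sketch, assuming the scraped text is held as a dict from source name to a list of document strings; the function names, valid_frac, and the whitespace word count are my own choices, not anything from the original setup:

```python
import random


def source_proportions(corpora):
    """Return each source's share of the total whitespace-separated word count.

    corpora maps a source name to a list of document strings.
    """
    counts = {name: sum(len(doc.split()) for doc in docs)
              for name, docs in corpora.items()}
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def split_train_valid(docs, valid_frac=0.1, seed=0):
    """Shuffle documents and hold out a fraction as a validation set."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n_valid = max(1, int(len(docs) * valid_frac))
    return docs[n_valid:], docs[:n_valid]
```

If one source dominates more than you want, you can subsample it before writing out the training files.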
Some thoughts on what I’d like to try next: Adding a GAN (though I’m wondering how well the error signal can get back to the LSTM, since it’ll be through the language generation step. Maybe a CNN generator is better?). Training the LSTM language model to predict the next 4 words rather than 1 (possibly a crazy thought, but I’m wondering if I can make the model more accurate by making its training job more demanding). Finding a corpus of medical questions/answers (possibly USMLE? ideally something with shorter/easier/more uniform questions). Using the model to do text->keywords (i.e., give it a written radiology report and ask it to output a few labels). Another crazy thought, but I’m wondering if I could do a CycleGAN and go “medicalese” -> “medical language normal people could understand” -> “medicalese”. Lots of other crazy ideas (particularly trying to think if I can include images), but these are some starting points.
Let me know if anyone wants to try (any of the above, or any better ideas you might have!)
That’s amazing. I would like to try the same; please share the notebook for the web scraping. You should build an app after you finish modelling. It would be great for society.
Awesome work @mcleavey! @RanjeetSingh for BeautifulSoup it seems this tutorial may help: https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe. For my needs I also used requests, and I used http://lxml.de/ for scraping. A very short Google search will turn up many tutorials.
Thanks! I’ll put up a link soon for the web scraping. Just to clarify: I was never expecting that this medical text generator itself would be able to generate real medical info (it often sounds good, but is obviously not medically accurate/reliable). I’m hoping though to use this as a jumping-off point for other projects, and yes, I’m hoping that something will turn out to be useful for society.
On African Languages and Sentence Piece
– Since Africa has 2000+ languages, when I use “African” I probably mean “Bantu” languages, and by Bantu languages I’m really thinking of “Nguni” languages, which are pretty much the languages of Southern Africa, in particular languages with “clicks” –
Since there aren’t many NLP resources for many African languages, I’ve been using Sentencepiece and a lot of web scraping for my chosen language of Xhosa. If I can achieve great results I’d like to publish at the least a blog post on dealing with “minority” languages which don’t currently have language models, or great language models.
Integrating Sentencepiece into FastAI
So far this has been hit and miss for me. I’m able to tokenize the words (essentially what spaCy is doing). Sentencepiece provides two files after training on a corpus of text: a “model” file and a “vocab” file. The model file is used by sentencepiece to tokenize words into a list of IDs, and the “vocab” file seems to be a list of tokens and their float weights.
I’m not sure how to take my “vocab” file (list of tokens and associated weights) and pass it into FastAi.text as the expected “h5” file. For example, FastAi.text is using PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'. I’m unsure if I can instead plug in my vocab file with minimal changes to it. (I intend to look at the FastAi.text library to see how it uses weights, and maybe that will provide insight.) However, if anyone using sentencepiece has gone through this process already (cc @jeremy), I’d certainly appreciate your direction.
Memory with Sentencepiece and FastAI
Training sentencepiece on a document of about 500,000 words was fast and used little memory: less than a minute on my laptop (I know this isn’t specific enough). However, when I tried to train my model with learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1) on a similar document, I got out-of-memory errors on my laptop and also on an AWS p2.xlarge.
I tried halving my batch size (bs), which didn’t help. I’m not sure if I should keep the same embedding size, hidden size, and number of layers as is (em_sz,nh,nl = 400,1150,3), and whether that affects memory. In any case, I feel I’m way beyond my scope of expertise here.
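To get a feel for whether em_sz, nh and nl are the memory problem, here is a back-of-the-envelope parameter count. It is only a sketch under assumptions: a stack of LSTM layers sized em_sz -> nh -> ... -> em_sz, an embedding of vs x em_sz shared with the output layer, and PyTorch's nn.LSTM parameter layout (four gates, two bias vectors). The vocab size of 30,000 in the usage note is a made-up example value.

```python
def lstm_layer_params(n_in, n_hid):
    """Parameters in one LSTM layer: 4 gates of input and hidden weights,
    plus two bias vectors per gate (PyTorch's nn.LSTM layout)."""
    return 4 * n_hid * (n_in + n_hid) + 8 * n_hid


def lm_param_count(vs, em_sz, nh, nl):
    """Rough parameter count for an LSTM language model with tied weights."""
    sizes = [em_sz] + [nh] * (nl - 1) + [em_sz]
    lstm = sum(lstm_layer_params(sizes[i], sizes[i + 1]) for i in range(nl))
    embedding = vs * em_sz   # assumed shared with the decoder (tied weights)
    decoder_bias = vs
    return embedding + lstm + decoder_bias


def megabytes(n_params, bytes_per_param=4):
    """Size of the weights alone, assuming float32."""
    return n_params * bytes_per_param / 1e6
```

With vs=30000 and em_sz,nh,nl = 400,1150,3 this comes to roughly 32 million parameters, on the order of 130 MB of float32 weights. Under these assumptions the weights alone would not exhaust a GPU, so the activations (which grow with bs x bptt) and the size of the corpus held in RAM are worth checking too.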
If I get this to work, I want to start looking into speech-to-text. I’d like to crowdsource audio recordings of sentences from my corpus (which I scraped from Wikipedia, Twitter, the Xhosa Bible, and online Xhosa books) from hundreds of Xhosa speakers… I have a friend from Senegal whose mother only speaks Wolof and therefore can’t use voice commands on her phone. I think tech can be localized/diversified further if more people can use it easily in languages they are comfortable and more familiar with.
I think the weights in vocab are different from the ones in a language model. The fwd_wt103.h5 weights are from the pretrained LM, so we can use it as a backbone with a custom head for any new text (as long as we match the vocab). I haven’t read the details on sentencepiece but I have a feeling that you cannot just use it like we did with the language model. What I did to train the LM was to use only the tokens.
My initial try was to use sentencepiece to segment the text (in Chinese), then use the spaCy English tokenizer to tokenize it. The training seemed to work, since after segmentation the Chinese text has almost the same structure as English, with phrases separated by spaces. However, it’s definitely not kosher, as the English tokenizer converts a lot of things particular to English and ignores features in other languages. I’m fixing it by using the sentencepiece tokens directly now. I know @binga seems to have also used the “en” tokenizer from the lecture notes for Telugu initially, but I don’t know how the classification results turned out.
My memory usage with sentencepiece to train on a document of about 500,000 words was pretty low and fast. Less than a minute on my laptop (I know this isn’t specific enough). However when I tried to train my model with learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1) on a similar document I get out of memory errors on my laptop and also on an AWS p2.xlarge.
How many tokens do you have in total (after segmentation)? I had to reduce the tokens from 400M to 100M to make it work with 32GB RAM. I suspect it could be a similar problem for you.
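A quick way to answer that question and to cut a corpus down to a budget, assuming the encoded corpus is a list of id lists (one per document). The helper names are hypothetical, and bytes_per_id=8 assumes the ids end up as int64, which is only part of the total footprint:

```python
def count_tokens(encoded_docs):
    """Total number of token ids across all documents."""
    return sum(len(ids) for ids in encoded_docs)


def token_array_bytes(n_tokens, bytes_per_id=8):
    """Memory for the raw id array alone; 8 bytes assumes int64 ids."""
    return n_tokens * bytes_per_id


def keep_within_budget(encoded_docs, max_tokens):
    """Keep whole documents, in order, until the token budget is reached."""
    kept, total = [], 0
    for ids in encoded_docs:
        if total + len(ids) > max_tokens:
            break
        kept.append(ids)
        total += len(ids)
    return kept
```

For instance, keep_within_budget(trn_sentencepiece, 100_000_000) would keep documents until 100M tokens, with trn_sentencepiece standing in for whatever list of encoded documents you have.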
@sabzo, please please please write a blog post on how you did this when you are done.
Actually, everybody doing a non-Indo-European language please write a blog post on how you did this!
(I haven’t started the Esperanto yet, but I will, I will, really I will…)
The output of a tokenizer (which is what we are using sentencepiece as) is simply:
- A numericalized version of the corpus
- A vocab (in our code we call this itos since that’s what torchtext calls it)
The h5 file is entirely unrelated to this - the h5 file is a trained language model. It is trained from the numericalized corpus. So you need to take your sentencepiece ids, and train an LM using them.
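To make those two outputs concrete, here is a tiny pure-Python illustration of a vocab list (itos), its inverse (stoi), and a numericalized corpus. With sentencepiece the ids and pieces come from the trained model instead, so this is only a sketch of the idea; build_vocab, numericalize, and the "_unk_" placeholder at index 0 are illustrative choices.

```python
from collections import Counter


def build_vocab(tokenized_docs, max_vocab=60000, min_freq=1):
    """Return (itos, stoi): id -> token list and token -> id dict."""
    freq = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
    itos.insert(0, "_unk_")  # index 0 reserved for unknown tokens
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokenized_docs, stoi):
    """Replace every token with its id; unseen tokens map to 0 (_unk_)."""
    return [[stoi.get(tok, 0) for tok in doc] for doc in tokenized_docs]
```

The language model is then trained on the id lists only; itos is what lets you turn predictions back into text.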
I am working with a model for Finnish, and just started the training of a sentencepiece version. Haven’t seen results yet, so don’t know how well this will work… but at least the training got started.
This is what I did:
(1) First train the sentencepiece model [you have done that already, and you probably have a file called m.model somewhere]
(2) Run the following in the notebook
    import sentencepiece as spm
    sp = spm.SentencePieceProcessor()
    sp.Load("/home/.../m.model")
After that you can encode text into numbers:
sp.EncodeAsIds("some text in your local language")
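(3) You will need to get your sentencepiece tokenization into trn_lm using the encoder above… so something like

    trn_sentencepiece = []
    for i in range(0, len(df_trn)):
        # encode each document's text into a list of sentencepiece ids
        trn_sentencepiece.append(sp.EncodeAsIds(df_trn.iloc[i]['text']))
        if i % 10000 == 0:
            print(i)  # progress
    # lists have different lengths, so recent NumPy needs dtype=object
    trn_lm = np.array(trn_sentencepiece, dtype=object)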
(4) …then the same for validation set…
(5) and then train the model like in the notebook (I guess you should use the vocab size (vs) you asked sentencepiece to generate)
    trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
    val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
    md = LanguageModelData(PATH_LM, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)
    # etc.
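If it helps to see why the ids get concatenated into one long stream in step (5), here is a simplified pure-Python sketch of what a language-model loader does with that stream. This is not fastai's LanguageModelLoader code: it uses plain lists and a fixed bptt, and batchify and lm_batches are names I made up.

```python
def batchify(ids, bs):
    """Split one long id stream into bs parallel streams of equal length,
    dropping any leftover ids at the end."""
    n = len(ids) // bs
    return [ids[i * n:(i + 1) * n] for i in range(bs)]


def lm_batches(ids, bs, bptt):
    """Yield (inputs, targets) pairs; targets are inputs shifted by one token."""
    streams = batchify(ids, bs)
    n = len(streams[0])
    for start in range(0, n - 1, bptt):
        end = min(start + bptt, n - 1)
        x = [s[start:end] for s in streams]
        y = [s[start + 1:end + 1] for s in streams]
        yield x, y
```

Each batch holds bs x bptt ids, which is why those two settings, together with the total token count, drive memory use.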
This looks perfect - and almost certainly necessary to get good results in Finnish.
Just released a language model for Bangla trained on the Wikipedia corpus. Performance can certainly be improved, as I have barely scratched the surface.
See if @sgugger’s 1cycle params help…
I haven’t started the Esperanto corpus yet and was wondering how long it is taking people to train their models (and on what hardware). How long did it take for you all?
For me one epoch takes around 1 hour 10 minutes. My corpus is 380,000 documents from Wikipedia.
So far I have trained the models for 10-15 epochs.
Hardware: 1080 Ti, 32 GB RAM
Swahili : Progress so far.
I now have about 9.8 million words, with lots of nonsense from tesseract errors, mixed in with a number of single and double-letter OCR errors. I’ve used sentencepiece on a single file with 45560 words (calculated using cat file.txt | wc -w). I used a vocabulary of 30000, and the results (from sentences obtained from the same file) are promising. Now I just need to use the sentencepiece model as the source for my tokenizer. I’m writing this one up in a blog post because I’m quite happy with how things look so far.
Edit: I’ve just started a process to build a tokenizer from all 144 files.
Update: 231 files now, 10,439,411 words.
Just wondering: did you come across performance issues when running the above code? Based on the %time output, it should be a quick process. However, it took over 5 minutes or so to run it on sample datasets (1000 items for training; 100 items for validation).
Most of them but not all. My teacher did not recommend one of them to us. My favorite book is The Art of War (孫子兵法). As far as I know, a few MBA courses study it for business strategy as well.
Try %prun, and google for ‘python profiler’.
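Outside a notebook, the standard library's cProfile does the same job as %prun. A small sketch; profile_call is a hypothetical helper name, and the function you pass in would be whatever step is slow (for example the encoding loop):

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report of the slowest calls)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # sort by cumulative time so the most expensive call paths come first
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

Printing the returned report shows which functions the time is actually going to, which is usually more informative than a single %time number.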
Using AWS p3.2xlarge, each epoch takes just over an hour for 100 million+ tokens.
Note: AWS offers alert services, so I can get an email if CPU utilization drops below a certain level. Normally, that means something went wrong.