For isiXhosa I'm using tweets from South African news websites, so this might work well too. For PDF to text, maybe this will help: http://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Using%20Python%20to%20Convert%20PDFs%20to%20Text%20Files.php ? I'm also curious: have you looked into Congo and Kenya as well, to get a regional dataset of Swahili?
I was able to use sentencepiece to create a tokenizer for isiXhosa using a corpus of ~600,000 words from various Xhosa books, Xhosa Wikipedia and Xhosa tweets. However, I'm not sure how to plug the sentencepiece files into fastai training. Sentencepiece outputs a '.model' and a '.vocab' file. @shoof I'd be interested in how you incorporated sentencepiece into fastai. @jeremy would you have any suggestions I can try?
Also, Xhosa is a very agglutinative language; as a result my unique word count before using the sentencepiece tokenizer is around ~200,000. For my vocab I'm using a limit of 32,000 (I know Jeremy recommended 60,000). I wonder, though: am I losing information since my unique word count is so much larger?
Thanks, and I'd appreciate some help moving forward.
I tried pdfminer, but prefer the simpler scripting I have set up. See: extract.sh
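A Python equivalent of that kind of simple script might look like the sketch below. This is only a guess at the approach (it shells out to the pdftotext command-line tool from poppler-utils; it is not the actual extract.sh mentioned above, and the paths are illustrative):
import subprocess
from pathlib import Path

# Convert every PDF in a folder to a .txt file next to it,
# using the pdftotext command-line tool (poppler-utils).
pdf_dir = Path("pdfs")
for pdf in pdf_dir.glob("*.pdf"):
    txt = pdf.with_suffix(".txt")
    subprocess.run(["pdftotext", str(pdf), str(txt)], check=True)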
Use the sentencepiece python module to create a list of ids for each of your docs. fastai.text works with lists of ids (i.e. we're simply using sentencepiece to 'numericalize' for us).
No, you're not losing information - it's still there, just split into multiple tokens.
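To make that concrete, here is a small sketch of how a single long, agglutinative word gets segmented rather than dropped. The model path, the example word and the shown split are illustrative, not taken from the posts above:
import sentencepiece as spm

# Load a trained sentencepiece model (path is hypothetical).
sp = spm.SentencePieceProcessor()
sp.Load("xhosa.model")

# A word outside the 32k vocab isn't lost; it is split into several
# subword pieces, each of which has an id in the vocab.
word = "ndiyakuthanda"
print(sp.EncodeAsPieces(word))  # e.g. ['▁ndi', 'ya', 'ku', 'thanda'] - exact split depends on the trained model
print(sp.EncodeAsIds(word))     # the same pieces as integer ids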
Just to share a few first results on the wiki103 set. French does indeed seem to be an easier language for an LM!
I could still use learning rates up to 8, and for now I've done a cycle of length 2 and a cycle of length 4. For the cycle of length 4 I used half the dropouts you found best, Jeremy, and was still underfitting. The results are:
- cycle of length 2: 3.97 validation loss (51.4 in perplexity)
- cycle of length 4: 3.69 validation loss (40 in perplexity)
Note that it took barely three and a half hours to train that second model on a Paperspace P6000, so that's not too bad.
And to complete the loop, a cycle of length 10 gets an even better validation loss at 3.52 (33.78 perplexity), again with the dropouts halved.
I'm still slightly underfitting at the end of training (3.59 training loss), so the dropouts could probably be reduced even more.
FYI I think you should try wikitext-2, using the full vocab (wikitext-2 already has all freq<3 words replaced with UNK, so it shouldn't be too big). You might find you have something close to SoTA. If so, you could then implement the pointer-cache thing (just copy it from AWD-LSTM - they've already implemented it there) and you should find that you get SoTA. That would be pretty cool!
Oh also, look out for gradient clipping. I have it enabled in my sample code, but you may want to reduce it or remove it in these experiments, since it may get in the way of exploiting super-convergence.
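In the current fastai code that's just the clip attribute on the learner; the 0.3 below is the value from the course notebooks and is shown only as an example:
# Gradient clipping as enabled in the sample notebooks:
learner.clip = 0.3

# For super-convergence experiments you may want to loosen or remove it:
learner.clip = None  # disables clipping entirely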
Oh this is smaller, perfect for experimenting!
I'll try a bit of all of this.
This seems like great work. Can you please share how you collected and combined the data?
Thanks! I used two Python libraries, requests and BeautifulSoup, to loop through each page and extract the text and the link to the next page. requests handles the HTTP requests, and then BeautifulSoup gives you a way to extract parts of the page by tag. I can post a sample Jupyter notebook for that if you like. I don't know of a way to do this generally; I had to set up a loop for each overall site, one for Merck, one for MSF, etc. (Just to note: when I do something like this, I'm always extremely careful to make sure I haven't created any infinite loops or anything else that would overwhelm the site I'm requesting from.)
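A rough sketch of that kind of loop is below. The URL and the tag/class selectors are made up for illustration; each site needs its own version:
import time
import requests
from bs4 import BeautifulSoup

def scrape_site(start_url, out_path):
    url = start_url
    with open(out_path, "w", encoding="utf-8") as f:
        while url:
            resp = requests.get(url)
            soup = BeautifulSoup(resp.text, "html.parser")
            # Extract the main text of the page (tag choice is site-specific).
            for para in soup.find_all("p"):
                f.write(para.get_text() + "\n")
            # Follow the link to the next page, if any (selector is a guess).
            nxt = soup.find("a", {"class": "next"})
            url = nxt["href"] if nxt else None
            time.sleep(1)  # be polite: don't hammer the site

scrape_site("https://example.com/articles", "corpus.txt")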
At that point, I just saved the files as text files (pulling some chunks over to a validation set), and ran them similarly to other language models. One thing to note is that the language model is quite sensitive to the proportions of the text: if I have 90% Merck Manual and 10% patient case studies, the generated text will sound much more like a Merck manual.
Some thoughts on what I'd like to try next:
- Adding a GAN (though I'm wondering how well the error signal can get back to the LSTM, since it'll be through the language generation step. Maybe a CNN generator is better?).
- Training the LSTM language model to predict the next 4 words rather than 1 (possibly a crazy thought, but I'm wondering if I can make the model more accurate by making its training job more demanding).
- Finding a corpus of medical questions/answers (possibly USMLE? ideally something with shorter/easier/more uniform questions).
- Using the model to do text -> keywords (i.e. give it a written radiology report and ask it to output a few labels).
- Another crazy thought: wondering if I could do a CycleGAN and go 'medicalese' -> 'medical language normal people could understand' -> 'medicalese'.
- Lots of other crazy ideas (particularly trying to think whether I can include images), but these are some starting points.
Let me know if anyone wants to try (any of the above, or any better ideas you might have!)
That's amazing. I would like to try the same; please share the notebook for the web scraping. You should build an app after you finish modelling. It would be great for society.
Awesome work @mcleavey! @RanjeetSingh for BeautifulSoup it seems this tutorial may help: https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe. For my needs I also used requests, and I used http://lxml.de/ for scraping. A very short Google search will produce many tutorials.
Thanks! I'll put up a link soon for the web scraping. Just to clarify: I was never expecting that this medical text generator itself would be able to generate real medical info (it often sounds good, but is obviously not medically accurate/reliable). I'm hoping, though, to use this as a jumping-off point for other projects, and yes, I'm hoping that something will turn out to be useful for society.
On African Languages and Sentence Piece
* Since Africa has 2000+ languages, when I use 'African' I probably mean 'Bantu' languages, and by Bantu languages I'm really thinking of 'Nguni' languages, which are pretty much the languages of Southern Africa, in particular languages with 'clicks'.
Since there aren't many NLP resources for many African languages, I've been using sentencepiece and a lot of web scraping for my chosen language of Xhosa. If I can achieve great results I'd like to publish at least a blog post on dealing with 'minority' languages which don't currently have language models, or don't have great ones.
Integrating Sentencepiece into FastAI
This has so far been hit and miss for me. I'm able to tokenize the words (essentially what spaCy is doing). Sentencepiece produces two files after training on a corpus of text: a '.model' file and a '.vocab' file. The '.model' file is used by sentencepiece to tokenize text into a list of IDs, and the '.vocab' file seems to be a list of tokens and their float weights.
I'm not sure how to take my '.vocab' file (a list of tokens and associated weights) and pass it into fastai.text as the expected 'h5' file. For example, fastai.text uses PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'. I'm unsure whether I can plug in my vocab file instead, with minimal changes to it. (I intend to look at the fastai.text library to see how it uses weights, and maybe that will provide insight.) However, if anyone using sentencepiece has gone through this process already (cc @jeremy), I'd certainly appreciate your direction.
Memory with Sentencepiece and FastAI
Memory usage when using sentencepiece to train on a document of about ~500,000 words was pretty low, and training was fast: less than a minute on my laptop (I know this isn't specific enough). However, when I tried to train my model with learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1) on the same document, I got out-of-memory errors on my laptop and also on an AWS p2.xlarge.
I tried cutting my batch size (bs) in half, which didn't help. I'm not sure if I should keep the same embedding size, number of layers etc. as is: em_sz,nh,nl = 400,1150,3, and whether that affects memory. At any rate, I feel I'm well beyond my expertise here.
If I get this to work, I want to start looking into speech-to-text. I'd like to crowdsource audio recordings of sentences from my corpus (which I scraped from Wikipedia, Twitter, the Xhosa Bible and online Xhosa books) from hundreds of Xhosa speakers… I have a friend from Senegal whose mother only speaks Wolof and therefore can't use voice commands on her phone. I think tech can be localized/diversified much further if more people could use it easily in languages they are comfortable and familiar with.
I think the weights in the vocab file are different from the ones in a language model. The fwd_wt103.h5 weights are from the pretrained LM, so we can use it as a backbone with a custom head for any new text (as long as we match the vocab). I haven't read the details on sentencepiece, but I have a feeling you cannot just use it like we did with the language model. What I did to train the LM was to use only the tokens.
My initial try was to use sentencepiece to segment the text (in Chinese), then use the spaCy English tokenizer to tokenize it. The training seemed to work, since after segmentation the Chinese text has almost the same structure as English, where phrases are separated by spaces. However, it's definitely not kosher, as the English tokenizer converts a lot of things particular to English and ignores features of other languages. I'm fixing it by using the sentencepiece tokens directly now. I know @binga also seems to have used the 'en' tokenizer from the lecture notes for Telugu initially, but I don't know how the classification results turned out.
How many tokens do you have in total (after segmentation)? I had to reduce the tokens from 400M to 100M to make it work with 32GB RAM. I suspect it could be a similar problem for you.
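If it is a similar problem, one blunt way to cap the token count before building the data loaders is simply to truncate the concatenated id array. This is just an illustration, not necessarily how the reduction above was done; trn_lm, bs, bptt and LanguageModelLoader are the variables and class from the code later in this thread, and the 100M figure only mirrors the number mentioned above:
import numpy as np

all_ids = np.concatenate(trn_lm)   # flatten the per-document id lists
max_tokens = 100_000_000           # keep roughly the first 100M tokens
all_ids = all_ids[:max_tokens]

trn_dl = LanguageModelLoader(all_ids, bs, bptt)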
@sabzo, please please please write a blog post on how you did this when you are done.
Actually, everybody doing a non-Indo-European language please write a blog post on how you did this!
(I haven't started the Esperanto yet, but I will, I will, really I will…)
The output of a tokenizer (which is what we are using sentencepiece as) is simply:
- A numericalized version of the corpus
- A vocab (in our code we call this itos since that's what torchtext calls it)
The h5 file is entirely unrelated to this - the h5 file is a trained language model. It is trained from the numericalized corpus. So you need to take your sentencepiece ids, and train an LM using them.
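For example, the itos list can come straight from the sentencepiece processor; the float column in the '.vocab' file is just the segmentation score used by sentencepiece, not neural-network weights. A sketch (the model file name is illustrative):
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("m.model")

# id-to-string mapping for the whole sentencepiece vocab
itos = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]
stoi = {s: i for i, s in enumerate(itos)}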
(@binga @shoof) You shouldn't generally be using spacy at all if you're using sentencepiece.
I am working with a model for Finnish, and just started the training of a sentencepiece version. I haven't seen results yet, so I don't know how well this will work… but at least the training got started.
This is what I did:
(1) First train the sentencepiece model [you have done that already, and you probably have a file called m.model somewhere]
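For reference, the training call for step (1) might look roughly like this (the corpus file name and vocab size are just placeholders):
import sentencepiece as spm

# Produces m.model and m.vocab in the working directory.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=m --vocab_size=32000 --character_coverage=1.0"
)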
(2) Run the following in the notebook
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/home/.../m.model")
After that you can encode text into numbers:
sp.EncodeAsIds("some text in your local language")
(3) You will need to get your sentencepiece tokenization into trn_lm using the encoder above… so something like:
trn_sentencepiece = []
for i in range(0, len(df_trn)):
    trn_sentencepiece.append(sp.EncodeAsIds(df_trn.iloc[i]['text']))
    if i % 10000 == 0:
        print(i)
trn_lm = np.array(trn_sentencepiece)
(4) …then the same for the validation set…
(5) and then train the model like in the notebook (I guess you should use the vocab size (vs) you asked sentencepiece to generate)
trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH_LM, 1, vs, trn_dl, val_dl, bs = bs, bptt = bptt)
etc
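The "etc" presumably continues as in the course notebook, roughly like the sketch below (the optimizer, dropout values and learning rate are assumptions taken from the lesson code, not anything specific to the Finnish run):
from functools import partial
import numpy as np
import torch.optim as optim

opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
em_sz, nh, nl = 400, 1150, 3
wd = 1e-7
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.7

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.clip = 0.3
learner.fit(1e-3, 1, wds=wd, use_clr=(32, 2), cycle_len=1)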
This looks perfect - and almost certainly necessary to get good results in Finnish.