Language Model Zoo 🦍

A note to those folks building language models: there's no reason to go beyond 100 million tokens - in my experiments it didn't help. So if your corpus is bigger than that, remove some of the smaller articles (for instance) until the corpus is down to that size. Really large corpora are a pain to work with, and I didn't see any benefit from them.
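If it helps, here is a rough sketch of that kind of trimming - keeping the largest articles until a token budget is reached. The function name and the assumption that each article is already a list of tokens are just for illustration, not from the post above:

    def trim_corpus(articles, max_tokens=100_000_000):
        """Keep the largest articles until the total token count reaches max_tokens.

        `articles` is assumed to be a list of token lists, one per article.
        """
        kept, total = [], 0
        for art in sorted(articles, key=len, reverse=True):
            if total + len(art) > max_tokens:
                break
            kept.append(art)
            total += len(art)
        return kept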

11 Likes

To help you get started, here is the procedure for downloading data from Wikipedia.

  1. Go to Wikimedia https://dumps.wikimedia.org/

  2. Click on the "Database backup dumps" (WikiDumps) link. (It took me a while to figure out it is a link!)

  3. There will be a long list inside the WikiDump. In this example, I pick 'zh_yue' for Cantonese (a subset of Chinese) and download it. (Warning: some of the files are very big)

  4. Git clone WikiExtractor (https://github.com/attardi/wikiextractor)
    $ git clone https://github.com/attardi/wikiextractor.git

  5. Under the WikiExtractor directory, install it by typing
    (sudo) python setup.py install

  6. Syntax for extracting files into JSON format:
    WikiExtractor.py -s --json -o {new_folder_name} {wikidumps_file_name}
    (Note: the {new_folder_name} will be created during extraction;
    more download options are available in the WikiExtractor readme)
    Example: $ WikiExtractor.py -s --json -o cantonese zh_yuewiki-20180401-pages-meta-current.xml.bz2
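Once the extraction finishes, each output file contains one JSON object per article. Here is a minimal sketch for reading them back into Python (it assumes the default --json layout of subfolders like AA/wiki_00, with a "text" field on each line):

    import json
    from pathlib import Path

    def load_wiki_texts(extract_dir):
        """Yield (title, text) pairs from WikiExtractor --json output."""
        for fp in sorted(Path(extract_dir).glob('*/wiki_*')):
            with open(fp, encoding='utf-8') as f:
                for line in f:
                    article = json.loads(line)
                    if article['text'].strip():
                        yield article['title'], article['text']

    # e.g. texts = [t for _, t in load_wiki_texts('cantonese')]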

25 Likes

Hi everyone, I want to work on the Sanskrit language but I am not finding useful sources to download data from. Also, there isn't any suitable tokenizer that I know of as of now. Please guide me to appropriate resources if somebody knows of any.
Also, for the tokenization I am thinking of using SentencePiece, which @jeremy mentioned in Lesson 10. I have gone through the github page but I am unable to figure out how it works (I am not good with programming and command lines …:sweat_smile:). If anybody has tried it out, please shed some light on its usage.

I did a quick search and found this corpus - https://github.com/cltk/sanskrit_text_wikisource. It may give you a starting point. You may need additional texts.

https://github.com/cltk has corpora for other languages too, but I don't know if they're big enough for you.

1 Like

I used SentencePiece on simplified Chinese and it took me a few tries, but eventually it worked. I'd suggest installing the Python module, although the command-line tool is what the module calls under the hood (both work). I followed the Build and Install SentencePiece instructions and had no issues.

The first thing you need to do after finishing the installation is to train the SentencePiece model (a kind of language model? I didn't read the 3 papers cited, but maybe I should…). The 2 key things to prepare are:

  1. Input file.
    It is actually 1 text file that has all the content, 1 sentence per line. In my case, I used GNU cat to concatenate 836k files in 14 folders together, and it took less than a minute. The resulting text file was ~2GB.

  2. Vocabulary size
    This one is a bit tricky, as the number seems somewhat arbitrary. I had tried a lower number like 200~300 on a small file to test it out, but it didn't work. The vocab size cannot be too high either: anything higher than the total number of words (in my case, characters) you have won't work. It requires a relatively large corpus, and in my case I used 32000 (didn't try higher) for a bit over 10M sentences. The default input sentence limit is 10M, but I didn't bother changing it.

The training took about 5-10 min (I didn't time it), but it was very efficient: my RAM usage barely increased by 100MB and all CPU cores were at 100%.

Once the model is done, you can load it and use the segmenter to see results.
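In case it helps, here is a minimal sketch of that workflow with the Python module (the file names and vocab size are just placeholders for your own):

    import sentencepiece as spm

    # Train on a single text file with one sentence per line.
    spm.SentencePieceTrainer.Train(
        '--input=all_sentences.txt --model_prefix=zh_sp --vocab_size=32000'
    )

    # Load the trained model and segment some text.
    sp = spm.SentencePieceProcessor()
    sp.Load('zh_sp.model')
    print(sp.EncodeAsPieces('今天天氣很好'))  # subword pieces
    print(sp.EncodeAsIds('今天天氣很好'))     # token ids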

Good luck and let me know what you find! :grin:

3 Likes

Yes, and thanks to Jeremy's suggestion I finally got the 1st run of the Chinese language model through. It's still converging, so I'll update again once it stops :grinning: I'm sharing my pain here - all of this could have been avoided if I had asked Jeremy earlier about the optimal number of tokens :frowning_face:

I had 32GB of RAM and 400M tokens. Initially I tried to copy the files and follow the lecture notes, but it was painfully slow on my 1TB HDD, so I took Jeremy's advice and loaded everything into a dataframe instead. The steps are recorded here for folks who can't use an SSD right now.

After that I ran into a RAM issue: I couldn't load all the text using get_all. The process seemed to still be running, but htop told a different story.
The swap quickly got maxed out and basically it would just hang there forever in the Jupyter notebook. I then ran it as a script and it eventually threw an error.

So I had to modify the tokenization step from the lecture notes to save a .npy file for every batch. I thought that solved the problem and that it would be easy to append them into one big list, but my last file maxed out my RAM. I tried to use a tuple instead of a list of lists to reduce the memory overhead, and it improved speed (quite surprisingly, although I have no benchmark data), but it didn't help with loading my last file. I took this as an opportunity to dig deeper into the issue, and found the Python memory_profiler very helpful. Profiling showed that the .npy file took almost 1GB of RAM whereas it was only 180MB on disk!
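For anyone trying the same thing, here is a rough sketch of that save-per-batch idea. The function names, batch size, and the assumption that your tokenizer maps a list of strings to a list of token lists are placeholders, not the exact lecture-notes code:

    import numpy as np

    def tokenize_in_batches(texts, tokenize_fn, batch_size=100_000, prefix='tok'):
        """Tokenize in chunks and save each chunk to its own .npy file,
        so the full token list never has to sit in RAM at once."""
        paths = []
        for i in range(0, len(texts), batch_size):
            tokens = tokenize_fn(texts[i:i + batch_size])
            path = f'{prefix}_{i // batch_size:04d}.npy'
            np.save(path, np.array(tokens, dtype=object))
            paths.append(path)
        return paths

    # Later, load one chunk at a time instead of concatenating everything:
    # tokens = np.load('tok_0000.npy', allow_pickle=True)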

I was going to search for a more memory-efficient method in Python, but then decided to first train on 100M tokens, which still had ~35k unique tokens. I used StratifiedShuffleSplit on my dataframe, since the corpus I used has 14 imbalanced news classes. I'd imagine the split from the lecture notes would not give a good validation set on my data, since it's written for 2 balanced IMDB classes. Many people are using the Wikipedia corpus and I imagine they won't have such a problem. I'll report back once my first big run is done. Right now my val_loss is at 4.47 (accuracy 0.299) at epoch 12 (which took almost 48 hrs).
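For reference, a stratified split like that can be done along these lines (the file and column names here are placeholders for your own dataframe):

    import pandas as pd
    from sklearn.model_selection import StratifiedShuffleSplit

    df = pd.read_csv('news.csv')  # hypothetical corpus with 'text' and 'label' columns

    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
    train_idx, val_idx = next(sss.split(df['text'], df['label']))

    train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]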

7 Likes

Here is another website for Samskritam (Sanskrit) - https://sanskritdocuments.org/

1 Like

Don't know whether it works for you, but I found this tokenizer that supports multiple Indian languages.

3 Likes

In addition to recognizing special signs, how is this tokenizer different from using a regular string split and regex? (Update: the source code looks like it just does the tedious regex for you…)

I love all the neat ideas here! I've been working on a medical language model (I pulled a ton of text from the Merck Manual, Doctors Without Borders, and several other online sources). So far I have it set up so that I can prompt it with questions and it gives vaguely plausible (though obviously medically crazy) responses. I'm hoping to then use it as a base for transferring to other medical text problems.

Let me know if anyone else is working on something like this - I'd love to partner up and exchange ideas!

Here's what I have so far - the generated text when I ask the model questions:

Me: "What is the prevalence of malaria?"
Model:
…a number (see the image below.). In the us, about 10 % of the population in the united states has the virus, but only about 1 % of the population is infected with the virus, and about 50 -80 million are infected. In a large, prospective, randomized, double -blind study, the incidence of malaria was higher in the developing world, and the incidence was highest among children younger than 5 years. [4, 7, 6]…

"What is the treatment for ectopic pregnancy?"

…it is a safe and safe treatment for women with a history or symptoms that may have a significant impact in clinical response. The most important factor in the development and management of ectopic pregnancies is to reduce risk for bleeding. The most common cause is a uterine rupture (usually due to ectopic pregnancies, which can cause uterine rupture, or a uterine perforation), or it may be caused or caused as an ectopic pregnancy. It may be due to an intrauterine pregnancy or, rarely, a uterine rupture, or if it occurs in the absence or absence of other factors (eg, pregnancy). Women with a pelvic inflammatory disorder may be at risk for ectopic pregnancy, which may be associated with an increased incidence of bleeding in the fetus…

"What is the treatment for diabetes?"

…it has been shown that the use of oral calcium channel blockers (eg, spironolactone) is not effective for the treatment and treatment of patients with diabetes. In addition, patients with diabetes, especially in the setting where the patient has a history of hypertension, should be given a statin to treat diabetes mellitus. In addition, patients should be instructed about the need for long term use. The most important approach is to prevent and manage diabetes mellitus, which can lead not to control hypertension and hypertension, as is a combination of these therapies, including a combination with oral contraceptives, insulin, or both. The use and efficacy in the management is not yet established…

"What drugs are effective for treating pneumonia?"

…or prolonged course of treatment with penicillin or a beta -lactam drug, and those with an acute respiratory syndrome (ards). In addition to the use, they are not routinely given for treatment of acute exacerbations of acute exacerbations, but the optimal dose and dose is unknown. For example, the dose is increased by 50 % if the dose of a drug has been increased to 100 mg po once / day. The usual dose of oral dosing is 10 to 15 g / kg. If the dose of the drug is low, a dose is given at a dose as possible to achieve a maximum of 2…

21 Likes

Great idea! I haven't really delved into the language modeling section yet but this is a very useful and interesting application. Would love to try it out too and share ideas.

ICYMI, I came across this as a potential additional corpus to train on…it looks straightforward to scrape: http://mtsamples.com/
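If you do scrape it, something along these lines is a starting point (the sample URL is hypothetical and the page structure is a guess - inspect the site and check its terms first):

    import requests
    from bs4 import BeautifulSoup

    def fetch_page_text(url):
        """Download one page and return its visible text.
        A real scraper should target the transcription body specifically."""
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, 'html.parser')
        return soup.get_text(separator='\n', strip=True)

    # e.g. text = fetch_page_text('http://mtsamples.com/...')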

2 Likes

Thanks, that looks great. I'll definitely add it to the set. I'm debating switching back over to TensorFlow so that I could more easily turn this into an interactive webpage (as far as I know there's no way to do that in PyTorch?). In the meantime, I'll see about putting what I have so far up on GitHub.

1 Like

@mcleavey are you on twitter? I posted this and one of the top researchers in medical NLP was interested in learning more about it. Would love to loop you into the discussion directly.

1 Like

Yes - I'm also @mcleavey there. I'll take a look…

I'm unable to edit the wiki to add my name, but I've been working on a Xhosa language model since being inspired by Lesson 10. I found this thread by searching "sentencepiece" within the forums, and it seems like the perfect tool, since it's hard to find a stemmer, tokenizer (and other NLP concepts I'm now learning about) for Xhosa.

2 Likes

What problem are you having when trying to edit the wiki? I want to make sure that works OK for everyone.

That's great about Xhosa!

I was getting an error along the lines of "unable to tag 10 or more people", I'm guessing due to the 10+ names mentioned in the wiki using "@". However, I was able to edit just now, so perhaps it was temporary.

And thanks!

1 Like

It's really easy to make it into an interactive web page. Just create a flask (or whatever) endpoint in the usual way. A search for 'flask' on the forums should give you the info you need, but feel free to create a new thread about creating a web app if you get stuck or want more pointers.
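A bare-bones sketch of such an endpoint (predict_next_words here is a hypothetical wrapper around whatever generation loop your model exposes - substitute your own):

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def predict_next_words(prompt, n=100):
        # Placeholder: load your trained LM and generate n tokens after `prompt`.
        raise NotImplementedError

    @app.route('/generate', methods=['POST'])
    def generate():
        prompt = request.json.get('prompt', '')
        return jsonify({'completion': predict_next_words(prompt)})

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=5000)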

3 Likes

After you've created your language model, be sure to try a classification task (like we did with IMDB). Note that you don't need many examples in order to do classification once you have a language model, so it's fine to try with a small dataset.

Ideally, try to find a paper that's already done NLP classification in your language, so you can benchmark against them on the same dataset. If you can't find any examples (and be sure to search in both English and your language, since it may have been published in a local journal or an English language journal) then you can boast of being the first to release an NLP classifier in your language! So it's a win either way :slight_smile:

Finally, create a doc of some kind (readme in your github repo, medium post, academic paper, whatever you prefer) that shows the perplexity of your LM (ie np.exp(val_loss)) and the accuracy of your classifier.
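For example, using the val_loss reported earlier in this thread:

    import numpy as np

    val_loss = 4.47                 # validation cross-entropy loss
    perplexity = np.exp(val_loss)   # ≈ 87.4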

If you need any help either with resources to train your LM (you can use my AWS instance) or help with the process, just ask here!

8 Likes

Thanks, that sounds great. I'm out today, but I'll take a look tonight!