Language Model Zoo 🦍

ramesh · April 11, 2018, 2:16pm

I did a quick search and found this corpus - https://github.com/cltk/sanskrit_text_wikisource. It may give you a starting point. You may need additional texts.

https://github.com/cltk has corpora for other languages too, but I don’t know if they’re big enough for you.

shoof · April 11, 2018, 4:18pm

I used sentencepiece on simplified Chinese and it took me a few tries but eventually it worked. I’d suggest installing the Python module, although the command-line tool is what the module calls (both work). I followed the Build and Install SentencePiece instructions and had no issues.

The first thing you need to do after finishing installation is to train the SentencePiece model (a kind of language model? I didn’t read the 3 papers cited but maybe I should…). The 2 key things to prepare are:

1. Input file.
It is a single text file that holds all the content, 1 sentence per line. In my case, I used GNU cat to concatenate 836k files in 14 folders, and it took less than a minute. The resulting text file was ~2GB.
2. Vocabulary size.
This one is a bit tricky as the number seems somewhat arbitrary. I tried a low number like 200~300 on a small file to test it out, but it didn’t work. The vocab size cannot be too high either, e.g. a value higher than the total number of words (in my case characters) you have wouldn’t work. You need a relatively large corpus to make it work, and in my case I used 32000 (didn’t try higher) for a bit over 10M sentences. The default input size is 10M but I didn’t bother changing it.
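
A minimal Python sketch of the concatenation step, for anyone who doesn't have GNU cat handy. The function name, the `*.txt` pattern, and the UTF-8 encoding are assumptions for illustration; the post itself only says that many files in several folders were joined into one file with 1 sentence per line.

```python
from pathlib import Path


def build_corpus(root: Path, out_path: Path, pattern: str = "*.txt") -> int:
    """Copy every non-empty line from files under `root` into `out_path`.

    Returns the number of lines written. Assumes each source file already
    holds one sentence per line and is UTF-8 encoded.
    """
    n_lines = 0
    out_resolved = out_path.resolve()
    with out_path.open("w", encoding="utf-8") as out:
        for path in sorted(root.rglob(pattern)):
            if path.resolve() == out_resolved:
                continue  # never read the output file back into itself
            with path.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:
                        out.write(line + "\n")
                        n_lines += 1
    return n_lines
```

Files are streamed line by line, so memory use stays flat regardless of corpus size.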

The training took about 5-10 min (I didn’t time it), and it was very efficient: my RAM usage increased by barely 100MB and all CPU cores were at 100%.

Once the model is done, you can load it and use the segmenter to see results.
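
For reference, a rough sketch of training and then loading the model through the Python module. The file names and the helper names are made up for illustration; the vocabulary size of 32000 is the value from the post. The import sits inside the function so the flag-building helper can be tried without the package installed.

```python
def build_train_args(input_path: str, model_prefix: str, vocab_size: int = 32000) -> str:
    """Build the flag string that SentencePieceTrainer accepts."""
    return f"--input={input_path} --model_prefix={model_prefix} --vocab_size={vocab_size}"


def train_and_segment(input_path: str, model_prefix: str, sample: str,
                      vocab_size: int = 32000) -> list:
    """Train a SentencePiece model on `input_path`, then segment `sample`."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.Train(build_train_args(input_path, model_prefix, vocab_size))

    sp = spm.SentencePieceProcessor()
    sp.Load(f"{model_prefix}.model")  # training writes <prefix>.model and <prefix>.vocab
    return sp.EncodeAsPieces(sample)
```

Calling `train_and_segment("corpus.txt", "zh", "some sentence")` would write `zh.model` and `zh.vocab` next to your script and return the list of pieces for the sample sentence.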

Good luck and let me know what you find!

shoof · April 11, 2018, 4:48pm

Yes, and thanks to Jeremy’s suggestion I finally got the 1st run of the Chinese language model through. It’s still converging so I’ll update again once it stops. I’m sharing my pain here; all of this could have been avoided if I had asked Jeremy earlier about the optimal # of tokens.

I had 32GB of RAM and 400M tokens. Initially I tried to copy the files and follow the lecture notes, but it was painfully slow on my 1TB HDD, so I took Jeremy’s advice and loaded everything into a dataframe instead. The steps were recorded here for folks who can’t use an SSD right now.

After that I ran into a RAM issue: I couldn’t load all the text using get_all. The process still seemed to be running, but when I checked htop, it looked like this:
[screenshot: htop output]
The swap quickly got maxed out and basically it would just hang there forever in the Jupyter notebook. I then ran it as a script and it eventually threw an error.

So, I had to modify the tokenization step from the lecture notes to make it save a .npy file for every batch. I thought that solved the problem and that it would be easy to append the batches into one big list, but loading my last file maxed out my RAM. I tried using a tuple instead of a list of lists to reduce the memory overhead, and it improved speed (quite surprisingly, although I have no benchmark data), but it didn’t help with loading my last file. I took this as an opportunity to dig deeper into the issue, and found the Python memory_profiler very helpful. It showed that the .npy file took almost 1GB of RAM whereas it was only 180MB on disk!
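
The post doesn't say why the in-memory size was so much larger than the file. One plausible contributor (an assumption, not something measured here) is that Python stores every token id as a separate object, while a packed array stores only the raw numbers. A small stdlib sketch of that difference, using made-up token ids:

```python
import sys
from array import array


def list_footprint(tokens: list) -> int:
    """Approximate bytes used by a list of ints: the list plus each int object."""
    return sys.getsizeof(tokens) + sum(sys.getsizeof(t) for t in tokens)


token_ids = list(range(1000, 101000))  # 100k illustrative ids, stored as Python ints
packed = array("i", token_ids)         # the same ids as fixed-width C ints

list_bytes = list_footprint(token_ids)
packed_bytes = sys.getsizeof(packed)   # includes the array's data buffer
print(f"list of ints: {list_bytes:,} bytes; packed array: {packed_bytes:,} bytes")
```

Exact numbers depend on the Python build, so none are quoted here. A NumPy array with a fixed-width integer dtype behaves like the packed array, which is why converting token lists to such an array before saving tends to help.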

I was going to search for a more memory-efficient method in Python, but then decided to first train on 100M tokens, which still had ~35k unique tokens. I used StratifiedShuffleSplit on my dataframe since the corpus I used has 14 imbalanced news classes. I’d imagine the split from the lecture notes would not give a good validation set on my data, since it’s designed for 2 balanced imdb classes. Many people are using Wikipedia corpora and I imagine they won’t have such a problem. I’ll report back once my first big run is done. Right now my val_loss is at 4.47 (accuracy 0.299) for epoch 12 (took almost 48 hrs).
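
To illustrate what a stratified split does, here is a plain-Python stand-in. It is not scikit-learn's StratifiedShuffleSplit, just a sketch of the same idea: shuffle within each class and send the same fraction of every class to the validation set. The function name, the 10% fraction, and the seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.1, seed=42):
    """Return (train_idx, val_idx) with roughly `val_frac` of each class in validation."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Keep at least one example per class in validation; note that a
        # class with a single example therefore ends up only in validation.
        n_val = max(1, round(len(idxs) * val_frac))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx
```

With imbalanced classes, a purely random split can leave the small classes almost absent from validation; splitting per class avoids that.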

pandeyanil · April 11, 2018, 6:22pm

Here is another website for Samskritam (Sanskrit) - https://sanskritdocuments.org/

jamsheer · April 11, 2018, 6:49pm

Don’t know whether it works for you, but I found this tokenizer that supports multiple Indian languages.

shoof · April 11, 2018, 6:59pm

Apart from recognizing special signs, how does this tokenizer differ from using a regular string split and regex? (update: the source code looks like it does the tedious regex for you…)

mcleavey · April 11, 2018, 11:31pm

I love all the neat ideas here! I’ve been working on making a medical language model (I pulled a ton of text from the Merck manual, Doctors Without Borders, and several other online sources). So far I have it set up so that I can prompt it with questions and it gives vaguely plausible (though obviously medically crazy) responses. I’m hoping to then use it as a base for transferring to other medical text problems.

Let me know if anyone else is working on something like this, I’d love to partner up & exchange ideas!

Here’s what I have so far-- the generated text when I ask the model questions:

Me: “What is the prevalence of malaria?”
Model:
…a number (see the image below.). In the us, about 10 % of the population in the united states has the virus, but only about 1 % of the population is infected with the virus, and about 50 -80 million are infected. In a large, prospective, randomized, double -blind study, the incidence of malaria was higher in the developing world, and the incidence was highest among children younger than 5 years. [4, 7, 6]…

“What is the treatment for ectopic pregnancy?”

…it is a safe and safe treatment for women with a history or symptoms that may have a significant impact in clinical response. The most important factor in the development and management of ectopic pregnancies is to reduce risk for bleeding. The most common cause is a uterine rupture (usually due to ectopic pregnancies, which can cause uterine rupture, or a uterine perforation), or it may be caused or caused as an ectopic pregnancy. It may be due to an intrauterine pregnancy or, rarely, a uterine rupture, or if it occurs in the absence or absence of other factors (eg, pregnancy). Women with a pelvic inflammatory disorder may be at risk for ectopic pregnancy, which may be associated with an increased incidence of bleeding in the fetus…

“What is the treatment for diabetes?”

…it has been shown that the use of oral calcium channel blockers (eg, spironolactone) is not effective for the treatment and treatment of patients with diabetes. In addition, patients with diabetes, especially in the setting where the patient has a history of hypertension, should be given a statin to treat diabetes mellitus. In addition, patients should be instructed about the need for long term use. The most important approach is to prevent and manage diabetes mellitus, which can lead not to control hypertension and hypertension, as is a combination of these therapies, including a combination with oral contraceptives, insulin, or both. The use and efficacy in the management is not yet established…

“What drugs are effective for treating pneumonia?”

…or prolonged course of treatment with penicillin or a beta -lactam drug, and those with an acute respiratory syndrome (ards). In addition to the use, they are not routinely given for treatment of acute exacerbations of acute exacerbations, but the optimal dose and dose is unknown. For example, the dose is increased by 50 % if the dose of a drug has been increased to 100 mg po once / day. The usual dose of oral dosing is 10 to 15 g / kg. If the dose of the drug is low, a dose is given at a dose as possible to achieve a maximum of 2…

daveluo · April 12, 2018, 12:10am

Great idea! I haven’t really delved into the language modeling section yet but this is a very useful and interesting application. Would love to try it out too and share ideas.

ICYMI, I came across this as a potential additional corpus to train on…looks straightforward to scrape: http://mtsamples.com/

mcleavey · April 12, 2018, 12:32am

Thanks, that looks great. I’ll definitely add it in to the set. I’m debating switching back over to tensorflow so that I could more easily just turn this into an interactive webpage (as far as I know there’s no way to do that in pytorch?). In the meantime though, I’ll see about putting what I have so far up on github.

jeremy · April 12, 2018, 1:59am

@mcleavey are you on twitter? I posted this and one of the top researchers in medical NLP was interested in learning more about it. Would love to loop you into the discussion directly.

mcleavey · April 12, 2018, 2:03am

Yes - I’m also @mcleavey there. I’ll take a look…

sabzo · April 12, 2018, 6:21am

I’m unable to edit the wiki to add my name, but I’ve been working on a Xhosa language model since being inspired by Lesson 10. I found this thread by searching “sentencepiece” within the forums and it seems like the perfect tool since it’s hard to find a stemmer, tokenizer (and other nlp concepts I’m now learning about) for Xhosa.

jeremy · April 12, 2018, 2:28pm

What problem are you having when trying to edit the wiki? I want to make sure that works OK for everyone.

That’s great about Xhosa!

sabzo · April 12, 2018, 2:41pm

I was getting an error along the lines of “unable to tag 10 or more people”, I’m guessing due to the 10+ names mentioned in the wiki using “@”. However I was able to edit just now, so perhaps it was temporary.

And thanks!

jeremy · April 12, 2018, 3:48pm

It’s really easy to make it into an interactive web page. Just create a flask (or whatever) endpoint in the usual way. A search for ‘flask’ on the forums should give you the info you need, but feel free to create a new thread about creating a web app if you get stuck or want more pointers.
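
As a rough illustration of the "or whatever" route, here is a tiny endpoint built only on Python's standard-library WSGI tools rather than Flask (a Flask app follows the same request-in, JSON-out shape). `generate_text` is a hypothetical placeholder for a call into your trained model, and the `prompt` query parameter and port are assumptions.

```python
import json
from urllib.parse import parse_qs, quote
from wsgiref.simple_server import make_server
from wsgiref.util import setup_testing_defaults


def generate_text(prompt: str) -> str:
    """Hypothetical stand-in: replace with a call into your trained model."""
    return f"(generated continuation of: {prompt})"


def app(environ, start_response):
    """WSGI app: GET /?prompt=... returns the generated text as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    prompt = query.get("prompt", [""])[0]
    body = json.dumps({"prompt": prompt, "text": generate_text(prompt)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_locally(prompt: str) -> dict:
    """Invoke the app in-process (no network), handy for a quick smoke test."""
    environ = {}
    setup_testing_defaults(environ)
    environ["QUERY_STRING"] = "prompt=" + quote(prompt)
    body = b"".join(app(environ, lambda status, headers: None))
    return json.loads(body)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Serve the app until interrupted (blocks the calling thread)."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Running `serve()` and opening `http://127.0.0.1:8000/?prompt=hello` in a browser should return the JSON response. The model would normally be loaded once at startup, not per request.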

jeremy · April 12, 2018, 3:53pm

After you’ve created your language model, be sure to try a classification task (like we did with IMDB). Note that you don’t need many examples in order to do classification once you have a language model, so it’s fine to try with a small dataset.

Ideally, try to find a paper that’s already done NLP classification in your language, so you can benchmark against them on the same dataset. If you can’t find any examples (and be sure to search in both English and your language, since it may have been published in a local journal or an English language journal) then you can boast of being the first to release an NLP classifier in your language! So it’s a win either way.

Finally, create a doc of some kind (readme in your github repo, medium post, academic paper, whatever you prefer) that shows the perplexity of your LM (i.e. np.exp(val_loss)) and the accuracy of your classifier.
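
A quick worked example of that conversion, using `math.exp` (equivalent to `np.exp` for a single number) and the val_loss of 4.47 reported earlier in this thread. It assumes the loss is the mean per-token cross-entropy in natural-log units, which is the usual convention.

```python
import math


def perplexity(val_loss: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(val_loss)


print(round(perplexity(4.47), 1))  # prints 87.4
```

So a validation loss of 4.47 corresponds to a perplexity of roughly 87.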

If you need any help either with resources to train your LM (you can use my AWS instance) or help with the process, just ask here!

mcleavey · April 12, 2018, 4:04pm

Thanks, that sounds great. I’m out today, but I’ll take a look tonight!

jeremy · April 12, 2018, 4:13pm

BTW there isn’t any AWD LSTM in Tensorflow that I can find, so doing it in Tensorflow would require reimplementing the whole thing there - which would be quite a big project!

mcleavey · April 12, 2018, 4:17pm

Ok, thanks. I have an ex-coworker who’s now at Google Brain & invited me to meet him there, so I’m also looking for a good tensorflow project to put together. This sounds like it might be too big a scale though.

jeremy · April 12, 2018, 4:19pm

If you’re working closely with someone who deeply understands the internals of tensorflow it should be OK. Implementing the various types of dropout is the only tricky bit. Doing it in pytorch required digging into the internals a bit. Other than that, it’s just a standard LSTM.
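
The post doesn't list the dropout variants. As far as I know, AWD-LSTM uses dropout on the embedding matrix, DropConnect on the hidden-to-hidden weights, and variational ("locked") dropout, where a single mask is reused at every time step. Below is a framework-free sketch of only the last one, on plain Python lists; the function name and defaults are assumptions, and a real implementation would operate on tensors.

```python
import random


def locked_dropout(sequence, p=0.5, seed=None):
    """Apply one dropout mask to every time step of `sequence`.

    `sequence` is a list of time steps, each a list of floats of equal length.
    Kept units are scaled by 1 / (1 - p) so the expected value is unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = random.Random(seed)
    width = len(sequence[0])
    # Sample the mask once per sequence, not once per time step.
    mask = [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in range(width)]
    return [[x * m for x, m in zip(step, mask)] for step in sequence]
```

Ordinary dropout would draw a fresh mask for each time step; sharing the mask means the same units are dropped across the whole sequence.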