Language Model Zoo 🦍

My machine has 32GB of RAM. I’d suggest you keep reducing the chunk size; eventually one will work (it did for me :slight_smile:). But also save your progress so you don’t lose everything when (if) it crashes. I divided my trn_set into 12 parts and ran the tokenizer on each of them, saving my progress as I went.

But don’t give up. The training part works fine once you get past the tokenization step.


Yeah, it’s running out of memory: your swap is full. Maybe try running on fewer cores? (Or even just on one core?) I’m not sure why it’s using so much memory, and I’m no expert at debugging Python memory issues, but there are a few memory debugging tips around, eg
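For example, Python’s built-in tracemalloc module can show where the memory is going (a minimal sketch, not specific to anyone’s notebook here; the list comprehension is just a stand-in for the tokenization work):

```python
import tracemalloc

# Start tracing allocations, run the suspect code, then inspect the peak.
tracemalloc.start()
chunks = [list(range(10_000)) for _ in range(100)]  # stand-in for tokenization
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

# The top allocation sites point at the code worth optimizing.
for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
    print(stat)
tracemalloc.stop()
```

Running this inside the stalling step (rather than on the toy list above) should show which line is actually holding the memory.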


Thanks Jeremy. I tried the single-core version proc_all while increasing the chunksize, and it has the same issue of stalling once n_iter * batch size * text size per batch > RAM. I think the lists were just getting too big, so I’m working around the problem by doing what @lesscomfortable did: saving the list incrementally every n iterations, then concatenating everything at the end.
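The save-every-n-iterations trick can be sketched like this (function and file names are made up for illustration; pass in whatever tokenizer you’re actually using):

```python
import pickle

def tokenize_in_parts(texts, tokenize, n_parts=12, prefix="tok_part"):
    """Tokenize in parts, writing each part to disk so a crash only
    loses the part in progress instead of everything."""
    part_size = (len(texts) + n_parts - 1) // n_parts
    for i in range(n_parts):
        part = [tokenize(t) for t in texts[i * part_size:(i + 1) * part_size]]
        with open(f"{prefix}_{i}.pkl", "wb") as f:
            pickle.dump(part, f)

def load_parts(n_parts=12, prefix="tok_part"):
    """Concatenate the saved parts back into one token list."""
    all_toks = []
    for i in range(n_parts):
        with open(f"{prefix}_{i}.pkl", "rb") as f:
            all_toks.extend(pickle.load(f))
    return all_toks
```

If a run dies, you only re-tokenize from the last saved part onward.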


Good afternoon,

I started training a language model for Korean, as I plan to classify toxic comments.
I am currently using Konlpy for the tokenizer, but sentencepiece, suggested by Jeremy, looks interesting as well.
I will try with what I have at the moment first and update you. Thanks.


A note to those folks building language models: there’s no reason to go beyond 100 million tokens; in my experiments it didn’t help. So if your corpus is bigger than that, remove some of the smaller articles (for instance) until the corpus is down to that size. Really large corpora are a pain to work with, and I saw no benefit from them.
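The trim-the-smaller-articles idea can be sketched like this (a hypothetical helper, with a crude whitespace split standing in for your real token count):

```python
def trim_corpus(articles, max_tokens=100_000_000):
    """Keep the largest articles until roughly `max_tokens` tokens remain.
    Token counts here are a crude whitespace split; swap in your tokenizer."""
    kept, total = [], 0
    for art in sorted(articles, key=lambda a: len(a.split()), reverse=True):
        n = len(art.split())
        if total + n > max_tokens:
            continue  # over budget; smaller articles may still fit
        kept.append(art)
        total += n
    return kept, total
```

Sorting largest-first means the small articles are the ones that get dropped, matching the advice above.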


To help you to get started, here is the procedure to download data from Wikipedia.

  1. Go to Wikimedia

  2. Click on the “Database backup dumps” (WikiDumps) link. (It took me a while to figure out it is a link!)

  3. There will be a long list inside the WikiDump. In this example, I pick ‘zh_yue’ for Cantonese (a subset of Chinese) and download it. (Warning: some of the files are very big)

  4. Git clone the WikiExtractor repo:
    $ git clone

  5. In the WikiExtractor directory, install it by typing
    (sudo) python setup.py install

  6. Syntax for extracting files into json format: python WikiExtractor.py -s --json -o {new_folder_name} {wikidumps_file_name}
    (Note: the {new_folder_name} will be created during extraction;
    more options are available in the WikiExtractor readme)
    Example: $ python WikiExtractor.py -s --json -o cantonese zh_yuewiki-20180401-pages-meta-current.xml.bz2


Hi everyone, I want to work on the Sanskrit language, but I’m not finding useful sources to download data from. I also don’t know of any suitable tokenizer as of now. Please guide me to appropriate resources if you know of any.
Also, for tokenization I’m thinking of using Sentencepiece, which @jeremy mentioned in Lesson 10. I have gone through the github page, but I am unable to figure out how it works (I am not good with programming and command lines… :sweat_smile:). If anybody has tried it out, please shed some light on its usage.

I did a quick search and found this corpus - it may give you a starting point, though you may need additional texts. It has corpora for other languages too, but I don’t know if that’s big enough for you.


I used sentencepiece on simplified Chinese, and it took me a few tries but eventually it worked. I’d suggest installing the Python module, although the command-line tool is what the module calls under the hood (both work). I followed the Build and Install SentencePiece instructions and had no issues.

The first thing you need to do after finishing installation is to train the SentencePiece model (a kind of language model? I didn’t read the 3 papers cited, but maybe I should…). The 2 key things to prepare are:

  1. Input file.
    This is actually one text file with all the content, one sentence per line. In my case, I used GNU cat to concatenate 836k files across 14 folders, which took less than a minute. The resulting text file was ~2GB.

  2. Vocabulary size.
    This one is a bit tricky, as the number seems somewhat arbitrary. I tried a low number like 200~300 on a small file to test it out, but it didn’t work. The vocab size can’t be too high either; anything higher than the total number of words (in my case, characters) you have won’t work. You need a relatively large corpus to make it work; in my case I used 32000 (didn’t try higher) for a bit over 10M sentences. The default input size is 10M, and I didn’t bother changing it.

The training took about 5-10 min (I didn’t time it), and it was very efficient: my RAM usage barely increased by 100MB, and all CPU cores were at 100%.

Once the model is done, you can load it and use the segmenter to see results.

Good luck and let me know what you find! :grin:


Yes, and thanks to Jeremy’s suggestion I finally got the first run of the Chinese language model through. It’s still converging, so I’ll update again once it stops :grinning: I’m sharing my pain here; all of this could have been avoided if I had asked Jeremy earlier about the optimal number of tokens :frowning_face:

I had 32GB of RAM and 400M tokens. Initially I tried to copy the files and follow the lecture notes, but it was painfully slow on my 1TB HDD, so I took Jeremy’s advice and loaded everything into a dataframe instead. The steps are recorded here for folks who can’t use an SSD right now.

After that I ran into a RAM issue: I couldn’t load all the text using get_all. The process seemed to still be running, but when I checked htop, it looked like this:
The swap quickly got maxed out, and it would basically just hang there forever in the Jupyter notebook. I then ran it as a script, and it eventually threw an error.

So I had to modify the tokenization step from the lecture notes to save a .npy file for every batch. I thought that solved the problem and that it would be easy to append them into one big list, but my last file maxed out my RAM. I tried using a tuple instead of a list of lists to reduce the memory overhead; it improved speed (quite surprisingly, though I have no benchmark data) but didn’t help with loading my last file. I took this as an opportunity to dig deeper into the issue, and found the Python memory_profiler package very helpful. The following screenshot showed that the .npy file took almost 1GB of RAM even though it was only 180MB on disk!!!
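For what it’s worth, the blowup comes from each Python int being a full object (roughly 28 bytes) once the array is turned into a list. Keeping tokens as int32 numpy arrays and concatenating once avoids both the per-int overhead and the giant intermediate list. A rough sketch (file names made up):

```python
import numpy as np

def save_token_batches(batches, prefix="tok"):
    """Save each tokenized batch as an int32 .npy file
    (4 bytes per token instead of ~28 bytes per Python int)."""
    paths = []
    for i, batch in enumerate(batches):
        path = f"{prefix}_{i:04d}.npy"
        np.save(path, np.asarray(batch, dtype=np.int32))
        paths.append(path)
    return paths

def load_token_batches(paths):
    """np.concatenate allocates the final array once, instead of
    growing a Python list batch by batch."""
    return np.concatenate([np.load(p) for p in paths])
```

Loading with `np.load(path, mmap_mode='r')` is another option when even the concatenated array is too big to hold in RAM.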

I was going to search for a more memory-efficient method in Python, but then decided to first train on 100M tokens, which still had ~35k unique tokens. I used StratifiedShuffleSplit on my dataframe, since my corpus has 14 imbalanced news classes. I imagine using the lecture notes as-is on my data would not give a good validation set, since they’re written for the 2 balanced imdb classes. Many people are using the wikipedia corpus, and I imagine they won’t have this problem. I’ll report back once my first big run is done. Right now my val_loss is 4.47 (accuracy 0.299) at epoch 12 (took almost 48 hrs).
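In case it helps anyone else with an imbalanced corpus, the idea behind StratifiedShuffleSplit is just to sample the validation set per class. A stdlib sketch of the concept (not the sklearn implementation):

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_frac=0.1, seed=42):
    """Return (train_idx, valid_idx) so every class appears in the
    validation set in proportion to its overall frequency."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, valid_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_valid = max(1, round(len(idxs) * valid_frac))
        valid_idx.extend(idxs[:n_valid])
        train_idx.extend(idxs[n_valid:])
    return train_idx, valid_idx
```

With a plain random split, a rare class can end up entirely absent from the validation set; sampling per class guarantees every class is represented.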


Here is another website for Samskritam (Sanskrit) -


Don’t know whether it works for you, but I found this tokenizer that supports multiple Indian languages.


Beyond recognizing special signs, how does this tokenizer differ from a regular string split and regex? (Update: the source code looks like it does the tedious regex work for you…)
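For comparison, a bare-bones regex tokenizer is only a line or two; what dedicated tokenizers add is the pile of language-specific rules (clitics, abbreviations, those special signs) layered on top of a pattern like this:

```python
import re

# Split into word-ish runs and individual punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def naive_tokenize(text):
    return TOKEN_RE.findall(text)

# The naive pattern breaks the English clitic apart:
print(naive_tokenize("Don't split here!"))  # → ['Don', "'", 't', 'split', 'here', '!']
```

A proper tokenizer would keep “Don’t” together (or split it as “Do” + “n’t”), which is exactly the kind of per-language logic that’s hard to get from a regex alone.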

I love all the neat ideas here! I’ve been working on making a medical language model (I pulled a ton of text from the Merck Manual, Doctors Without Borders, and several other online sources). So far I have it set up where I can prompt it with questions and it gives vaguely plausible (though obviously medically crazy) responses. I’m hoping to then use it as a base for transferring to other medical text problems.

Let me know if anyone else is working on something like this, I’d love to partner up & exchange ideas!

Here’s what I have so far-- the generated text when I ask the model questions:

Me: “What is the prevalence of malaria?”
…a number (see the image below.). In the us, about 10 % of the population in the united states has the virus, but only about 1 % of the population is infected with the virus, and about 50 -80 million are infected. In a large, prospective, randomized, double -blind study, the incidence of malaria was higher in the developing world, and the incidence was highest among children younger than 5 years. [4, 7, 6]…

“What is the treatment for ectopic pregnancy?”

…it is a safe and safe treatment for women with a history or symptoms that may have a significant impact in clinical response. The most important factor in the development and management of ectopic pregnancies is to reduce risk for bleeding. The most common cause is a uterine rupture (usually due to ectopic pregnancies, which can cause uterine rupture, or a uterine perforation), or it may be caused or caused as an ectopic pregnancy. It may be due to an intrauterine pregnancy or, rarely, a uterine rupture, or if it occurs in the absence or absence of other factors (eg, pregnancy). Women with a pelvic inflammatory disorder may be at risk for ectopic pregnancy, which may be associated with an increased incidence of bleeding in the fetus…

“What is the treatment for diabetes?”

…it has been shown that the use of oral calcium channel blockers (eg, spironolactone) is not effective for the treatment and treatment of patients with diabetes. In addition, patients with diabetes, especially in the setting where the patient has a history of hypertension, should be given a statin to treat diabetes mellitus. In addition, patients should be instructed about the need for long term use. The most important approach is to prevent and manage diabetes mellitus, which can lead not to control hypertension and hypertension, as is a combination of these therapies, including a combination with oral contraceptives, insulin, or both. The use and efficacy in the management is not yet established…

“What drugs are effective for treating pneumonia?”

…or prolonged course of treatment with penicillin or a beta -lactam drug, and those with an acute respiratory syndrome (ards). In addition to the use, they are not routinely given for treatment of acute exacerbations of acute exacerbations, but the optimal dose and dose is unknown. For example, the dose is increased by 50 % if the dose of a drug has been increased to 100 mg po once / day. The usual dose of oral dosing is 10 to 15 g / kg. If the dose of the drug is low, a dose is given at a dose as possible to achieve a maximum of 2…


Great idea! I haven’t really delved into the language modeling section yet but this is a very useful and interesting application. Would love to try it out too and share ideas.

ICYMI, I came across this as a potential additional corpus to train on…looks straightforward to scrape:


Thanks, that looks great. I’ll definitely add it to the set. I’m debating switching back over to tensorflow so that I could more easily turn this into an interactive webpage (as far as I know there’s no way to do that in pytorch?). In the meantime, I’ll see about putting what I have so far up on github.


@mcleavey are you on twitter? I posted this and one of the top researchers in medical NLP was interested in learning more about it. Would love to loop you into the discussion directly.


Yes - I’m also @mcleavey there. I’ll take a look…

I’m unable to edit the wiki to add my name, but I’ve been working on a Xhosa language model since being inspired by Lesson 10. I found this thread by searching “sentencepiece” within the forums, and it seems like the perfect tool, since it’s hard to find a stemmer, tokenizer (or other NLP tools I’m now learning about) for Xhosa.


What problem are you having when trying to edit the wiki? I want to make sure that works OK for everyone.

That’s great about Xhosa!