Language Model Zoo šŸ¦

This thread's objective is to discuss the ULMFiT implementations in different languages and share our roadblocks and approaches.

A word to newcomers

This thread is huge and can be overwhelming, but there is an easy way to get started.

How to contribute

If you want to participate, simply:

  • pick a language
  • start a new thread "ULMFiT - <language>",
  • copy what is currently in this message to your new thread,
  • inform the person who was previously working on the language that you want to participate,
  • place a link to this message,
  • and finally post a message to everyone at the bottom of this thread, so people can join.

There are still plenty of languages to tackle. So far we have beaten SOTA (AFAIK) for:
Thai, Polish, German, Indonesian, Hindi, Malay.
WIP:
French [high accuracy, no baseline yet], Portuguese [high accuracy, no baseline yet], Chinese.

Languages and people:

Arabic: bachir - article / notebook

Bengali: nahidalam, Aisha

Chinese (Simplified):

Chinese (Traditional): :panda_face: Moody - Paper

Czech: simecek - Weights

Danish: mollerhoj - Weights

Esperanto: ducky

Estonian: Priit

Filipino (Tagalog): Hadrian - Project

Finnish: Antti - Weights

French:

German:

Gujarati: disisbig

Hebrew: shgidi

Hindi: pandeyanil, nirantk, disisbig

Italian: RnD, DavideBoschetto

Indonesian: Cahya - source code

Japanese: Hiromi, Shun

Kannada: disisbig

Korean: yjp, sooheon - Paper to benchmark

Malay: previous conversation can be found here

Malayalam: jamsheer, disisbig

Medical: mcleavey

Music: mcleavey (generating music in the style of Mozart & Brahms)

Nepali: disisbig

Norwegian: mollerhoj - Weights

Persian: nasrin, bensums, insightfactory - weights (ULMFIT - Persian, Farsi)

Polish:

Portuguese: saulberardo, monilouise, NandoBr

Punjabi: disisbig

Russian: Pavel Pleskov - source code, Alexey Demyanchuk - ULMFiT - Russian

Sanskrit: Vishucyrus, pandeyanil (later), disisbig

Serbian:

Singlish: cedric

Spanish: :ox: lesscomfortable - source code, Adriana, William, German

Swahili: :lion: Brian

Swedish: mollerhoj - Weights

Tamil: ananda_seelan

Telugu: binga - Source Code

Thai:

Turkish:

Ukrainian: snakonechny

isiXhosa: Sabelo

This is a wiki; please add your name (via hyperlink, not @user_name) and the language you are working on, in alphabetical order. Feel free to form a group to discuss your language-specific problems as well.

Tips:

46 Likes

Currently Hindi, and later Sanskrit.

1 Like

I've made the top post a wiki so anyone can edit it.

2 Likes

@jeremy Technically speaking, Chinese is one language but with two sets of written characters (simplified and traditional). To be honest, I (and most of the people I know) use them interchangeably. Currently, I am using the same data set as @shoof and converted it to traditional characters via mafan. Should I train on traditional only, or both? Since it will take very long to train a model, your direction is highly appreciated.

3 Likes

I am not an expert at Chinese or at Chinese text encoding, but my husband is legitimately an expert at Asian script text encoding. I asked him, and he said that Simplified and Traditional are not just different in the shape of the characters - it's not like they are different fonts, it's more like different spelling. (Please forgive me if I am telling you something you already know.) For example, he says there are multiple cases where there are several Traditional characters which all map to the same Simplified character. (This is why the Unicode standard has two different sets of encodings for the characters.)

One thing he does not know is how different the "spelling" is between Simplified and Traditional. There are many characters where the Traditional and Simplified have the same Unicode encoding (like 人, "person", which is Unicode U+4EBA in Simplified and Traditional alike, and in Japanese and Korean and Vietnamese).

In English, the regional spelling differences are minor enough that I think (I could be wrong!) that usually people just train on "English" and don't worry about whether it is US or British or Australian or Indian English.

However, US/British/Australian/Indian/etc. all use characters from the same Unicode set - the same alphabet. A Latin "G" is Unicode U+0047, regardless of whether it's Australian or Canadian. However, the Traditional Chinese for the first symbol in "country" (國) is Unicode U+570B while the Simplified (国) is U+56FD. This means that whatever model you have is going to think that 國 and 国 are completely different words.
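Here is a minimal Python sketch to check those code points yourself:

    # The Traditional and Simplified forms of "country" really are distinct code points.
    trad, simp = '國', '国'
    print(hex(ord(trad)))   # 0x570b
    print(hex(ord(simp)))   # 0x56fd
    # A model will treat them as unrelated tokens unless the corpus is normalized to one script.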

Now, maybe mafan is clever enough to know all the mappings between Simplified and Traditional - I don't know, and it's late enough that I'm not going to download it and try to convert 國 to 国. If that is the case, then my non-Chinese-expert-self thinks it's probably reasonable to just train on one.

However, using a Simplified corpus which has been translated via mafan seems like it wouldn't buy you anything. If mafan is that good, then you could just use mafan to translate the Simplified on the input and output and you'd be done. If mafan is not that good, then I would think you would need to have a Traditional corpus.

It might also be - I don't know - that there will be subtle differences in the words used in Simplified and Traditional corpuses. Just like there were words in the IMDB database which were not in Wikitext103, maybe there is e.g. a minor celebrity star in Hong Kong whose name has unusual characters which are not used commonly in China. So I would think that if you want to do Traditional, you should get a Traditional corpus, not just translate a Simplified corpus.

My opinion, probably worth as much as you paid for it.

3 Likes

I am gonna work on Sanskrit… :slightly_smiling_face:

4 Likes

@Moody and I are both Chinese speakers (her natively, me poorly!) so we're familiar with the issue. It's an interesting one and your husband's analogy is a pretty good one. However any English analogy is going to have problems since this issue is fairly specific to logographic scripts.

In the case of 国 it's easy, since there's a clear 1-1 reversible mapping. The problem is that not all characters have that.

3 Likes

Yeah so… it's a shame you didn't go the other way, I think, since IIRC every traditional char maps to a single simplified char, but not vice versa. So that would be more reliable. You can map a simplified corpus to traditional, but because the mapping isn't unique, you need to use a language model (hah!) or at least n-gram frequencies to handle the ambiguities. According to opencc for example, they just map to the first match if there's an ambiguity. I don't know if hanziconv and mafan do the same thing - I wouldn't be surprised if that's all they did, unfortunately.

I don't think it's going to matter too much, however, if you end up with a slightly-wrong corpus. Simplified Chinese characters are designed such that ambiguities are unlikely to be a problem in understanding language. So I'd guess you'll be fine - but just be aware that the issue exists.
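If you want to gauge how lossy the conversion is on your own corpus, here is a minimal round-trip sketch, assuming mafan's simplify/tradify helpers (hanziconv or opencc could be swapped in):

    # Round-trip a Traditional sample: Traditional -> Simplified -> Traditional.
    # Characters that change expose the many-to-one mappings described above.
    from mafan import simplify, tradify    # assumed helpers from the mafan package

    sample = '這裡是一段繁體中文'              # any Traditional text from your corpus
    round_tripped = tradify(simplify(sample))

    diffs = [(a, b) for a, b in zip(sample, round_tripped) if a != b]
    print(f'{len(diffs)} of {len(sample)} characters changed: {diffs}')

If the count stays near zero on a decent-sized sample, a translated corpus is probably fine for pretraining.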

1 Like

FYI, my tokenization seemed to stall after a while when I used chunk size = 24k as in the notebook.
I'm using 5k instead and so far so good. Maybe worth noting in case you don't have a powerful machine like the one Jeremy has!

1 Like

The 1-1 mapping thing really bothered me for a while too! I think simplification actually removed the "soul" of the language. Once gone, it's not easy to get it back (maybe a NN mapping could be better?)!

1 Like

Well I love the simplified characters personally. I think it's an extraordinary linguistic project that was done very well.
Top Chinese writers throughout history have railed against the complexity of the character set. For a long time it was explicitly designed to maximize exclusivity of the educated class!

5 Likes

Bengali

@jeremy During tokenization, I've been reducing chunk sizes from 5k down to 2k now, and every time my RAM maxes out, all the CPU cores go "quiet" like this one; at the same time Swp (swap) increases, and my iteration seems "stalled" at this stage (iteration 321). I should have noticed it in my previous post suggesting chunk size = 5k. This image seems like a "death sentence"…
[htop screenshot: CPU cores idle, swap full]

If I wait longer (like the previous few times), I get an error message.
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I think @lesscomfortable had a similar problem, but he used chunk size 5k and perhaps a more powerful Paperspace instance.

Would you recommend anything for this case? I don't know much about garbage collection (reading up on it) or how the program keeps threads alive. Reducing the chunk size doesn't seem to solve the memory problem. I also thought about creating and enabling a swap file, but even when there is swap space left, the processes don't move forward.

Thank you.

1 Like

My machine had 32GB of RAM. I would suggest keeping on reducing the chunk size; eventually one will work (it did for me :slight_smile:). But also save your progress so you don't lose everything when (if) it crashes. I divided my trn_set into 12 parts and ran the tokenizer on each of them, thus saving my progress.

But do not give up. The training part works fine once you get past the tokenization.

2 Likes

Yeah, it's running out of memory - your swap is full. Maybe try running on fewer cores? (Or even just run on one core?) I'm not sure why it's using so much memory - I'm no expert at debugging Python memory issues. There are a few memory debugging tips around, e.g.

2 Likes

Thanks Jeremy. I tried the single-core version proc_all while increasing the chunk size, and it has the same issue of stalling once n_iter * chunk size * text size per chunk > RAM. I think the lists were just getting too big, so I'm working around the problem by doing what @lesscomfortable did: save the list every n iterations incrementally, and then concatenate everything at the end.
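In case it helps anyone else, here is a minimal sketch of that workaround. It assumes the fastai 0.7-style imdb notebook API (Tokenizer, partition_by_cores) and a hypothetical train.csv with the text in column 1; adjust for your own setup:

    # Chunked tokenization with incremental saves (fastai 0.7-style API assumed).
    import numpy as np
    import pandas as pd
    from fastai.text import *   # Tokenizer and partition_by_cores, as in the imdb notebook

    chunksize = 5000            # small chunks keep peak RAM low
    reader = pd.read_csv('train.csv', header=None, chunksize=chunksize)   # hypothetical file

    all_tok = []
    for i, df in enumerate(reader):
        texts = df[1].astype(str).tolist()
        tok = Tokenizer().proc_all_mp(partition_by_cores(texts))          # multi-core tokenization
        np.save(f'tmp/tok_{i:03d}.npy', np.array(tok, dtype=object))      # save progress every chunk
        all_tok += tok

    np.save('tmp/tok_all.npy', np.array(all_tok, dtype=object))

Writing each chunk to disk means a crash never costs more than one chunk, and concatenating at the end is cheap.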

4 Likes

Good afternoon,

I started attempting to train a language model for Korean, as I plan to classify toxic comments.
I am currently using KoNLPy for the tokeniser, but SentencePiece, suggested by Jeremy, looks interesting as well.
I will try with what I have at the moment first and update you. Thanks.

1 Like

A note to those folks building language models: there's no reason to go beyond 100 million tokens - in my experiments it didn't help. So if your corpus is bigger than that, remove some of the smaller articles (for instance) until the corpus is down to that size. Really large corpuses are a pain to work with, and don't have any benefits I saw.
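Here is a minimal sketch of that trimming step, assuming your articles are already tokenized into a list of token lists (the names here are placeholders): keep the largest articles and drop the smallest until the total is around 100 million tokens.

    # Keep the biggest articles until ~100M tokens are collected; the smallest get dropped.
    MAX_TOKENS = 100_000_000

    def trim_corpus(articles, max_tokens=MAX_TOKENS):
        """articles: list of token lists, one per article."""
        articles = sorted(articles, key=len, reverse=True)   # biggest first
        kept, total = [], 0
        for art in articles:
            if total >= max_tokens:
                break                                        # everything left is smaller; drop it
            kept.append(art)
            total += len(art)
        return kept, total

    # kept, total = trim_corpus(tokenized_articles)
    # print(f'kept {len(kept)} articles, {total:,} tokens')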

11 Likes

To help you get started, here is the procedure to download data from Wikipedia.

  1. Go to Wikimedia https://dumps.wikimedia.org/

  2. Click on the "Database backup dumps" (WikiDumps) link. (It took me a while to figure out it is a link!)

  3. There will be a long list inside the WikiDump. In this example, I pick 'zh_yue' for Cantonese (a subset of Chinese) and download it. (Warning: some of the files are very big.)

  4. Git Clone from WikiExtractor (https://github.com/attardi/wikiextractor)
    $ git clone https://github.com/attardi/wikiextractor.git

  5. Under the WikiExtractor directory, install it by typing
    (sudo) python setup.py install

  6. Syntax for extracting files into JSON format:
    WikiExtractor.py -s --json -o {new_folder_name} {wikidumps_file_name}
    (Note: the {new_folder_name} will be created during extraction;
    more options are available in the WikiExtractor readme. A sketch for loading the extracted JSON follows this list.)
    Example: $ WikiExtractor.py -s --json -o cantonese zh_yuewiki-20180401-pages-meta-current.xml.bz2
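Following on from step 6, here is a minimal sketch for loading the extracted output into Python. It assumes WikiExtractor's default layout (subfolders AA, AB, ... containing files like wiki_00, each holding one JSON object per line with id, url, title and text fields); the folder name matches the Cantonese example above:

    # Read WikiExtractor --json output into a single dataframe of articles.
    import json
    from pathlib import Path
    import pandas as pd

    docs = []
    for f in Path('cantonese').glob('*/wiki_*'):          # e.g. cantonese/AA/wiki_00, ...
        with f.open(encoding='utf-8') as fh:
            for line in fh:
                article = json.loads(line)
                docs.append({'title': article['title'], 'text': article['text']})

    df = pd.DataFrame(docs)
    print(len(df), 'articles')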

25 Likes

Hi everyone, I want to work on the Sanskrit language but I am not finding useful sources to download the data from. Also, there isn't any suitable tokenizer that I know of as of now. Please guide me to appropriate resources if you know of any.
Also, for the tokenization I am thinking of using SentencePiece, which @jeremy mentioned in Lesson 10. I have gone through the GitHub page but I am unable to figure out how it works (I am not good with programming and command lines… :sweat_smile:). If anybody has tried it out, please shine some light on its usage.
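From the README, the basic Python usage seems to be something like this; a rough sketch only, with placeholder file names and vocab size, so corrections are welcome:

    # Train a subword model on a plain-text corpus, then encode/decode with it.
    import sentencepiece as spm

    # One sentence (or article) per line in the input file.
    spm.SentencePieceTrainer.Train(
        '--input=sanskrit_corpus.txt --model_prefix=sanskrit_sp '
        '--vocab_size=30000 --character_coverage=1.0'
    )

    sp = spm.SentencePieceProcessor()
    sp.Load('sanskrit_sp.model')

    pieces = sp.EncodeAsPieces('some Sanskrit text here')   # subword pieces
    ids = sp.EncodeAsIds('some Sanskrit text here')         # integer ids
    print(pieces, ids)
    print(sp.DecodeIds(ids))                                # back to the original string

Vocab size and character coverage look like the main knobs to tune for an Indic script.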