Language Model Zoo 🦍

The issue is multiprocessing: your training dataset seems to be larger than what pickle can handle. Either make the dataset smaller, add additional chunking, or turn off multiprocessing and leave it running for a week.

I will train a language model for Turkish. I dumped the wiki articles and extracted them into json thanks to @Moody :slight_smile:

Now, I am wondering if there are any best practices for preprocessing the wiki text before the tokenization step.

For example, some trivial things I’ve come across are replacing \n\n with a single whitespace, and removing Section:::: markers and remaining HTML tags from the articles, since they are not language related but rather artifacts of Wikipedia’s structure.
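Those cleanup rules could be sketched roughly like this (a minimal sketch with hand-written regexes; fastai’s own `fix_html` rules differ, and the exact patterns here are my assumptions):

```python
import re

def clean_wiki_text(text):
    """Strip Wikipedia-structure artifacts before tokenization (rough sketch)."""
    text = re.sub(r"Section::::", "", text)   # section markers left by the dump
    text = re.sub(r"<[^>]+>", " ", text)      # leftover HTML tags like <br>, </ref>
    text = re.sub(r"\n\n+", " ", text)        # paragraph breaks -> single whitespace
    text = re.sub(r"[ \t]+", " ", text)       # collapse repeated spaces/tabs
    return text.strip()
```

The order matters a little: tags are removed before whitespace is collapsed, so a tag replaced by a space does not leave double spaces behind.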

Here is a sample text (SPOILER ALERT)

Albus Percival Wulfric Brian Dumbledore (1881-1997), J.K. Rowling tarafından yazılmış Harry Potter serisindeki bir kurgusal karakterdir.

Çok zeki, araştırmacı, sakin ve kendini duygularına kaptırmayan çok güçlü bir büyücüdür. Gençliğinde aşırı güç meraklısıdır, daha sonra daha mantıklı davranmaya karar verir. (Kibar, biraz ilginç ve güçlü yapısıyla tipik iyi büyücü özelliklerini taşımaktadır.) Harry Potter'ın sorunlarını anlayışla karşılamasıyla ona diğer öğretmenlerden daha 'iyi' davrandığı söylenebilir. Herkes tarafından sevilen ve sayılan bir büyücü olan Dumbledore, Lord Voldemort'un korktuğu yegane büyücüdür. Sihir otoritelerinin genel kanısına göre de gelmiş geçmiş en güçlü büyücüdür. Dumbledore'un yaşamı 116 yıl sürmüştür. Altıncı kitapta (Harry Potter ve Melez Prens) Severus Snape tarafından Avada Kedavra lanetiyle öldürülen Dumbledore, 1944-1997 tarihleri arasında Hogwarts Cadılık ve Büyücülük Okulu'nun müdürlüğünü yapmıştır.

Uzun ve ince olarak betimlenen Dumbledore'un uzun saç ve sakalları vardır. Ünlü büyücünün, mavi gözleri, çok uzun ve kancalı bir burnu ve uzun parmakları vardır. Yarım ay çerçeveli gözlükleri ve şaşaalı cübbesi ilk göze çarpan şeylerdir. Sol dizinin üstünde Londra metro'sunun haritasını gösteren, bir düellodan kalma bir yara izi vardır. Dumbledore'un Çikolatalı Kurbağa kartına göre oda müziği ve on lobutlu bowlingden hoşlanmaktadır. 1945'te kara büyücü Grindelwald'u yenmesi ejderha kanının 12 ayrı konuda kullanılışını bulması ve arkadaşı Nicholas Flamel ile simya konusunda yürüttüğü çalışmalarla ünlüdür. Sihirli ya da sihirsiz bütün şekerli yiyeceklere karşı bir zaafı vardır. Ofisini koruyan heykelin şifresini de genellikle bu tatlı isimlerinden seçer. Ancak, Bertie Bott 1930 doğumlu olduğu için, Dumbledore'un "gençlik" ile neyi kastetitiği anlaşılamamıştır. En sevdiği tatlar ise, Böğürtlen ve Marmelat'tır. Dumbledore, aynı zamanda bir örgü meraklısıdır. Ayrıca yazar J.K Rowling' in yaptığı açıklamaya göre kendisi eşcinseldir ve Gellert Grindelwald'a aşıktır.

Yazarın böylesine bilge bir kişiye "Albus Dumbledore" ismini vermesi rastgele yapılmış bir seçim değildir. Albus, Latince "beyaz" anlamına gelir ve "bilgelik" ile "aydınlanmayı" temsil eder. Dumbledore ise "yabanarısı" (İngilizce "bumblebee") anlamına gelmekle yazar tarafından özellikle seçilmiştir çünkü İngilizce'de "bumble around", "etrafta dikkatsizce gezinmek" demektir. Yazar Dumbledore' u yaratırken onun Hogwarts koridorlarında dolaştığını hayal ettiği için bu fiille ilintili bir isim seçmiştir.

**Section::::Karakter gelişimi.**
Dumbledore'un bir kız kardeşi ve bir erkek kardeşi vardır

Are there any best practices to follow for this dataset?

Thanks! :smiley:

def fix_html() doesn’t seem to account for all HTML tags. For example:

 ('<nowiki>', 7815),
 ('<br>', 7165),
 ('<BR>', 582),
 ('</div>', 572),
 ('<onlyinclude>', 555),
 ('</onlyinclude>', 539),
 ('<br \\n>', 461),
 ('<li>', 447),
 ('</ref>', 445),
 ('<noinclude>', 194),
 ('<ENTER>', 59),
 ('</noinclude>', 58),
 ('</poem>', 54),
...

P.S. Only 6,305/328,830 articles contain such HTML tags, so I can simply discard them. But I was just curious to know what more experienced language modelers do as a best practice.
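For what it’s worth, a tag count list like the one above can be produced with a small audit helper (a sketch; the regex is an assumption and will also catch non-HTML angle-bracket strings such as <ENTER>):

```python
import re
from collections import Counter

def count_html_tags(texts):
    """Tally leftover HTML-like tags across a corpus, most common first."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"</?\w+[^>]*>", text))
    return counts.most_common()
```

Running it over the article list shows which tags survive fix_html, which makes it easier to decide between extending the cleanup rules and discarding the affected articles.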

1 Like

Are there any forum threads explaining how to deal with GPU memory issues while training language models from scratch? I am not able to fit batches with bs > 4, and fp16 doesn’t seem to work in this case.

Right now I am testing stuff with a single 2080 Ti.

Hi,

I attempted to train a Bulgarian language model using the Wikipedia corpus, following the Telugu model notebook, and thought I’d share some stats and results below. Gradient stops runs after 12 hrs, so I did 3×5 epochs instead of 15. It seems to still be underfitting; I will try playing with reducing dropout to address this. Currently looking for suitable data for classification. Not sure where to find benchmark numbers for Bulgarian language models.

Bulgarian wiki stats:
• Number of documents: 251,877
• Number of words: 53,811,583
• Number of unique tokens: 3,273,966

Results:

[image: training results]

Hi Everyone,

I forgot to post to this thread a while back when I had a go at applying fastai v1 and ULMFiT to Greek.

The results on the only “recent” Greek-language NLP classification task I could find were better than those in the corresponding paper. I intend to update the code to include recent changes to the API and the forward-and-backward approach from the ULMFiT paper.

1 Like

I made a Norwegian bokmål/nynorsk language model.
Perplexity: 22.3
https://github.com/AugustIndal/Norwegian-no-nn-ULMFiT-language-model

Hi! I trained a model on the Finnish Wikipedia without SentencePiece and got some OK results. We are using the model in our hospital biobank to classify patient smoking-status sentences etc., with really good results. The idea is to also train the model on our own patient dictation text (we have about 20 gigabytes of it and of course cannot release it anywhere).

I’m really interested in trying out SentencePiece to rerun this, especially when we train the actual model on our own text. What kind of results did you get? If you have a working SentencePiece version already, could you try the classification from here, so maybe we could compare results?

If you can’t find a benchmark, think about creating one. You could build it from some movie reviews, or predict the genre or book author from text fragments. You can make a baseline using the NBSVM presented by Jeremy: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
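For a quick sense of what such a baseline looks like, here is a rough NB-SVM-style sketch: logistic regression trained on features scaled by the naive-Bayes log-count ratio. The function names are made up for illustration, and details differ from the kernel linked above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def nbsvm_fit(texts, y):
    """Fit an NB-SVM-style binary classifier on raw texts (sketch)."""
    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(texts).sign()          # binarized term counts
    y = np.asarray(y)
    p = X[y == 1].sum(axis=0) + 1                # smoothed positive-class counts
    q = X[y == 0].sum(axis=0) + 1                # smoothed negative-class counts
    r = np.log((p / p.sum()) / (q / q.sum()))    # naive-Bayes log-count ratio
    clf = LogisticRegression(max_iter=1000).fit(X.multiply(r), y)
    return vec, np.asarray(r), clf

def nbsvm_predict(model, texts):
    """Predict labels for new texts with a fitted nbsvm_fit model."""
    vec, r, clf = model
    X = vec.transform(texts).sign()
    return clf.predict(X.multiply(r))
```

Despite its simplicity, this kind of linear baseline is hard to beat on small text-classification sets, which makes it a useful sanity check before fine-tuning a language model.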

Unfortunately, perplexity alone does not tell you much about how the model will perform on classification. Could you try testing it on some existing benchmark?

We will have high-level command-line tools to do just that. I’ve made a preliminary attempt to create them, but we still need a data layer that detects the dataset format automatically. I will keep you posted.

Hi @t-v.

I downloaded the latest French Wikipedia corpus, and with the following code, which counts the number of tokens (mostly words) per text file in the docs folder created thanks to nlputils.py, I got about 492 million tokens.

If I understand your post (and Jeremy’s) correctly, I should keep only 100 million tokens in docs (i.e., a number of articles whose total comes to 100 million tokens) before creating my LM databunch.

That means deleting a lot of training data. Can you confirm the process to follow? Thanks.

dest = path/'docs'
files = dest.ls()
num_tokens = 0

# Count whitespace-separated tokens across all text files in docs/
for i, f in enumerate(files):
    words = open(f, 'r', encoding='utf8').read()
    num_tokens += len(words.split())
print(f'{i + 1} files, {num_tokens} tokens')
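If the goal is indeed to keep only ~100 million tokens, one possible continuation is to shuffle the files and accumulate them until the budget is reached (a hypothetical helper, not something from nlputils.py):

```python
import random

def sample_files(files, max_tokens=100_000_000, seed=42):
    """Keep a random subset of article files totalling at most max_tokens words."""
    files = list(files)
    random.Random(seed).shuffle(files)   # fixed seed for a reproducible subset
    kept, total = [], 0
    for f in files:
        n = len(open(f, 'r', encoding='utf8').read().split())
        if total + n > max_tokens:
            break                        # stop at the first file that overflows
        kept.append(f)
        total += n
    return kept, total
```

Shuffling before accumulating avoids keeping only the alphabetically (or chronologically) first articles, which could bias the language model.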
3 Likes

If you are looking for a labeled dataset in US English, UK English, French, German, or Japanese to train and test your ULMFiT classifier, you can download the Amazon Customer Reviews datasets.

If you need help, I published a guide on downloading them.

1 Like

And ULMFiT has superior results on the CLS dataset: https://arxiv.org/pdf/1909.04761.pdf

2 Likes

Is anyone willing to share pretrained (English) ULMFiT or MultiFiT LM weights with the SentencePiece tokenizer?

Update:
I trained it myself: https://www.kaggle.com/manyregression/fastai-en-wiki-500kk-pretrained-sp

There is also more in the versions of this notebook (100kk, i.e. 100 million, tokens; AWD-LSTM weights):
https://www.kaggle.com/manyregression/sp-wikitext-vocab-lm-ipynb?scriptVersionId=27995530

Another question: there’s no point in using the pretrained https://s3.amazonaws.com/fast-ai-modelzoo/wt103-fwd if I chose SentencePiece, right?

Correct. You need consistent indices and tokens for encoding (training) and decoding (inference).
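A toy illustration of why mixing tokenizers breaks things: embedding rows are indexed by vocabulary position, so the same index points at different tokens under a spaCy-style vocab and a SentencePiece-style vocab (both vocab lists below are made up):

```python
# Hypothetical vocabularies: row 1 of the pretrained embedding matrix
# would be reused for a completely different token after switching tokenizers.
spacy_vocab = ['the', 'cat', 'sat']
sp_vocab = ['▁the', '▁c', 'at']

idx = 1
assert spacy_vocab[idx] != sp_vocab[idx]  # 'cat' vs '▁c'
```

So pretrained weights are only reusable together with the exact vocabulary (and tokenizer) they were trained with.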

Funny, but I got slightly worse results when I fine-tuned the pretrained spaCy weights with SP and then trained a classifier: https://www.kaggle.com/manyregression/fastai-ulmfit-google-quest-classifier-spacy?scriptVersionId=27771121

Any ideas why the English ULMFiT regression model pretrained on 500kk (500 million) wiki tokens failed, while the 100kk one just gave worse results?

Here’s the 500kk version: https://www.kaggle.com/manyregression/fastai-ulmfit-google-quest-sp?scriptVersionId=28040078

For 100kk, the Spearman metric was 0.26 at best.

Hi, I built a Persian language model.
Here is the topic:

Hi, I’m interested in knowing about your work. I’m a PhD student at Tehran University.

Could someone guide me on how to implement MultiFiT for a new language (Persian)?

This is the notebook.

It reads a pretrained model for Japanese, but I guess there is no such model for Persian. Also, I don’t know the format of the models. I found a pretrained model for Persian at the following link,

however, I don’t know whether that model fits the project above.

I was so glad to have Ines and Matt presenting in person about the new features of spaCy v3.0. Highlights include the data pipeline that stores all the configs and hyperparameters in one place, and integrations with other popular open-source tools (such as Weights & Biases and FastAPI). My favorite feature is the ability to build in (i.e., hard-code) your own acronyms for specific domains or use cases. Enjoy!