Language Model Zoo 🦍

This sounds like a straightforward ULMFiT problem, if I understand you correctly. My guess would be the approach is:

  1. Build or use a pre-existing language model (like Wikitext103)
  2. Transform your dataset from [id, sentence, emotion] to [0, sentence] because you train your language model on unlabeled data. Also split it into train/validation
  3. Use the new dataset to finetune (load the LM weights from 1, retrain). Save the model, save the encoder
  4. Load the encoder and train a classifier with your [id, sentence, emotion] dataset (since the emotion is the label)
  5. Use predict to write your [id, emotion] target. You have to map the ids somehow.
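
Putting steps 2 to 5 into rough fastai v1 code (just a sketch, not tested on your data; the file names 'sentences.csv', 'labeled.csv' and 'ft_enc', the column names and the hyperparameters are all placeholders):

from pathlib import Path
from fastai.text import *   # TextLMDataBunch, TextClasDataBunch, learners, AWD_LSTM

path = Path('data')

# 2-3: fine-tune the LM on the unlabeled sentences, then save the encoder
data_lm = TextLMDataBunch.from_csv(path, 'sentences.csv', text_cols='sentence')
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
lm.fit_one_cycle(1, 1e-2)
lm.save_encoder('ft_enc')

# 4: train the classifier on [sentence, emotion], reusing the LM vocab and encoder
data_clas = TextClasDataBunch.from_csv(path, 'labeled.csv', text_cols='sentence',
                                       label_cols='emotion', vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('ft_enc')
clf.fit_one_cycle(1, 1e-2)

# 5: predict the emotion for a new sentence; map the predictions back to your ids
pred_class, pred_idx, probs = clf.predict("some new sentence")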

Also note that this is multilabel classification and not binary as in most default examples. Check out the documentation or the RNN video from 2019 (lesson 3 iirc) and the corresponding notebook.

I was able to create a pt-br LM and have saved the model .pth and the itos.pkl.
Now I want to classify a different corpus using my pretrained language model. I was not able to reproduce the IMDB notebook because it does not show how to load a model; it assumes you are working with English and downloads the pretrained WikiText-103 English LM.

Is there a notebook showing how to classify using your pretrained lm?

I would like to ask: do you create translation models based on language models, such as German to English, like we have in Google Translate? Would that also be a sub-purpose of this thread?

Hi,
I loaded the model weights (.pth) & itos.pkl from a German LM into my LM learner like this:
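
Roughly the following, in fastai v1 (the file names 'de_wt103' and 'de_itos' are placeholders; the .pth and .pkl files go into the models folder under the databunch path, and the exact arguments may depend on your fastai version):

from fastai.text import *

# data_lm: a databunch built from my own corpus
# 'de_wt103' -> models/de_wt103.pth, 'de_itos' -> models/de_itos.pkl
learn = language_model_learner(data_lm, AWD_LSTM,
                               pretrained_fnames=['de_wt103', 'de_itos'],
                               drop_mult=0.3)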


You train the LM, then save the encoder part. Then you set up your classifier (as described in the course v3 IMDB notebook), load your LM encoder into it and classify:

learn = text_classifier_learner(data, AWD_LSTM, pretrained=False, drop_mult=0.05)
learn.load_encoder("your_LM_encoder_file")


Is there a repository for the models? I am training a bacterial genome language model that was shared by @KarlH and it seems that I am getting somewhere.

The model did not do very well on a small sample of genomes, but increasing the number of genomes from a couple of dozen to a few thousand made a difference. This model might turn out to be useful for bioinformatics after all. But boy, is it training slowly… it is like watching paint dry :slight_smile:


Hey Guys.

I'm training a Portuguese LM, and I'm getting this error.

The error occurs when I try to use the Tokenizer…

**OverflowError: cannot serialize a bytes object larger than 4 GiB**

Does anyone know how I can correct this? On forums and Stack Overflow it seems like this error should only occur on Python versions below 3.4, but I'm using 3.6.8 and can't get rid of it…

Here is the code:

import numpy as np
from fastai.text import *   # Tokenizer, partition_by_cores, BOS, FLD

def get_texts(df, number_of_labels=1):
    # labels - all zeros for now?!
    labels = df.iloc[:, range(number_of_labels)].values.astype(np.int64)

    # first column: mark the beginning of the text
    texts = f'\n{BOS} {FLD} 1 ' + df[number_of_labels].astype(str)

    # remaining columns
    for i in range(number_of_labels + 1, len(df.columns)):
        texts += f' {FLD} {i - number_of_labels} ' + df[i].astype(str)

    texts = texts.values.astype(str)

    # split the texts into per-core chunks, then tokenize them
    cores_p = partition_by_cores(texts, 1)
    tok = Tokenizer(lang='pt')
    ########################################################################
    tok = tok.process_all(cores_p)  ## This line causes the error !!!!
    ########################################################################
    return tok, list(labels)

The Linux distribution I'm using:

NAME="Red Hat Enterprise Linux Server"
VERSION="7.2 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="7.2"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.2
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.2"

Python version

Python 3.6.8 :: Anaconda custom (64-bit)

Stacktrace

Traceback (most recent call last):
  File "/home/joasilva/anaconda3/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/joasilva/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

Can anyone help me with this?

Thanks

Is there a pre-trained language model using the Transformer or TransformerXL architecture available for English? If not, I am planning to train one and share it. I was wondering what preprocessing was used to train the AWD_LSTM that is available as the default pre-trained model. Also, does anyone know some good training parameters to start from? I was thinking about just going with 10 epochs of fit one cycle and adding a bit to the momentum.

@Kaspar suggested some hyperparameters for training TransformerXL in this thread:
https://forums.fast.ai/t/training-transformerxl/40104/15


Thanks, unfortunately I don’t have access to this topic

I see. @cduguet trained a Spanish Wikipedia language model on TransformerXL

… and here is an attempt I ran recently doing the same for German Wikipedia on TransformerXL; a couple of hyperparameters are evident from both GitHub examples.

The issue is multiprocessing: your training dataset seems to be larger than what pickle can handle. Either make the dataset smaller, add additional chunking, or turn off multiprocessing and leave it running for a week.
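
For the chunking option, something along these lines might work (a sketch assuming the fastai v1 Tokenizer; the function name and chunk_size are my own and should be tuned to your machine's memory):

from fastai.text import Tokenizer

def tokenize_in_chunks(texts, lang='pt', chunk_size=50000):
    # tokenize the corpus piece by piece so that no single object handed to a
    # worker process exceeds pickle's 4 GiB limit
    tok = Tokenizer(lang=lang)
    all_tokens = []
    for start in range(0, len(texts), chunk_size):
        chunk = list(texts[start:start + chunk_size])
        all_tokens += tok.process_all(chunk)
    return all_tokens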

I will train a language model for Turkish. I dumped the wiki articles and extracted them into json thanks to @Moody :slight_smile:

Now I am wondering if there are any best practices for preprocessing the wiki text before the tokenization step.

For example, some trivial things I've come across are replacing \n\n with a single whitespace, and removing Section::: markers as well as the remaining HTML tags from the articles, since they are not language related but rather related to Wikipedia's structure.
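
Concretely, I mean something like this (a rough sketch; clean_wiki_text is my own helper and the exact patterns are just first guesses):

import re

def clean_wiki_text(text):
    text = re.sub(r'Section:{3,}', ' ', text)     # drop Section::: / Section:::: markers
    text = re.sub(r'<[^>]+>', ' ', text)          # strip leftover HTML tags
    text = text.replace('\n\n', ' ')              # replace blank lines with a single whitespace
    return re.sub(r'\s{2,}', ' ', text).strip()   # squeeze repeated whitespace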

Here is a sample text (SPOILER ALERT)

Albus Percival Wulfric Brian Dumbledore (1881-1997), J.K. Rowling tarafından yazılmış Harry Potter serisindeki bir kurgusal karakterdir.

Çok zeki, araştırmacı, sakin ve kendini duygularına kaptırmayan çok güçlü bir büyücüdür. Gençliğinde aşırı güç meraklısıdır, daha sonra daha mantıklı davranmaya karar verir. (Kibar, biraz ilginç ve güçlü yapısıyla tipik iyi büyücü özelliklerini taşımaktadır.) Harry Potter'ın sorunlarını anlayışla karşılamasıyla ona diğer öğretmenlerden daha 'iyi' davrandığı söylenebilir. Herkes tarafından sevilen ve sayılan bir büyücü olan Dumbledore, Lord Voldemort'un korktuğu yegane büyücüdür. Sihir otoritelerinin genel kanısına göre de gelmiş geçmiş en güçlü büyücüdür. Dumbledore'un yaşamı 116 yıl sürmüştür. Altıncı kitapta (Harry Potter ve Melez Prens) Severus Snape tarafından Avada Kedavra lanetiyle öldürülen Dumbledore, 1944-1997 tarihleri arasında Hogwarts Cadılık ve Büyücülük Okulu'nun müdürlüğünü yapmıştır.

Uzun ve ince olarak betimlenen Dumbledore'un uzun saç ve sakalları vardır. Ünlü büyücünün, mavi gözleri, çok uzun ve kancalı bir burnu ve uzun parmakları vardır. Yarım ay çerçeveli gözlükleri ve şaşaalı cübbesi ilk göze çarpan şeylerdir. Sol dizinin üstünde Londra metro'sunun haritasını gösteren, bir düellodan kalma bir yara izi vardır. Dumbledore'un Çikolatalı Kurbağa kartına göre oda müziği ve on lobutlu bowlingden hoşlanmaktadır. 1945'te kara büyücü Grindelwald'u yenmesi ejderha kanının 12 ayrı konuda kullanılışını bulması ve arkadaşı Nicholas Flamel ile simya konusunda yürüttüğü çalışmalarla ünlüdür. Sihirli ya da sihirsiz bütün şekerli yiyeceklere karşı bir zaafı vardır. Ofisini koruyan heykelin şifresini de genellikle bu tatlı isimlerinden seçer. Ancak, Bertie Bott 1930 doğumlu olduğu için, Dumbledore'un "gençlik" ile neyi kastetitiği anlaşılamamıştır. En sevdiği tatlar ise, Böğürtlen ve Marmelat'tır. Dumbledore, aynı zamanda bir örgü meraklısıdır. Ayrıca yazar J.K Rowling' in yaptığı açıklamaya göre kendisi eşcinseldir ve Gellert Grindelwald'a aşıktır.

Yazarın böylesine bilge bir kişiye "Albus Dumbledore" ismini vermesi rastgele yapılmış bir seçim değildir. Albus, Latince "beyaz" anlamına gelir ve "bilgelik" ile "aydınlanmayı" temsil eder. Dumbledore ise "yabanarısı" (İngilizce "bumblebee") anlamına gelmekle yazar tarafından özellikle seçilmiştir çünkü İngilizce'de "bumble around", "etrafta dikkatsizce gezinmek" demektir. Yazar Dumbledore' u yaratırken onun Hogwarts koridorlarında dolaştığını hayal ettiği için bu fiille ilintili bir isim seçmiştir.

**Section::::Karakter gelişimi.**
Dumbledore'un bir kız kardeşi ve bir erkek kardeşi vardır

Are there any best practices to follow for this dataset?

Thanks! :smiley:

The fix_html() rule doesn't seem to account for all HTML tags. For example:

 ('<nowiki>', 7815),
 ('<br>', 7165),
 ('<BR>', 582),
 ('</div>', 572),
 ('<onlyinclude>', 555),
 ('</onlyinclude>', 539),
 ('<br \\n>', 461),
 ('<li>', 447),
 ('</ref>', 445),
 ('<noinclude>', 194),
 ('<ENTER>', 59),
 ('</noinclude>', 58),
 ('</poem>', 54),
...

P.S. Only 6,305 of 328,830 articles have such HTML tags, so I can simply discard them. But I was just curious to know what more experienced language modelers do as a best practice.
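
One alternative to discarding them that I am considering is adding a custom pre-rule next to fix_html when building the Tokenizer (a sketch; the regex and the rm_html_tags name are mine, not fastai defaults):

import re
from fastai.text import *   # Tokenizer, defaults

def rm_html_tags(t):
    # remove leftover tags such as <nowiki>, <br>, </ref>, <onlyinclude>
    return re.sub(r'</?[a-zA-Z][^>]*>', ' ', t)

tok = Tokenizer(lang='tr', pre_rules=defaults.text_pre_rules + [rm_html_tags])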


Are there any forum threads pointing out how to deal with GPU memory issues while training language models from scratch? I am not able to fit with bs > 4, and fp16 doesn't seem to work in this case.

Right now I am testing stuff with a single 2080ti.

Hi,

I made an attempt to train a Bulgarian language model using the Wikipedia corpus, following the Telugu model notebook. Thought I'd share some stats and results below. Gradient stops after 12 hrs, so I did 3x5 epochs instead of 15. It still seems to be underfitting; I will try playing with reduced dropout to address this. Currently looking for suitable data for classification. Not sure where to find benchmark numbers for Bulgarian language models.
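
In case it helps anyone, splitting the epochs across sessions can look roughly like this (a sketch; the checkpoint names, drop_mult and learning rate are illustrative, and data_lm is assumed to be the Wikipedia databunch built earlier):

from fastai.text import *

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(5, 1e-3)          # session 1 (Gradient stops after ~12 hrs)
learn.save('bg_wiki_lm_5')

# next session: rebuild the learner and continue from the checkpoint
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.load('bg_wiki_lm_5')
learn.fit_one_cycle(5, 1e-3)          # session 2
learn.save('bg_wiki_lm_10')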

Bulgarian wiki stats:
• Number of documents: 251,877
• Number of words: 53,811,583
• Number of unique tokens: 3,273,966

Results:

[screenshot of training results]

Hi Everyone,

I forgot to post to this thread a while back when I had a go at applying fastai v1 and ULMFiT to Greek.


The results on the only “recent” Greek-language NLP classification task I could find were better than those in the corresponding paper. I intend to update the code to include recent changes to the API and the forward-and-backward approach from the ULMFiT paper.


Made a Norwegian Bokmål/Nynorsk language model.
Perplexity: 22.3
https://github.com/AugustIndal/Norwegian-no-nn-ULMFiT-language-model.

Hi! I trained a model on Finnish Wikipedia without SentencePiece and got some OK results. We are using the model in our hospital biobank to classify patient smoking sentences etc., and got really good results. The idea is to also train the model on our own patient dictation text (we have about 20 gigabytes of it and of course cannot release it anywhere).

I'm really interested in trying out SentencePiece to rerun this, especially when we train the actual model on our own text. What kind of results did you get? If you already have a working SentencePiece version, could you try the classification from here so maybe we could compare results?

If you can't find a benchmark, think about creating one, either from some movie reviews, or by predicting the genre or book author from text fragments. You can make a baseline using the NBSVM presented by Jeremy: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
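
A minimal NB-SVM-style baseline sketch (my own condensed take on that notebook, for a binary label: TF-IDF features scaled by naive-Bayes log-count ratios, then a logistic regression; the nbsvm_fit/nbsvm_predict names are mine):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def nbsvm_fit(train_texts, y):
    # y: numpy array of 0/1 labels
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3)
    x = vec.fit_transform(train_texts)
    p = (x[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)   # feature mass in positive docs
    q = (x[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)   # feature mass in negative docs
    r = np.log(np.asarray(p / q))                       # naive-Bayes log-count ratios
    clf = LogisticRegression(C=4, max_iter=1000).fit(x.multiply(r), y)
    return vec, r, clf

def nbsvm_predict(vec, r, clf, texts):
    return clf.predict(vec.transform(texts).multiply(r))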

Perplexity alone unfortunately does not tell you much about how the model will perform on classification. Could you try to test it on some existing benchmark?

We will have high-level command-line tools to do just that. I've made a preliminary attempt at creating them, but we still need a data layer that detects the dataset format automatically. I will keep you posted.

Hi @t-v.

I downloaded the latest French Wikipedia corpus and, with the following code that counts the number of tokens (mostly words) in each text file of the docs folder created thanks to nlputils.py, I got about 492 million tokens.

If I understand your post (and Jeremy's) correctly, I should keep only 100 million tokens in docs (i.e., a number of articles with a total of 100 million tokens) before creating my LM databunch.

I’m going to delete a lot of training data. Can you confirm the process to follow? Thanks.

dest = path/'docs'
files = dest.ls()
num_tokens = 0

for i,f in enumerate(files):
    words = open(f, 'r', encoding='utf8').read()
    num_tokens += len(words.split())
print(i+1, num_tokens)
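
And this is how I would plan to trim docs down to roughly 100 million tokens, reusing path and files from above (the docs_100M folder name and the copy-until-target approach are just my assumptions; please correct me if that is not the idea):

import shutil

target_tokens = 100_000_000
dest_100M = path/'docs_100M'
dest_100M.mkdir(exist_ok=True)
kept = 0

for f in files:
    n = len(open(f, 'r', encoding='utf8').read().split())
    if kept + n > target_tokens: break          # stop once the budget is reached
    shutil.copy(f, dest_100M/f.name)
    kept += n
print(kept)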

If you are looking for a labeled dataset in English US, English UK, French, German or Japanese to train and test your ULMFiT classifier, you can download the Amazon Customer Reviews Datasets.

If you need help, I published a guide on downloading them.
