Language Model Zoo 🦍

(Davide Boschetto) #384

I’m about to start something for Italian since the assigned user has been inactive since last June, so I’m here asking: is the first post updated, with all the good tips working for the latest fastai version, or should I use some old version to make it work smoothly?

3 Likes

(Vinit Sutar) #385

Hi,

I’m currently working on a tabular dataset containing seven emotion categories.
Given data:
id, sentence, emotion
Target:
id, emotion

I want to use ULMFiT to analyse this dataset and predict the emotion of each id based on the sentence for that corresponding id.

I’m confused about how to proceed after reading the CSV file.

0 Likes

(hanan) #386

Hey all,
I trained the model on Hebrew Wikipedia:
https://github.com/hanan9m/hebrew_ULMFiT
Can you update the status?
And should I just open a new topic?

1 Like

#387

Interested in this as well. I have been thinking about it for ages.

0 Likes

(Piotr Czapla) #388

The Italian models were trained as part of an effort to compare ULMFiT against BERT. I need to find some time to move the modifications into fastai, but for the time being the models can be found here: https://drive.google.com/drive/u/0/folders/1t3DqH0ZJC2gDMKEq9vTa5jqxBiVAnClR
They work with https://github.com/n-waves/ulmfit-multilingual
I would love to see how they perform on Italian datasets other than MLDoc.

Please open the thread. Hebrew is not tackled yet as far as I know. Have you found a suitable dataset to test ULMFiT against?

@miko, @DavideBoschetto, we have trained two Italian language models and one classification model on MLDoc. With that in place, there are still some things that would be helpful to experiment with:

  • Test the current models on datasets other than MLDoc - it would be best if you added such a dataset to ulmfit-multilingual in the same way we added mldoc and cls.
  • Search for better hyperparameters for Italian. We tested only 3 models in a rather standard way; maybe you will find a better set of hyperparameters.
3 Likes

(hanan) #389

@piotr.czapla
Hey, actually I just beat a benchmark record released last year (by almost 2%!). I’m in touch with the author now to validate my results.
I opened a thread: ULMFiT - Hebrew

4 Likes

(Carlos Vouking) #390

The ‘nvidia-smi dmon’ and ‘nvidia-smi pmon’ commands could also be helpful.

1 Like

(Kristian Rother) #391

This sounds like a straightforward ULMFiT problem, if I understand you correctly. My guess is the approach would be:

  1. Build or use a pre-existing language model (e.g. one trained on WikiText-103)
  2. Transform your dataset from [id, sentence, emotion] to [0, sentence], because you train the language model on unlabeled data. Also split it into train/validation
  3. Use the new dataset to fine-tune the language model (load the LM weights from step 1, retrain). Save the model and save the encoder
  4. Load the encoder and train a classifier with your [id, sentence, emotion] dataset (the emotion is the label)
  5. Use predict to write your [id, emotion] target. You have to map the ids somehow.

Also note that this is multi-class classification, not binary as in most default examples. Check out the documentation or the RNN video from 2019 (lesson 3, iirc) and the corresponding notebook.
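Step 2 above can be sketched in plain Python with the standard library (the CSV column names and the 80/20 split here are assumptions for illustration, not from the original dataset):

```python
import csv
import random

def lm_and_classifier_data(path, valid_frac=0.2, seed=42):
    """Build the two datasets described above: an unlabeled one for
    LM fine-tuning and the original labeled rows for the classifier."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # assumes id, sentence, emotion headers

    # LM fine-tuning data: drop the label, keep only the text (label 0).
    lm_data = [(0, r["sentence"]) for r in rows]

    # Shuffle deterministically, then carve off a validation slice.
    random.Random(seed).shuffle(lm_data)
    n_valid = int(len(lm_data) * valid_frac)
    return lm_data[n_valid:], lm_data[:n_valid], rows
```

The labeled `rows` are kept untouched for step 4, where the emotion column becomes the classifier target.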

0 Likes

(Fred Guth) #392

I was able to create a pt-br LM and have saved the model .pth and the itos.pkl.
Now I want to classify a different corpus using my pretrained language model. I was not able to reproduce the IMDB notebook because it does not show how to load a custom model; it assumes you are working with English and downloads the pretrained WikiText-103 English LM.

Is there a notebook showing how to classify using your pretrained lm?

0 Likes

#393

I would like to ask: do you create translation models based on these language models? Something like German to English, as in Google Translate - would that also be a sub-purpose of this thread?

0 Likes

(Johannes Lackner) #394

Hi,
I loaded the model weights (.pth) & itos.pkl from a German LM into my LM learner like this:


You train the LM, then save the encoder part. Then you set up your classifier (as described in the course v3 IMDB notebook), load your LM encoder into it and classify:

learn = text_classifier_learner(data, AWD_LSTM, pretrained=False, drop_mult=0.05)
learn.load_encoder("your_LM_encoder_file")

0 Likes

(Serge Mankovski) #395

Is there a repository for the models? I am training a bacterial genome language model that was shared by @KarlH, and it seems that I am getting somewhere.

The model did not do very well on a small sample of genomes, but increasing the number of genomes from a couple of dozen to a few thousand made a difference. This model might turn out to be useful for bioinformatics after all. But boy, is it training slowly… it is like watching paint dry 🙂

1 Like

(joao.br) #396

Hey guys,

I’m training a Portuguese LM and I’m getting this error.

The error occurs when I try to use the Tokenizer:

**OverflowError: cannot serialize a bytes object larger than 4 GiB**

Does anyone know how I can fix this? On forums and on Stack Overflow it seems this error should only occur on Python versions below 3.4, but I’m using 3.6.8 and can’t get rid of it.

Here is the code:

import numpy as np
from fastai.text import *  # Tokenizer, BOS, FLD, partition_by_cores

def get_texts(df, number_of_labels=1):
    # label - all zeros for now?!
    labels = df.iloc[:, range(number_of_labels)].values.astype(np.int64)

    # first column: mark the beginning of the sentence
    texts = f'\n{BOS} {FLD} 1 ' + df[number_of_labels].astype(str)

    # other columns
    for i in range(number_of_labels + 1, len(df.columns)):
        texts += f' {FLD} {i - number_of_labels} ' + df[i].astype(str)

    texts = texts.values.astype(str)

    # split the texts across cores and tokenize (word by word)
    cores_p = partition_by_cores(texts, 1)
    tok = Tokenizer(lang='pt')
    tok = tok.process_all(cores_p)  # <-- this line raises the error
    return tok, list(labels)

The Linux distribution I’m using:

NAME="Red Hat Enterprise Linux Server"
VERSION="7.2 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="7.2"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.2
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.2"

Python version

Python 3.6.8 :: Anaconda custom (64-bit)

Stacktrace

Traceback (most recent call last):
  File "/home/joasilva/anaconda3/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/joasilva/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

Can anyone help me with this?

Thanks

0 Likes

#397

Is there a pre-trained language model using the Transformer or TransformerXL architecture available for English? If not, I am planning to train one and share it. I was wondering what preprocessing was used to train the AWD_LSTM that is available as the default pre-trained model. Also, does anyone know some good training parameters to start from? I was thinking of just going with 10 epochs of fit_one_cycle and adding some momentum.

0 Likes

(Johannes Lackner) #398

@Kaspar suggested some hyperparameters for training TransformerXL in this thread:
https://forums.fast.ai/t/training-transformerxl/40104/15

1 Like

#399

Thanks - unfortunately, I don’t have access to that topic.

0 Likes

(Johannes Lackner) #400

I see. @cduguet trained a Spanish Wikipedia language model on TransformerXL.

… and here is an attempt I ran recently doing the same for German Wikipedia on TransformerXL - a couple of hyperparameters are evident from both GitHub examples.

0 Likes

(Piotr Czapla) #401

The issue is multiprocessing: your training dataset seems to be larger than what pickle can serialize in one payload. Either make the dataset smaller, add additional chunking, or turn off multiprocessing and leave it running for a week.
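The chunking option can be sketched like this, assuming a `process_all`-style tokenizer; `tokenize_fn` and the chunk size are placeholders for illustration, not fastai API:

```python
def tokenize_in_chunks(texts, tokenize_fn, chunk_size=50_000):
    """Feed the corpus to the tokenizer one chunk at a time so that no
    single payload pickled for the worker processes exceeds 4 GiB."""
    tokens = []
    for start in range(0, len(texts), chunk_size):
        # Each call only ships chunk_size texts to the workers.
        tokens.extend(tokenize_fn(texts[start:start + chunk_size]))
    return tokens
```

With fastai v1 this could be called as `tokenize_in_chunks(texts, tok.process_all)`, lowering `chunk_size` until each pickled batch fits under the limit.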

0 Likes

(Kerem Turgutlu) #402

I will train a language model for Turkish. I dumped the wiki articles and extracted them into JSON, thanks to @Moody 🙂

Now I am wondering if there are any best practices for preprocessing the wiki text before the tokenization step.

For example, some trivial things I’ve come across are replacing \n\n with a single whitespace, and removing Section::: markers as well as any remaining HTML tags from the articles, since they are not language-related but rather artifacts of Wikipedia’s structure.

Here is a sample text (SPOILER ALERT)

Albus Percival Wulfric Brian Dumbledore (1881-1997), J.K. Rowling tarafından yazılmış Harry Potter serisindeki bir kurgusal karakterdir.

Çok zeki, araştırmacı, sakin ve kendini duygularına kaptırmayan çok güçlü bir büyücüdür. Gençliğinde aşırı güç meraklısıdır, daha sonra daha mantıklı davranmaya karar verir. (Kibar, biraz ilginç ve güçlü yapısıyla tipik iyi büyücü özelliklerini taşımaktadır.) Harry Potter'ın sorunlarını anlayışla karşılamasıyla ona diğer öğretmenlerden daha 'iyi' davrandığı söylenebilir. Herkes tarafından sevilen ve sayılan bir büyücü olan Dumbledore, Lord Voldemort'un korktuğu yegane büyücüdür. Sihir otoritelerinin genel kanısına göre de gelmiş geçmiş en güçlü büyücüdür. Dumbledore'un yaşamı 116 yıl sürmüştür. Altıncı kitapta (Harry Potter ve Melez Prens) Severus Snape tarafından Avada Kedavra lanetiyle öldürülen Dumbledore, 1944-1997 tarihleri arasında Hogwarts Cadılık ve Büyücülük Okulu'nun müdürlüğünü yapmıştır.

Uzun ve ince olarak betimlenen Dumbledore'un uzun saç ve sakalları vardır. Ünlü büyücünün, mavi gözleri, çok uzun ve kancalı bir burnu ve uzun parmakları vardır. Yarım ay çerçeveli gözlükleri ve şaşaalı cübbesi ilk göze çarpan şeylerdir. Sol dizinin üstünde Londra metro'sunun haritasını gösteren, bir düellodan kalma bir yara izi vardır. Dumbledore'un Çikolatalı Kurbağa kartına göre oda müziği ve on lobutlu bowlingden hoşlanmaktadır. 1945'te kara büyücü Grindelwald'u yenmesi ejderha kanının 12 ayrı konuda kullanılışını bulması ve arkadaşı Nicholas Flamel ile simya konusunda yürüttüğü çalışmalarla ünlüdür. Sihirli ya da sihirsiz bütün şekerli yiyeceklere karşı bir zaafı vardır. Ofisini koruyan heykelin şifresini de genellikle bu tatlı isimlerinden seçer. Ancak, Bertie Bott 1930 doğumlu olduğu için, Dumbledore'un "gençlik" ile neyi kastetitiği anlaşılamamıştır. En sevdiği tatlar ise, Böğürtlen ve Marmelat'tır. Dumbledore, aynı zamanda bir örgü meraklısıdır. Ayrıca yazar J.K Rowling' in yaptığı açıklamaya göre kendisi eşcinseldir ve Gellert Grindelwald'a aşıktır.

Yazarın böylesine bilge bir kişiye "Albus Dumbledore" ismini vermesi rastgele yapılmış bir seçim değildir. Albus, Latince "beyaz" anlamına gelir ve "bilgelik" ile "aydınlanmayı" temsil eder. Dumbledore ise "yabanarısı" (İngilizce "bumblebee") anlamına gelmekle yazar tarafından özellikle seçilmiştir çünkü İngilizce'de "bumble around", "etrafta dikkatsizce gezinmek" demektir. Yazar Dumbledore' u yaratırken onun Hogwarts koridorlarında dolaştığını hayal ettiği için bu fiille ilintili bir isim seçmiştir.

**Section::::Karakter gelişimi.**
Dumbledore'un bir kız kardeşi ve bir erkek kardeşi vardır

Are there any best practices to follow for this dataset?

Thanks! 😄

fix_html() doesn’t seem to account for all HTML tags. For example:

 ('<nowiki>', 7815),
 ('<br>', 7165),
 ('<BR>', 582),
 ('</div>', 572),
 ('<onlyinclude>', 555),
 ('</onlyinclude>', 539),
 ('<br \\n>', 461),
 ('<li>', 447),
 ('</ref>', 445),
 ('<noinclude>', 194),
 ('<ENTER>', 59),
 ('</noinclude>', 58),
 ('</poem>', 54),
...

P.S. Only 6305/328830 articles contain such HTML tags, so I can simply discard them. But I was just curious to know what more experienced language modelers do as a best practice.
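For what it’s worth, the trivial cleanups mentioned above can be sketched with a few regexes (these patterns are guesses tuned to the samples shown, not a vetted recipe):

```python
import re

def clean_wiki_text(text):
    """Strip Wikipedia-structure residue that isn't language content."""
    text = re.sub(r"Section:::+", "", text)         # Section:::: markers
    text = re.sub(r"</?[A-Za-z][^>]*>", " ", text)  # leftover HTML tags
    text = re.sub(r"\n\n+", " ", text)              # \n\n -> single whitespace
    return re.sub(r"[ \t]{2,}", " ", text).strip()  # collapse extra spaces
```

Discarding the affected 6305 articles is probably fine too, given how small a fraction they are of the corpus.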

1 Like

(Kerem Turgutlu) #403

Are there any forum threads pointing out how to deal with GPU memory issues while training language models from scratch? I am not able to fit with bs > 4, and fp16 doesn’t seem to work in this case.

Right now I am testing stuff with a single 2080ti.

0 Likes