Language Model Zoo 🦍

(Kristian Rother) #391

This sounds like a straightforward ULMFiT problem, if I understand you correctly. My guess would be that the approach is (rough code sketch below):

  1. Build or use a pre-existing language model (e.g. one pretrained on WikiText-103)
  2. Transform your dataset from [id, sentence, emotion] to [0, sentence], because you train the language model on unlabeled data. Also split it into train/validation
  3. Use the new dataset to fine-tune the LM (load the LM weights from step 1 and retrain). Save the model and save the encoder
  4. Load the encoder and train a classifier with your [id, sentence, emotion] dataset (the emotion is the label)
  5. Use predict to write your [id, emotion] target. You have to map the ids somehow.

Also note that this is multilabel classification and not binary as in most default examples. Check out the documentation or the RNN video from 2019 (lesson 3 iirc) and the corresponding notebook.
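
A rough sketch of those steps in fastai v1, assuming a DataFrame with 'sentence' and 'emotion' columns (paths, column names and hyperparameters below are placeholders, not tested values):

from fastai.text import *

path = Path('.')

# Steps 1-3: fine-tune a pretrained language model on the unlabeled sentences
data_lm = TextLMDataBunch.from_df(path, train_df, valid_df, text_cols='sentence')
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)  # loads the WikiText-103 weights
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(3, 1e-3)
learn_lm.save_encoder('ft_enc')  # save the fine-tuned encoder for the classifier

# Step 4: train the classifier on [sentence, emotion], reusing the LM vocab and encoder
data_clas = TextClasDataBunch.from_df(path, train_df, valid_df, vocab=data_lm.vocab,
                                      text_cols='sentence', label_cols='emotion')  # add label_delim for multi-label
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(3, 1e-2)

# Step 5: predict an emotion per sentence; mapping predictions back to ids is up to you
preds = [learn_clas.predict(s)[0] for s in test_df['sentence']]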

0 Likes

(Fred Guth) #392

I was able to create a pt-br LM and have saved the model .pth and the itos.pkl.
Now I want to classify a different corpus using my pretrained language model. I was not able to reproduce the IMDB notebook because it does not show how to load a model; it assumes you are working in English and downloads the pretrained WikiText-103 English LM.

Is there a notebook showing how to classify using your pretrained lm?

0 Likes

#393

I would like to ask: do you create translation models based on these language models, such as German to English, like we have in Google Translate? Would that also be a sub-purpose of this thread?

0 Likes

(Johannes Lackner) #394

Hi,
I loaded the model weights (.pth) & itos.pkl from a German LM into my LM learner like this:
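
A minimal sketch of that step, assuming fastai v1 and that the .pth weights and the itos.pkl have been copied into the models folder of your data path (the two filenames below are placeholders, given without extensions):

learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3,
                                  pretrained_fnames=['de_lm_weights', 'de_lm_itos'])
# pretrained_fnames expects the (weights, itos) names; fastai then loads
# data_lm.path/'models'/de_lm_weights.pth and data_lm.path/'models'/de_lm_itos.pkl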


You train the LM, then save the encoder part. Then you set up your classifier (as described in the course v3 IMDB notebook), load your LM encoder into it and classify:

learn = text_classifier_learner(data, AWD_LSTM, pretrained=False, drop_mult=0.05)
learn.load_encoder('your_LM_encoder_file')

0 Likes

(Serge Mankovski) #395

Is there a repository for the models? I am training a bacterial genome language model that was shared by @KarlH and it seems that I am getting somewhere.

The model did not do very well on a small sample of genomes, but increasing the number of genomes from a couple of dozen to a few thousand made a difference. This model might turn out to be useful for bioinformatics after all. But boy, is it training slowly… it is like watching paint dry :slight_smile:

1 Like

(joao.br) #396

Hey Guys.

I'm training a Portuguese LM, and I'm getting this error.

The error occurs when I try to use the Tokenizer…

**OverflowError: cannot serialize a bytes object larger than 4 GiB**

Does anyone know how I can fix this? In forums and on Stack Overflow, it seems this error should only occur on Python versions below 3.4, but I'm using 3.6.8 and can't get rid of it…

Here the code

import numpy as np
from fastai.text import *   # Tokenizer, partition_by_cores, BOS, FLD

def get_texts(df, number_of_labels=1):
    # label - all zeros for now?!
    labels = df.iloc[:, range(number_of_labels)].values.astype(np.int64)

    # first field: mark the beginning of the sentence
    texts = f'\n{BOS} {FLD} 1 ' + df[number_of_labels].astype(str)

    # other fields
    for i in range(number_of_labels + 1, len(df.columns)):
        texts += f' {FLD} {i - number_of_labels} ' + df[i].astype(str)

    texts = texts.values.astype(str)

    # tokenize the text, token by token (word by word)
    cores_p = partition_by_cores(texts, 1)
    tok = Tokenizer(lang='pt')
    tok = tok.process_all(cores_p)   # <-- this line causes the error!
    return tok, list(labels)

The Linux distribution I'm using…

NAME="Red Hat Enterprise Linux Server"
VERSION="7.2 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="7.2"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.2
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.2"

Python version

Python 3.6.8 :: Anaconda custom (64-bit)

Stacktrace

Traceback (most recent call last):
 File "/home/joasilva/anaconda3/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/joasilva/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

Can anyone help me with this?

Thanks

0 Likes

#397

Is there a pretrained language model using the Transformer or TransformerXL architecture available for English? If not, I am planning to train one and share it. I was also wondering what preprocessing was used to train the AWD_LSTM that is available as the default pretrained model. And does anyone know some good training parameters to start from? I was thinking of just going with 10 epochs of fit_one_cycle and adding some momentum.
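
For reference, a hedged example of what that schedule would look like in fastai v1 (learning rate and momentum values are just a starting guess, not something validated here):

learn.fit_one_cycle(10, 1e-3, moms=(0.95, 0.85))  # moms = (high, low) momentum over the cycle; raise these to add momentum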

0 Likes

(Johannes Lackner) #398

@Kaspar suggested some hyperparameters for training TransformerXL in this thread:
https://forums.fast.ai/t/training-transformerxl/40104/15

1 Like

#399

Thanks, unfortunately I don't have access to that topic.

0 Likes

(Johannes Lackner) #400

I see. @cduguet trained a Spanish Wikipedia language model on TransformerXL…

… and here is an attempt I ran recently doing the same for German Wikipedia on TransformerXL - a couple of hyperparameters are evident from both GitHub examples.

0 Likes

(Piotr Czapla) #401

The issue is multiprocessing: your training dataset seems to be larger than what pickle can handle. Either make the dataset smaller, add additional chunking, or turn off multiprocessing and leave it running for a week.
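
A rough sketch of the chunking idea with the fastai v1 Tokenizer (chunk_size is just an illustrative value; shrink it until each chunk pickles comfortably under 4 GiB, or pass n_cpus=1 to turn multiprocessing off entirely):

from fastai.text import Tokenizer

def tokenize_in_chunks(texts, lang='pt', chunk_size=100000):
    # Tokenize the corpus a slice at a time so that no single object handed
    # to the multiprocessing workers exceeds pickle's 4 GiB limit.
    tok = Tokenizer(lang=lang)   # Tokenizer(lang=lang, n_cpus=1) disables multiprocessing
    tokens = []
    for i in range(0, len(texts), chunk_size):
        tokens += tok.process_all(list(texts[i:i + chunk_size]))
    return tokens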

0 Likes

(Kerem Turgutlu) #402

I will train a language model for Turkish. I dumped the wiki articles and extracted them into json thanks to @Moody :slight_smile:

Now, I am wondering if there are any best practices for preprocessing the wiki text before the tokenization step.

For example, some trivial things I've come across are replacing \n\n with a single whitespace, and removing Section::: markers and the remaining HTML tags from the articles, since they are not language related but rather related to Wikipedia's structure.

Here is a sample text (SPOILER ALERT)

Albus Percival Wulfric Brian Dumbledore (1881-1997), J.K. Rowling tarafından yazılmış Harry Potter serisindeki bir kurgusal karakterdir.

Çok zeki, araştırmacı, sakin ve kendini duygularına kaptırmayan çok güçlü bir büyücüdür. Gençliğinde aşırı güç meraklısıdır, daha sonra daha mantıklı davranmaya karar verir. (Kibar, biraz ilginç ve güçlü yapısıyla tipik iyi büyücü özelliklerini taşımaktadır.) Harry Potter'ın sorunlarını anlayışla karşılamasıyla ona diğer öğretmenlerden daha 'iyi' davrandığı söylenebilir. Herkes tarafından sevilen ve sayılan bir büyücü olan Dumbledore, Lord Voldemort'un korktuğu yegane büyücüdür. Sihir otoritelerinin genel kanısına göre de gelmiş geçmiş en güçlü büyücüdür. Dumbledore'un yaşamı 116 yıl sürmüştür. Altıncı kitapta (Harry Potter ve Melez Prens) Severus Snape tarafından Avada Kedavra lanetiyle öldürülen Dumbledore, 1944-1997 tarihleri arasında Hogwarts Cadılık ve Büyücülük Okulu'nun müdürlüğünü yapmıştır.

Uzun ve ince olarak betimlenen Dumbledore'un uzun saç ve sakalları vardır. Ünlü büyücünün, mavi gözleri, çok uzun ve kancalı bir burnu ve uzun parmakları vardır. Yarım ay çerçeveli gözlükleri ve şaşaalı cübbesi ilk göze çarpan şeylerdir. Sol dizinin üstünde Londra metro'sunun haritasını gösteren, bir düellodan kalma bir yara izi vardır. Dumbledore'un Çikolatalı Kurbağa kartına göre oda müziği ve on lobutlu bowlingden hoşlanmaktadır. 1945'te kara büyücü Grindelwald'u yenmesi, ejderha kanının 12 ayrı konuda kullanılışını bulması ve arkadaşı Nicholas Flamel ile simya konusunda yürüttüğü çalışmalarla ünlüdür. Sihirli ya da sihirsiz bütün şekerli yiyeceklere karşı bir zaafı vardır. Ofisini koruyan heykelin şifresini de genellikle bu tatlı isimlerinden seçer. Ancak, Bertie Bott 1930 doğumlu olduğu için, Dumbledore'un "gençlik" ile neyi kastettiği anlaşılamamıştır. En sevdiği tatlar ise, Böğürtlen ve Marmelat'tır. Dumbledore, aynı zamanda bir örgü meraklısıdır. Ayrıca yazar J.K Rowling'in yaptığı açıklamaya göre kendisi eşcinseldir ve Gellert Grindelwald'a aşıktır.

Yazarın böylesine bilge bir kişiye "Albus Dumbledore" ismini vermesi rastgele yapılmış bir seçim değildir. Albus, Latince "beyaz" anlamına gelir ve "bilgelik" ile "aydınlanmayı" temsil eder. Dumbledore ise "yabanarısı" (İngilizce "bumblebee") anlamına gelmekle yazar tarafından özellikle seçilmiştir çünkü İngilizce'de "bumble around", "etrafta dikkatsizce gezinmek" demektir. Yazar Dumbledore'u yaratırken onun Hogwarts koridorlarında dolaştığını hayal ettiği için bu fiille ilintili bir isim seçmiştir.

**Section::::Karakter gelişimi.**
Dumbledore'un bir kız kardeşi ve bir erkek kardeşi vardır

Are there any best practices to follow for this dataset?

Thanks! :smiley:

def fix_html() doesn't seem to account for all HTML tags. For example:

 ('<nowiki>', 7815),
 ('<br>', 7165),
 ('<BR>', 582),
 ('</div>', 572),
 ('<onlyinclude>', 555),
 ('</onlyinclude>', 539),
 ('<br \\n>', 461),
 ('<li>', 447),
 ('</ref>', 445),
 ('<noinclude>', 194),
 ('<ENTER>', 59),
 ('</noinclude>', 58),
 ('</poem>', 54),
...

P.S. Only 6305/328830 articles have such html tags, so I can simply discard them. But I was just curious to know what more experienced language modelers do as a best practice.
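
For what it's worth, a rough cleanup sketch along those lines (the regexes are illustrative, not an established best practice), applied before tokenization instead of discarding the affected articles:

import re

def clean_wiki_text(text):
    text = re.sub(r'Section::::[^\n]*', ' ', text)  # wiki-extractor structure markers
    text = re.sub(r'<[^>]+>', ' ', text)            # leftover html tags such as <br>, </ref>
    text = re.sub(r'\n\n+', ' ', text)              # blank-line paragraph breaks -> single space
    return re.sub(r'\s{2,}', ' ', text).strip()     # squeeze repeated whitespace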

1 Like

(Kerem Turgutlu) #403

Are there any forum threads on how to deal with GPU memory issues when training language models from scratch? I am not able to fit with bs > 4, and fp16 doesn't seem to work in this case.

Right now I am testing stuff with a single 2080ti.
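
In case it helps, the knobs that usually dominate LM memory in fastai v1 are the vocabulary size, bptt and batch size; a hedged sketch (values are illustrative, not settings tested on a 2080 Ti):

data_lm = TextLMDataBunch.from_df(path, train_df, valid_df, text_cols='text',
                                  max_vocab=30000,   # smaller vocab -> smaller embedding and softmax layers
                                  bs=32, bptt=50)    # shorter sequences often let you raise the batch size
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)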

0 Likes

(Momchil Zhelev) #404

Hi,

I attempted to train a Bulgarian language model using the Wikipedia corpus, following the Telugu model notebook, and thought I'd share some stats and results below. Gradient stops after 12 hours, so I did 3x5 epochs instead of 15. It still seems to be underfitting; I will try playing with reduced dropout to address this. I am currently looking for suitable data for classification, and I am not sure where to find benchmark numbers for Bulgarian language models.
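
One way to work around the 12-hour cutoff (a sketch assuming a fastai v1 language-model learner; the checkpoint name is a placeholder): save after each block of epochs and reload at the start of the next session before continuing.

learn.fit_one_cycle(5, 1e-3)
learn.save('bg_wiki_lm_5ep')      # writes to path/'models'
# --- new 12-hour session ---
learn.load('bg_wiki_lm_5ep')
learn.fit_one_cycle(5, 1e-3)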

Bulgarian wiki stats:
• Number of documents: 251,877
• Number of words: 53,811,583
• Number of unique tokens: 3,273,966

Results:

(results screenshot)

0 Likes