Language Model Zoo 🦍

Glad to be of help. :slight_smile:

Actually, what I meant by using the subtitle data is that you could use a trained English sentiment analysis model to label the English subtitles and then use those labels for their corresponding Japanese ones.

For example: an English model could predict “I’m very excited!” to have positive sentiment, and you could attach the same label to the corresponding Japanese sentence 楽しみです! And you know the exact translation (‘exact’ to be taken with a grain of salt) because you have subtitles in both languages.
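In code, the label transfer could look roughly like this (a minimal sketch: the sentiment model, its predict method, and the aligned-pair format are placeholders I made up, not any specific library's API):

# Sketch: transfer sentiment labels from English subtitle lines to the
# aligned Japanese ones. `english_model.predict` is a placeholder.

def transfer_labels(aligned_pairs, english_model):
    """aligned_pairs: list of (english_line, japanese_line) tuples."""
    labelled = []
    for en, ja in aligned_pairs:
        label = english_model.predict(en)   # e.g. 'positive' / 'negative'
        labelled.append((ja, label))        # reuse the label for the Japanese line
    return labelled

# pairs = [("I'm very excited!", "楽しみです!")]
# japanese_dataset = transfer_labels(pairs, english_sentiment_model)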

As far as scraping goes, there are scripts out there to scrape data from websites like Rakuten, but I’m not sure if you could open source the data or the model, since it’s proprietary. :stuck_out_tongue:

News from the Spanish classifier: I tested it on the TASS General Corpus dataset and I am getting a 0.57 macro F1 score, which is SOTA as of last year (the best result had been 0.562).

Right now the competition is taking place so we’ll see if a new SOTA is achieved for 2018.

Notebook


Ha, yes. I didn’t think as far as open sourcing the data or the model. It was more for my little experiment :slight_smile:

Hey Matthias,

I have a rookie question. Is it ok to evaluate the performance (in this case the F1 score) of the model with the validation set? I did the same as you, since I do not have a test set for this dataset.

My understanding is that this approach is fine as long as you do not use the validation data to train. Does anyone know better?

Thanks!

Hi Francisco,

my understanding would be that you should make a 50/25/25 split of your data into training set / validation set / test set, and then use the validation set for tuning the model and the test set for evaluating it…
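In code, that split could look something like this (a quick sketch using scikit-learn; df is assumed to be a DataFrame of your labelled texts):

from sklearn.model_selection import train_test_split

# 50% train, then split the remaining 50% in half -> 25% validation, 25% test
train_df, rest_df = train_test_split(df, test_size=0.5, random_state=42)
valid_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)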

I merely wanted to get a quick intuition about the performance of my model, that’s why I used the validation set :slight_smile:

Best,

Matthias


Subword sampling for SentencePiece could also go in at inference time, like the learner.TTA() method for images.

If subword sampling during training and during inference turns out to be as effective as image transformations, it should soon become essential. Fastai would be pretty cutting edge if it made subword sampling as simple as image transformations.
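SentencePiece already exposes sampled segmentation, so a TTA-style loop at inference could look roughly like this (a sketch: the model file and the classifier that consumes token ids are assumptions):

import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spm.model')                      # assumed: a trained SentencePiece model

def predict_tta(text, classifier, n_samples=5):
    """Average predictions over several subword samplings of the same text."""
    preds = []
    for _ in range(n_samples):
        # nbest_size=-1 samples from all segmentations, alpha=0.1 smooths the distribution
        ids = sp.SampleEncodeAsIds(text, -1, 0.1)
        preds.append(classifier(ids))     # assumed: classifier scores a list of ids
    return np.mean(preds, axis=0)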

How did your parameter tuning go? I must admit I don’t speak enough Korean to read the article you linked.
Did you try with Sylvain’s parameters, too?
It would be good to have a baseline to compare against.
I have played with the model (i.e. reimplemented bits to see what’s going on) enough that I think I could make an adaptation that deals with variable-length inputs in the batches, but I haven’t benchmarked it or looked at the training process itself.

I think I (wrongly) bit off too many moving pieces at once to get a good grip on the parameters yet:

  • SP settings (vocab size, special tokens; see the training sketch after this list)
    • Tried 24k, 32k, and finally 28k. 28k is working well.
  • Corpus prep – Korean wiki tends to lean pretty formal in tone and towards statements of fact, so I experimented with adding novel text, internet comments, etc. to both the SP corpus and the LM corpus
  • LSTM vs QRNN, CLR vs SGD, and all relevant parameters
    • Differences between a ~4m word corpus and a ~100m word corpus
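For reference, the SP training step itself is just one call; a minimal sketch (the corpus path is a placeholder; only the 28k vocab size reflects what I actually settled on):

import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    '--input=ko_wiki.txt --model_prefix=ko_sp '      # placeholder corpus path
    '--vocab_size=28000 --model_type=unigram '
    '--character_coverage=0.9995'                    # helpful for large character sets
)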

The best classification result I’ve gotten is ~89% accuracy on a binary classification task for a non-public problem and dataset, which is hugely improved from the ~75% I was getting previously. I’m in the middle of another 16-hour LM training.

I really want to implement the subword sampling at train/test time, but my Python chops aren’t up to par.


I guess a hacky way of doing subword sampling at train time would be to make N different samplings of the corpus and just concat them together, treating them as one giant corpus. If you do ~5 resamplings, you’d probably have to decrease the total epoch count to prevent overfitting, but this should be conceptually similar, if I’m not misunderstanding something.

edit: no need to reduce bs, because the memory taken up is bs * bptt; the total corpus size doesn’t matter
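A sketch of that resampling-and-concat idea (assuming a trained SentencePiece model; the path and sampling parameters are made up):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spm.model')                              # assumed trained model

def resampled_corpus(texts, n_resamplings=5):
    """Concatenate several independently sampled segmentations into one token stream."""
    all_ids = []
    for _ in range(n_resamplings):
        for t in texts:
            # nbest_size=-1, alpha=0.1: sample a different segmentation each pass
            all_ids.extend(sp.SampleEncodeAsIds(t, -1, 0.1))
    return all_ids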


This paper used strokes to learn Chinese word embeddings and pointed out the shortfalls of using images to learn them:
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

cc: @shoof

@Moody This looks great! Thanks for sharing! Stroke n-grams, including how they exploit the natural stroke order within characters, look really interesting :slight_smile:

@jamesrequa I am thinking of how to apply this to Traditional Chinese. The most complex Traditional Chinese character (picture below) has 36 strokes, which gives 595 stroke n-grams (3-grams to 36-grams). :scream: The good news is that the Wiki dataset is relatively small and there are only around 5K unique tokens.

[image: the 36-stroke character]

After trying for a while, I don’t think SentencePiece is promising for logograms. I am thinking of switching to Jieba for tokenization, or trying something completely new. Any words of wisdom?
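If I go the Jieba route, a first pass would be only a couple of lines; I believe the bigger dict.txt.big dictionary is the one recommended for Traditional Chinese, though I haven’t verified that myself:

import jieba

# Assumed: dict.txt.big downloaded locally for better Traditional Chinese coverage.
jieba.set_dictionary('dict.txt.big')

tokens = list(jieba.cut('今天天氣很好'))   # word-level segmentation
print(tokens)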

Hey guys, Thai language here. A little late to the party. I have just refactored my repo since I forgot to apply for part2v2. Here are some issues/questions I have; I wonder if you guys are facing them too:

  • I could never train a classifier entirely unfrozen on my dataset. The best performance came from unfreezing only the last two layers. Why would this be?
  • I’m relying on rules of thumb / trial and error for the use_clr parameters ((10,10) by default, then a higher first number and lower last number if the dataset is small or overfitting too much). Do you have any systematic way of doing this? I’ve read Leslie Smith’s and Jeremy’s papers but still cannot figure it out.
  • What text cleaning are you guys using? I only remove repetitive characters and that’s it.
  • For languages like Thai whose words are stuck together, how much does your tokenizer’s performance impact your classification performance? I’m using a 70%-accuracy tokenizer (pretty low by current Thai-language standards) but could still get very good results. I suspect you don’t need a REALLY good tokenizer, just a consistent one. Planning to use sentencepiece and other tokenizers for comparison.
  • I’m trying to use the language model also as a feature extractor by average-pooling the outputs of the last LSTM layer (see the sketch below). Does this practice make sense? Is it better than averaging word vectors?
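Here is roughly what I mean by the pooling (a sketch, assuming the encoder returns per-timestep outputs for each layer the way the fastai/AWD-LSTM language model does; not the exact fastai call):

import torch

def doc_vector(model, token_ids):
    """Mean-pool the last LSTM layer's outputs as a fixed-size document feature."""
    model.eval()
    with torch.no_grad():
        raw_outputs, outputs = model[0](token_ids)   # assumed: model[0] is the RNN encoder
        last_layer = outputs[-1]                     # shape: (seq_len, batch, hidden)
        return last_layer.mean(dim=0)                # average pooling over time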

Hi everyone,

I’m currently trying to create a language model for Ancient Greek (yeah, I know). It’s more of a test run so far. I’ve created my own ‘corpus’ from various sources on the web, mainly original Greek texts from famous authors. I have been using cltk (a classical-language nltk) for tokenization, which was pretty cool. There is also another library that accurately splits words into syllables (a standard thing for all Greek words); I haven’t tried it yet, but I am hoping it can offer more accuracy to the model.

My tests so far, though, have not been that good. Most of my models have been oscillating, reaching at best a loss of 7-8. I mostly blame my dataset for not being good or clean enough, but I wonder if there is something else at play here.

Given that Ancient Greek is quite a rich language (152k tokens in 384 documents so far), is there any advice you could share about how to approach this, maybe from slightly ‘similar’ cases? Also, if anyone is aware of a cleaner Ancient Greek corpus, let me know! In the meantime, message me if you want the one I’ve constructed.

Kind regards,
Theodore.


Hi guys,

I’m working on building an LM for Japanese. Rather than building everything from scratch, I’ve used Mecab + SentencePiece (word level) for tokenization and AWD-LSTM for the LM.
Since the Japanese wiki vocabulary was too sparse, I had to limit the vocab size to 32000 to fit in my 15.5GB of RAM. After training the AWD-LSTM (with default parameters) for only 2 epochs, I didn’t get meaningful text generation, so I will surely train for longer.

My question is related to lesson 10, where @jeremy used pre-trained weights from the AWD-LSTM wt103 English model. I would like to save my weights and load them later, like PRE_LM_PATH = PRE_PATH/‘fwd_wt103.h5’, but I’m not sure how. The model loaded with torch.load doesn’t seem to have any weight-related parameters:

RNNModel(
  (lockdrop): LockedDropout()
  (idrop): Dropout(p=0.65)
  (hdrop): Dropout(p=0.3)
  (drop): Dropout(p=0.4)
  (encoder): Embedding(32000, 400)
  (rnns): ModuleList(
    (0): WeightDrop(
      (module): LSTM(400, 1150)
    )
    (1): WeightDrop(
      (module): LSTM(1150, 1150)
    )
    (2): WeightDrop(
      (module): LSTM(1150, 400)
    )
  )
  (decoder): Linear(in_features=1150, out_features=32000, bias=True)
)

You could try lowering the dropout rates substantially and see if it starts learning faster. With dropouts as high as you currently have, it could be quite difficult for the model to learn.
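For example, if that printout is the salesforce awd-lstm-lm RNNModel, you could instantiate it with roughly half the default rates (a sketch; double-check the keyword names against that repo’s model.py):

from model import RNNModel   # model.py from the awd-lstm-lm repo (assumed import path)

# Roughly half the rates visible in the printout above (0.65 / 0.3 / 0.4).
model = RNNModel('LSTM', ntoken=32000, ninp=400, nhid=1150, nlayers=3,
                 dropouti=0.3, dropouth=0.15, dropout=0.2,
                 dropoute=0.05, wdrop=0.25, tie_weights=True)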


What you’re seeing there is just the modules. You can get a list of parameter names with [n for n, p in model.named_parameters()] or so, but that would not include batch norm running averages etc.; for those you would want model.state_dict().keys(). Saving and loading should work as in Jeremy’s and the other shared notebooks (learner.save). The fastai lib saves and loads state_dicts (so just the weights, not the model structure), so you would use torch.save/load and model.state_dict()/load_state_dict() if you want to stay compatible with that. (The h5 extension seems slightly inaccurate, as I don’t think fastai uses hdf5 as a file format; the latter is commonly done with Keras.)
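Concretely, the plain-PyTorch pattern would be (the path is a placeholder):

import torch

# Save just the weights (a state_dict), which is also what learner.save does.
torch.save(model.state_dict(), 'ja_lm_weights.pt')

# Later: rebuild the model with the same architecture, then load the weights.
model.load_state_dict(torch.load('ja_lm_weights.pt'))

# Inspect the parameter names/shapes, e.g. to map them onto another model:
print(list(model.state_dict().keys())[:5])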


Hi.

I have a doubt about the next word predicted by the language models we are building. It seems I cannot achieve the same results Jeremy achieved with the Arxiv notebook. I created a Portuguese language model and indeed got better results when fine-tuning on a law text corpus and using it for classification. The Portuguese LM achieved the same level of perplexity as other people here achieved for other languages.

But I am intrigued about the way the language model works.

In the Arxiv notebook, when the context changed (e.g. category csni vs. category cscv), the model came up with different words for the next prediction for the same sequence, depending on the category context.

When I try to emulate that using my Portuguese LM, or even the Law LM fine-tuned from the Portuguese LM, I always get the same prediction based only on the last word of the seed sentence. It doesn’t matter what came before: if the last token of my seed sentence is “to”, I will always get the same predictions for the next word (exactly the same probabilities), regardless of the previous context. I used the same prediction code as the Arxiv notebook, and when I ask it to predict the next 50 words the sentences are coherent, so I guess the code is working ok. But the results always depend only on the last word of the seed.

Is that expected? What is different? Maybe using the 1cycle policy? I see that in the Arxiv notebook Jeremy took a lot of epochs to converge, while with the 1cycle policy my model converged (overfitting) in 15 epochs. Could that have something to do with the different results?

Thanks in advance.

Best regards to all,
Fabio.


Thank you so much. After saving the model with model.state_dict(), I was able to get the weights.

After training for 10 epochs, I still couldn’t get meaningful text generation. For example:
‘file 廃止 村 町 英語 通り て 版 月 構成 四 軍 世帯 nhk 代 歴史 メンバー 場合 and 作 上 中 人物 人物 等 フランス 四 店 新 使用 行っ 世 情報 による of 回戦 ば れる 著書 地名 cd 的 開発 獲得 によって 試合 お である 出版 回 東京 それ 一 か に 発売 センター 社 代 など アメリカ合衆国 シングル 一覧 世紀 なり 系 選手 その他 概要 出場 テレビ 略歴 ず km 死去 へ 交通 著書 出典 市立 時代 の 英語 それ れ と して 曲 開始 いる 道 現在 世界 参加 軍 大会 機 出演 完成 他 男子’

I’m wondering where the problem could be. I suspect the following:

  1. mecab tokenization is not good enough, so feeding the result of mecab tokenization into sentencepiece probably made it even worse.
  2. Limiting the vocab to only 32000 might be too few to represent the whole wiki.
  3. Parameters. I trained with the default parameters from AWD-LSTM and with the ones from lesson 4 of fastai; neither gave a good result. I see my loss decreasing from 8 to 4.7.

Things I should maybe try:

  1. Longer training (more than 10 epochs)
  2. A bigger vocab size with a different tokenization method
  3. Building a character-based LM

It would be great if you guys (including @jeremy :slight_smile: ) could suggest what kind of method I should try. Thank you.


I don’t know enough about Japanese to tell, but is the expectation of the pipeline to have MeCab + SentencePiece, or would you use just one or the other? My understanding of the SentencePiece readme is that using SentencePiece without pre-tokenization for Japanese is an intended use case, so it might also be worth asking them.
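A quick way to sanity-check that route would be to train a SentencePiece model directly on raw (untokenized) Japanese text and look at the pieces it produces (a sketch; ja_raw.model is an assumed model trained that way):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('ja_raw.model')   # assumed: trained on raw, untokenized Japanese wiki text

# SentencePiece treats the input as a raw character stream, so no MeCab pass
# is needed before encoding.
print(sp.EncodeAsPieces('吾輩は猫である。'))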