ULMFiT - Chinese (Simplified) in progress

Starting this thread to share the progress on the Chinese LM and classification results @piotr.czapla @Moody

Datasets:

  1. THULACNews
  2. Repo for Yann LeCun’s papers

Initially I used dataset 1, from Tsinghua University’s NLP group, to train the LM and do classification with SentencePiece (SP). I used an SP tokenizer with a 32k vocab and got ~70% classification accuracy, though this was not the best experiment.

I didn’t manage to find a Chinese equivalent of IMDb or Yelp. The standard datasets and benchmarks for Chinese text classification seem a bit all over the place. The results from here look like the closest thing to SOTA, but they only report test performance on an arbitrarily chosen subset of the dataset mentioned above (an average F1 of 95% seems hard to beat).

Now I am ready to retrain the LM using fastaiv1 and Wikipedia, and then the classifier using the datasets that were used in Yann LeCun’s papers.

So far my effort has been:

  1. Created 8k, 20k, and 32k SP tokenizers using Wikipedia.
  2. Tokenized the Wikipedia text using the 3 tokenizers.
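
The two steps above could look roughly like the sketch below. It is a minimal sketch, not the exact setup used here: the corpus path, the output directory, the `zhwiki_` model prefix, the `unigram` model type, and the `character_coverage` value are all assumptions (0.9995 is the value the SentencePiece README suggests for languages with large character sets). The helper names are made up for illustration.

```python
from pathlib import Path

# The 8k / 20k / 32k vocab sizes mentioned in the post.
VOCAB_SIZES = [8000, 20000, 32000]


def spm_train_kwargs(corpus_path, vocab_size, out_dir="sp_models"):
    """Build keyword arguments for sentencepiece.SentencePieceTrainer.train.

    The prefix naming and the model settings are assumptions, not from the thread.
    """
    prefix = Path(out_dir) / f"zhwiki_{vocab_size // 1000}k"
    return {
        "input": str(corpus_path),
        "model_prefix": str(prefix),
        "vocab_size": vocab_size,
        "model_type": "unigram",       # assumed; SentencePiece's default model type
        "character_coverage": 0.9995,  # assumed; suggested for large character sets
    }


def train_and_tokenize(corpus_path, sample_text):
    """Train one tokenizer per vocab size and tokenize a sample with each."""
    import sentencepiece as spm  # needs `pip install sentencepiece`

    pieces = {}
    for size in VOCAB_SIZES:
        kwargs = spm_train_kwargs(corpus_path, size)
        Path(kwargs["model_prefix"]).parent.mkdir(parents=True, exist_ok=True)
        spm.SentencePieceTrainer.train(**kwargs)
        sp = spm.SentencePieceProcessor(model_file=kwargs["model_prefix"] + ".model")
        pieces[size] = sp.encode(sample_text, out_type=str)
    return pieces
```

Calling `train_and_tokenize("zhwiki.txt", "some sentence")` (with a hypothetical one-sentence-per-line corpus file) returns the subword pieces for each vocab size, which makes it easy to eyeball how the segmentation changes as the vocab grows.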

To do:

  1. Train LM using the tokens generated from 2.
  2. Transfer Learning using the LM on tasks used in dataset 2.
  3. Train 4-layered LMs and do transfer learning.
  4. Create more SP tokenizers with larger vocabs (50k, 70k, etc.) just to see the trend, and repeat 2-5.

Really good experiments. We added another hidden layer to ULMFiT to model words in German and Polish, and this was quite helpful. I’m going to work next week on integrating SentencePiece with fastaiv1 as well. If you have some results before then, please share!

Thanks @piotr.czapla! I just re-read your paper and you did mention the 4-layered arch was better on a large dataset, but I only remembered the figure showing the 3 and 4-layered ones were similar :stuck_out_tongue: Just added that to the experiments so I have a 6 (vocab) x 2 (layers of LM) matrix of experiments to run.

@piotr.czapla @Moody I’m curious to see how ULMFiT compares with BERT now that we have pre-trained BERT models. Google didn’t use SP for Chinese and essentially built the model using character level encoding. I don’t know how good that can be. Will find out!
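
For anyone wondering what character-level encoding means in practice, here is a minimal sketch. It is not BERT’s actual tokenizer: it only covers the basic CJK Unified Ideographs block (an assumption that handles most Simplified Chinese text), and the function name is made up for illustration.

```python
import re

# Basic CJK Unified Ideographs block only; rarer characters in the
# extension blocks are not handled in this sketch.
_CJK_CHAR = re.compile(r"([\u4e00-\u9fff])")


def char_level_tokenize(text):
    """Make every Chinese character its own token.

    Runs of other characters (Latin words, digits) stay together
    as long as they are not separated by whitespace.
    """
    # Surround each CJK character with spaces, then split on whitespace.
    spaced = _CJK_CHAR.sub(r" \1 ", text)
    return spaced.split()
```

With this, `char_level_tokenize("我爱NLP")` gives one token per Chinese character plus `NLP` as a single token, so the vocabulary is just the set of characters and no segmentation model is needed.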

Interesting experiments. I am looking forward to seeing more results from this project. Thanks for sharing!

Exactly! SP is by Google as well, and it seems better suited to Chinese than WordPiece, as it does not need spaces in the input sentence. I will try to find out more about this.

We could also try QRNN; it is a faster architecture than LSTM and gives similar results. Given that BERT has been released for Chinese, speed might be quite an important feature of ULMFiT.

Btw, Sebastian Ruder is working on moving ULMFiT to fastaiv1, and he wants to add SP tokenization and clean up the API. Maybe you want to join the effort. If so, have a look here: Multilingual ULMFiT

It’d be amazing to work on adding SP tokenization to ULMFiT! Thanks for the link although I think it may be a private group. If you are in it, can you please invite me?

Sorry, you don’t have access to that topic!

I don’t know how helpful I can be on this project but I’d love to understand SP and ULMFiT more, and diving into it is probably the best way to go!

Have you tried Sliced RNN that came out this year? 136X faster than standard RNN feels like an absolute highlight, and it seconds your comment on the speed advantage using QRNN over BERT. Moreover, even transfer learning on BERT may require multiple GPUs or TPUs. Seems soooo Google, lol.

The SRNN author has a GitHub repo using TF and Keras. I haven’t found anything on the forum about it yet. Maybe I’ll start a new thread and see whether others have experience with it.

I posted the wrong link :slight_smile: Here is the thread: Multilingual ULMFiT
I’m very happy that you are interested in helping out.

I haven’t! Thank you! The paper looks interesting, but it seems they haven’t compared their results with QRNN and they haven’t achieved SOTA. But you should definitely ask Sebastian what he thinks.

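@shoof How are your experiments on Chinese going? What does SP stand for? When applying ULMFiT to Chinese, which approach do you think is most workable: word-segmented text, treating each Chinese character as a separate token, or using the pinyin of the characters? Could you give some advice? With this method, can Chinese text classification now reach over 90% accuracy? If you get good results, I hope you will keep sharing. (I can mostly read English, but expressing myself in it is still somewhat difficult.)

Sorry, my experiments have stopped; work has kept me too busy… SP refers to SentencePiece, the tokenization algorithm for neural models. It gave good results on the German and Polish models (piotr is the author). I tried it on Chinese myself and didn’t find it better than Jieba, though that may be down to insufficient tuning! I couldn’t reach 90%, but I also haven’t found a suitable reference dataset with a good baseline. I’d advise against pinyin, because one sound mapping to many characters makes it too unreliable. This article is quite interesting: it uses the structure of Chinese characters as the smallest unit and also introduces other datasets. I plan to try it. How about your results?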
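
To make the homophone problem with pinyin concrete, here is a tiny sketch. The character-to-pinyin table is a hand-picked illustration with tones dropped (it is not from any dataset discussed in this thread), and the function names are made up.

```python
from collections import defaultdict

# Hand-picked illustration data: toneless pinyin for a few common characters.
PINYIN = {
    "是": "shi", "事": "shi", "市": "shi", "十": "shi", "时": "shi",
    "马": "ma", "妈": "ma", "吗": "ma",
}


def to_pinyin(text, mapping=PINYIN):
    """Replace each known character with its toneless pinyin syllable."""
    return [mapping.get(ch, ch) for ch in text]


def group_by_pinyin(mapping=PINYIN):
    """Collect the characters that collapse onto the same syllable."""
    groups = defaultdict(list)
    for char, syllable in mapping.items():
        groups[syllable].append(char)
    return dict(groups)
```

Here `to_pinyin("是")` and `to_pinyin("事")` produce the same token even though the characters mean different things, and `group_by_pinyin()` shows five characters sharing `shi`. Keeping tone marks would reduce the collisions but not remove them (是, 事, and 市 share the same tone).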

Have you tried this dataset?

Hi, can you share with us the progress on the Chinese language model? And can we experiment with some downstream tasks (sentiment classification) using it?

Jeremy pointed out a few days ago that there was an error in the SentencePiece documentation. This possibly explains why we couldn’t train the Chinese language model properly before.