Initially I used a corpus from Tsinghua University’s NLP group to train the LM and do classification with SentencePiece. I used a 32k-vocab SP tokenizer and got ~70% accuracy for classification, which was not a great result.
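For reference, a minimal sketch of that tokenizer training step, assuming the corpus is dumped to a plain-text file (`corpus.txt` is a placeholder name; 0.9995 character coverage is what the SentencePiece docs suggest for languages like Chinese):

```python
import sentencepiece as spm

# Train a 32k-vocab SentencePiece model on raw Chinese text,
# one sentence per line in corpus.txt (placeholder path).
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=zh_sp_32k '
    '--vocab_size=32000 --character_coverage=0.9995'
)

# Load the trained model and tokenize a sample sentence into subword pieces.
sp = spm.SentencePieceProcessor()
sp.Load('zh_sp_32k.model')
print(sp.EncodeAsPieces('这是一个测试'))
```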
I didn’t manage to find a Chinese equivalent of IMDb or Yelp. The standard datasets and benchmarks for text classification in Chinese seem a bit all over the place, and the results from here seem close to SOTA, but they only reported test performance on an arbitrarily chosen subset of the dataset mentioned above (an average F1 of 95% seems hard to beat).
Now I am ready to retrain the LM using fastai v1 and Wikipedia, then the classifier using the datasets from Yann LeCun’s papers.
So far my effort has been (a sketch of steps 3-4 follows this list):

1. Created 8k, 20k, and 32k SP tokenizers using Wikipedia.
2. Tokenized the Wikipedia text using the 3 tokenizers.
3. Train LMs using the tokens generated in 2.
4. Transfer learning using the LMs on the tasks from the datasets mentioned above.
5. Train 4-layered LMs and do transfer learning.
6. Create more SP tokenizers with larger vocabs (50k, 70k, etc.) just to see the trend, and repeat 2-5.
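Roughly how I plan to wire the SP models into fastai v1 for steps 3-4. The `SPTokenizer` wrapper, file names, and toy dataframes are all my own (hypothetical), and the fastai calls are the v1 API as I understand it, so treat this as a sketch rather than working code:

```python
import pandas as pd
import sentencepiece as spm
from fastai.text import *

# One of the SP models from step 1 (placeholder file name).
sp = spm.SentencePieceProcessor()
sp.Load('zh_sp_32k.model')

# Hypothetical wrapper plugging SentencePiece into fastai v1's tokenizer pipeline.
class SPTokenizer(BaseTokenizer):
    def tokenizer(self, t): return sp.EncodeAsPieces(t)
    def add_special_cases(self, toks): pass

tok = Tokenizer(tok_func=SPTokenizer, lang='zh')

# Toy frames standing in for the real Wikipedia / classification data.
train_df = pd.DataFrame({'label': [0, 1], 'text': ['这是一个测试', '另一个例子']})
valid_df = train_df.copy()

# Step 3: train the LM on SP-tokenized text.
data_lm = TextLMDataBunch.from_df('.', train_df, valid_df, tokenizer=tok, text_cols='text')
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)
learn.save_encoder('zh_wiki_enc')

# Step 4: transfer to the classification task, reusing the LM's vocab and encoder.
data_clas = TextClasDataBunch.from_df('.', train_df, valid_df, tokenizer=tok,
                                      vocab=data_lm.vocab,
                                      text_cols='text', label_cols='label')
clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clas.load_encoder('zh_wiki_enc')
clas.fit_one_cycle(1, 1e-2)
```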
Really good experiments! We added another hidden layer to ULMFiT to model words in German and Polish, and this was quite helpful. I’m going to work next week on integrating SentencePiece with fastai v1 as well. If you have some results before then, please share!
Thanks @piotr.czapla! I just re-read your paper, and you did mention the 4-layered architecture was better on a large dataset, but I only remembered the figure showing the 3- and 4-layered ones were similar. Just added that to the experiments, so I have a 6 (vocab) x 2 (layers of LM) matrix of experiments to run (sketched below).
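Not the actual training script, but roughly how I’m thinking of driving that grid. The sixth vocab size and the saved-DataBunch file names are placeholders, and `awd_lstm_lm_config` plus the `config` kwarg are fastai v1’s, as far as I can tell:

```python
from itertools import product
from fastai.text import *

# 6 vocab sizes x 2 layer counts = 12 runs.
# The sixth vocab size is a placeholder; only 8k-70k were named above.
VOCABS = [8000, 20000, 32000, 50000, 70000, 100000]
LAYERS = [3, 4]

for vocab, n_layers in product(VOCABS, LAYERS):
    config = awd_lstm_lm_config.copy()   # fastai v1's default AWD-LSTM config
    config['n_layers'] = n_layers        # 3-layer baseline vs. the 4-layer variant
    # Assumes a DataBunch per vocab size was tokenized and saved earlier.
    data_lm = load_data('.', f'data_lm_{vocab//1000}k.pkl')
    learn = language_model_learner(data_lm, AWD_LSTM, config=config, pretrained=False)
    learn.fit_one_cycle(10, 1e-2)
    learn.save(f'lm_{vocab//1000}k_{n_layers}l')
```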
@piotr.czapla @Moody I’m curious to see how ULMFiT compares with BERT now that we have pre-trained BERT models. Google didn’t use SP for Chinese and essentially built the model using character-level encoding. I don’t know how good that can be. Will find out!
It’d be amazing to work on adding SP tokenization to ULMFiT! Thanks for the link, although I think it may be a private group; all I get is “Sorry, you don’t have access to that topic!” If you are in it, can you please invite me?
I don’t know how helpful I can be on this project, but I’d love to understand SP and ULMFiT more, and diving into it is probably the best way to go!
Have you tried the Sliced RNN (SRNN) that came out this year? 136x faster than a standard RNN is an absolute highlight, and it seconds your comment on the speed advantage of QRNN over BERT. Moreover, even transfer learning on BERT may require multiple GPUs or TPUs. Seems soooo Google, lol.
The SRNN author has a GitHub repo using TF and Keras. I haven’t found anything on the forum about it yet. Maybe I’ll start a new thread and see whether others have experience with it.
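For anyone curious, here’s my toy Keras reading of the SRNN idea (not the author’s code): split the sequence into slices, run the same RNN over each slice in parallel, then run a second RNN over the slice summaries. All sizes below are made up for illustration, and the paper stacks more than two levels:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Toy 2-level Sliced RNN: a length-512 sequence is reshaped into 8 slices of
# 64 steps; TimeDistributed applies the same GRU to every slice independently
# (hence the parallel speedup), and a top-level GRU reads the 8 slice summaries.
SEQ_LEN, N_SLICES, EMB = 512, 8, 128
SLICE_LEN = SEQ_LEN // N_SLICES

inp = layers.Input(shape=(SEQ_LEN, EMB))
x = layers.Reshape((N_SLICES, SLICE_LEN, EMB))(inp)
x = layers.TimeDistributed(layers.GRU(64))(x)  # -> (batch, N_SLICES, 64)
out = layers.GRU(64)(x)                        # -> (batch, 64)
model = tf.keras.Model(inp, out)
model.summary()
```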