Initially I used dataset 1 (from Tsinghua University’s NLP group) to train the LM and do classification with SentencePiece. With a 32k-vocab SP tokenizer I got ~70% classification accuracy, which was not a great result.
I didn’t manage to find a Chinese equivalent of IMDb or Yelp. The standard datasets and benchmarks for Chinese text classification seem a bit all over the place; the results from here look close to SOTA, but they only report test performance on an arbitrarily chosen subset of the dataset mentioned above (an average F1 of 95% seems hard to beat).
Now I am ready to retrain the LM using fastai v1 and Wikipedia, and then the classifier using the datasets from Yann LeCun’s papers.
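For reference, a minimal sketch of how the LM-then-classifier part might look in fastai v1 (assuming fastai ≥ 1.0.46, CSV files with `text`/`label` columns, and the default AWD_LSTM architecture; all paths and column names are placeholders, and plugging the SP tokenizer into fastai’s `Tokenizer` is left out here):

```python
from fastai.text import *  # fastai v1 star import pulls in Path, DataBunches, learners

path = Path('data/zh')  # hypothetical working directory

# 1) Train the LM on the Wikipedia text (no English pretrained weights)
data_lm = TextLMDataBunch.from_csv(path, 'zh_wiki.csv', text_cols='text')
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn_lm.fit_one_cycle(4, 1e-2)
learn_lm.save_encoder('zh_wiki_enc')

# 2) Fine-tune a classifier on one of the Zhang/Zhao/LeCun datasets,
#    reusing the LM vocab and the saved encoder
data_clas = TextClasDataBunch.from_csv(path, 'train.csv', text_cols='text',
                                       label_cols='label',
                                       vocab=data_lm.train_ds.vocab)
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('zh_wiki_enc')
learn_clf.fit_one_cycle(4, 1e-2)
```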
So far, my plan and progress have been:
1. Created 8k, 20k, and 32k SP tokenizers from Wikipedia (see the sketch after this list).
2. Tokenized the Wikipedia text with the three tokenizers.
3. Train the LM on the tokens generated in step 2.
4. Do transfer learning with that LM on the tasks from dataset 2.
5. Train 4-layer LMs and repeat the transfer learning.
6. Create more SP tokenizers with larger vocabs (50k, 70k, etc.) just to see the trend, and repeat steps 2-5.
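A rough sketch of what steps 1-2 (and the larger-vocab variants in step 6) might look like with the sentencepiece Python API; file names and vocab sizes are placeholders, and `zh_wiki.txt` is assumed to be the extracted Wikipedia text, one sentence per line:

```python
import sentencepiece as spm

# Steps 1/6: train unigram SP models at several vocab sizes on the Wikipedia dump
# (character_coverage=0.9995 is the value the sentencepiece docs suggest for Chinese)
for vocab_size in (8000, 20000, 32000, 50000, 70000):
    spm.SentencePieceTrainer.Train(
        f'--input=zh_wiki.txt --model_prefix=sp_{vocab_size} '
        f'--vocab_size={vocab_size} --model_type=unigram --character_coverage=0.9995'
    )

# Step 2: tokenize the Wikipedia text with one of the trained models
sp = spm.SentencePieceProcessor()
sp.Load('sp_32000.model')
with open('zh_wiki.txt', encoding='utf-8') as f:
    tokenized = [sp.EncodeAsIds(line.strip()) for line in f]
```

For the 4-layer LMs in step 5, my understanding is that fastai v1 lets you copy `awd_lstm_lm_config`, set `n_layers=4`, and pass it via `config=` to `language_model_learner` (with `pretrained=False`, since the shipped weights are for the 3-layer model).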