Continuing the discussion from Multilingual ULMFiT:
I’ve been working on applying ULMFiT to Japanese using fast.ai v1.
The code and the pretrained models can be found here.
- Pretrained a sentencepiece model for tokenization on a Wikipedia dump
- Pretrained a language model on a Wikipedia dump (100M tokens)
- Fine-tuned the language model and trained a classifier for the following datasets:
  - Aozora Bunko
  - MedWeb
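The tokenizer used here is Google's sentencepiece (which defaults to a unigram language model), but the core idea of learning subword units directly from raw text can be illustrated with a single BPE-style merge step. The toy corpus and all names below are illustrative only, not the actual training code:

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[a, b] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Merge every occurrence of `pair` into a single subword symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: character sequences with frequencies. Japanese has no whitespace
# word boundaries, which is why subword models learned from raw text fit it well.
vocab = {tuple("低い"): 5, tuple("低く"): 3, tuple("高い"): 4}
pair = most_frequent_pair(vocab)   # ('低', 'い') is the most frequent pair (5 occurrences)
vocab = merge_pair(vocab, pair)    # '低い' becomes a single subword unit
```

Repeating this merge step (or, for the unigram model, pruning a large seed vocabulary) yields the fixed-size subword vocabulary the language model is trained on.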
Details of the classification tasks
I could not find any widely used benchmark datasets for Japanese text classification, so I chose two publicly available datasets: Aozora Bunko and MedWeb.
Aozora Bunko is a digital library with a large collection of Japanese books. I used books with expired copyrights for this task.
I followed the task introduced in this repository, in which the aim is to predict the author of a given line of text from 5 candidates.
There is no public leaderboard for this task.
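As a sketch of how that task is framed, each line of a book becomes one labelled example with the author as the 5-way classification target. The authors and lines below are illustrative examples, not the actual dataset:

```python
# Hypothetical framing of the Aozora Bunko task: map each line of text to its
# author; the classifier then predicts the author from 5 candidates.
books = {
    "夏目漱石": ["吾輩は猫である。", "名前はまだ無い。"],
    "芥川龍之介": ["或日の暮方の事である。"],
}

# Flatten into (text, label) pairs, the usual input format for a text classifier.
examples = [(line, author) for author, lines in books.items() for line in lines]
```

Predicting an author from a single line (rather than a whole passage) keeps the task hard enough to differentiate models.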
MedWeb is a collection of pseudo-tweets about diseases and symptoms. The task is to predict the diseases/symptoms implied in each tweet in a multi-label setting.
There was a competition using this dataset, and its leaderboard is available.
SOTA is F1_micro = 0.920; my best score was 0.893, which would rank #4 on the leaderboard.
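For reference, F1_micro in the multi-label setting pools true positives, false positives, and false negatives over all labels before computing a single F1 score. A minimal sketch, with made-up symptom labels rather than MedWeb's actual label set:

```python
def f1_micro(y_true, y_pred):
    """Micro-averaged F1 for multi-label predictions given as sets of labels."""
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        tp += len(true & pred)   # labels predicted and actually present
        fp += len(pred - true)   # labels predicted but absent
        fn += len(true - pred)   # labels present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative labels only; MedWeb's actual label set differs.
y_true = [{"influenza"}, {"cold", "fever"}, set()]
y_pred = [{"influenza"}, {"cold"}, {"fever"}]
score = f1_micro(y_true, y_pred)   # tp=2, fp=1, fn=1 → 2/3
```

Because every label decision is pooled, frequent labels dominate the score, which is why F1_micro (rather than F1_macro) is a common choice for imbalanced multi-label tasks like this one.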
Feel free to use the code/pretrained models and let me know if you have any comments or questions!