Continuing the discussion from Multilingual ULMFiT:
I’ve been working on applying ULMFiT to Japanese using fast.ai v1.
The code and the pretrained models can be found here.
- Pretrained a sentencepiece model for tokenization on a Wikipedia dump
- Pretrained a language model on a Wikipedia dump (100M tokens)
- Fine-tuned the language model and trained a classifier for the following datasets:
  - Aozora Bunko
  - MedWeb
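The tokenizer used here is Google's sentencepiece (which defaults to a unigram language model), but the core idea of learning subword units directly from raw text can be illustrated with a single BPE-style merge step. The toy corpus and all names below are illustrative only, not the actual training code:

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[a, b] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Merge every occurrence of `pair` into a single subword symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: character sequences with frequencies. Japanese has no whitespace
# word boundaries, which is why subword models learned from raw text fit it well.
vocab = {tuple("低い"): 5, tuple("低く"): 3, tuple("高い"): 4}
pair = most_frequent_pair(vocab)   # ('低', 'い') is the most frequent pair (5 occurrences)
vocab = merge_pair(vocab, pair)    # '低い' becomes a single subword unit
```

Repeating this merge step (or, for the unigram model, pruning a large seed vocabulary) yields the fixed-size subword vocabulary the language model is trained on.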
Details of the classification tasks
I could not find any widely used benchmark datasets for Japanese text classification, so I chose two publicly available datasets: Aozora Bunko and MedWeb.
Aozora Bunko is a digital library with a large collection of Japanese books. I used books with expired copyrights for this task.
I followed the task introduced in this repository, in which the aim is to predict the author of a given line of text from 5 candidates.
There is no public leaderboard for this task.
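As a sketch of how that task is framed, each line of a book becomes one labelled example with the author as the 5-way classification target. The authors and lines below are illustrative examples, not the actual dataset:

```python
# Hypothetical framing of the Aozora Bunko task: map each line of text to its
# author; the classifier then predicts the author from 5 candidates.
books = {
    "夏目漱石": ["吾輩は猫である。", "名前はまだ無い。"],
    "芥川龍之介": ["或日の暮方の事である。"],
}

# Flatten into (text, label) pairs, the usual input format for a text classifier.
examples = [(line, author) for author, lines in books.items() for line in lines]
```

Predicting an author from a single line (rather than a whole passage) keeps the task hard enough to differentiate models.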
MedWeb is a collection of pseudo-tweets about diseases and symptoms. The task is to predict the diseases/symptoms implied in each tweet in a multi-label setting.
There was a competition using this dataset, and its leaderboard is available.
SOTA is F1_micro = 0.920; my best score was 0.893, which would rank #4 on the leaderboard.
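For reference, F1_micro in the multi-label setting pools true positives, false positives, and false negatives over all labels before computing a single F1 score. A minimal sketch, with made-up symptom labels rather than MedWeb's actual label set:

```python
def f1_micro(y_true, y_pred):
    """Micro-averaged F1 for multi-label predictions given as sets of labels."""
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        tp += len(true & pred)   # labels predicted and actually present
        fp += len(pred - true)   # labels predicted but absent
        fn += len(true - pred)   # labels present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative labels only; MedWeb's actual label set differs.
y_true = [{"influenza"}, {"cold", "fever"}, set()]
y_pred = [{"influenza"}, {"cold"}, {"fever"}]
score = f1_micro(y_true, y_pred)   # tp=2, fp=1, fn=1 → 2/3
```

Because every label decision is pooled, frequent labels dominate the score, which is why F1_micro (rather than F1_macro) is a common choice for imbalanced multi-label tasks like this one.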
Feel free to use the code/pretrained models and let me know if you have any comments or questions!