ULMFiT - Japanese


(Shun Tsuruno) #1

Continuing the discussion from Multilingual ULMFiT:

I’ve been working on applying ULMFiT to Japanese using fast.ai v1.
The code and the pretrained models can be found here.

Summary

  • Pretrained a sentencepiece model for tokenization using a Wikipedia dump
  • Pretrained a language model using a Wikipedia dump (100M tokens)
  • Fine-tuned the language model and trained a classifier for the following datasets
    • Aozora Bunko
    • MedWeb

Details of the classification tasks

I could not find any well-known benchmark datasets for classification tasks in Japanese.
So I chose two publicly available datasets: Aozora Bunko and MedWeb.

Aozora Bunko

Aozora Bunko is a digital library with a large collection of Japanese books. I used books with expired copyrights for this task.
I followed the task introduced in this repository, in which the aim is to predict the author of a given line of text from 5 candidates.
There is no public leaderboard.

MedWeb

MedWeb is a collection of pseudo-tweets about diseases/symptoms. The task is to predict the diseases/symptoms implied by each tweet in a multi-label setting.
There was a competition using this dataset, and the leaderboard is available.
The SOTA is F1_micro = 0.920; my best score was 0.893, which ranks #4 on the leaderboard.
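
For anyone unfamiliar with the metric, here is a minimal sketch of how a micro-averaged F1 score is computed in a multi-label setting. The labels below are toy data (three made-up labels, not the MedWeb ones), and `micro_f1` is just an illustrative helper; I am assuming the leaderboard pools true/false positives over all labels in the usual way.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data.

    y_true and y_pred are lists of rows, one row per example,
    each row holding a 0/1 flag per label. Counts are pooled over
    every (example, label) pair before computing F1.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1      # label present and predicted
            elif p:
                fp += 1      # predicted but not present
            elif t:
                fn += 1      # present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: 3 pseudo-tweets, 3 hypothetical labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
print(f"F1_micro = {micro_f1(y_true, y_pred):.3f}")
```

Because the counts are pooled before the score is computed, frequent labels influence the result more than rare ones.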

Feel free to use the code/pretrained models and let me know if you have any comments or questions!


Language Model Zoo :gorilla:
#2

When I run the 20181126_pretrain_lm notebook, I get the error:
`---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
in ()
----> 1 from fastai import F

ImportError: cannot import name 'F'`

Do you know what could be causing this? My fastai version is 1.0.41.


(Shun Tsuruno) #3

My fastai version is 1.0.25.dev0, so the difference in our versions could be the cause of the error.
Could you try an older version of fastai?
I’m sorry for the inconvenience.
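
A quick way to check which version is installed before running the notebooks is sketched below. `parse_version` and `installed_version` are hypothetical helpers written for this post, not part of fastai; the only library call used is the standard `importlib.metadata.version`.

```python
from importlib import metadata


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple.

    Parsing stops at the first non-numeric piece, so a suffix such as
    'dev0' is ignored.
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


NOTEBOOK_VERSION = "1.0.25.dev0"  # the version the notebooks were run with

current = installed_version("fastai")
if current is None:
    print("fastai is not installed in this environment")
elif parse_version(current) != parse_version(NOTEBOOK_VERSION):
    print(f"installed fastai {current}; notebooks were run with {NOTEBOOK_VERSION}")
else:
    print(f"fastai {current} matches the notebook version")
```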


#4

@s.tsuruno Thanks, that worked!


#5

@s.tsuruno Now, though, I am getting NameError: name '_NVRTCProgram' when trying to run learn.lr_find() in the 20181213_Aozora_classification_assessment_vocab8k notebook. Did you encounter the same problem? cupy is installed in the same environment as fastai.


(Shun Tsuruno) #6

@gradstudentdescent I haven’t encountered that error. It seems to be related to CUDA runtime compilation. Have you confirmed whether CUDA is working properly?


(Piotr Czapla) #7

@s.tsuruno We are putting together a summary of ULMFiT efforts. Please have a look here: Multilingual ULMFiT. I’ve added you to the thread. Your datasets don’t fall directly under the Sentiment Analysis or News Classification categories, but the results are really interesting and it would be good to include them in the summary.


(Daisuke Shimamoto) #8

@s.tsuruno Do you happen to have the train/validation losses and accuracy when pre-training on Wikipedia?

I’ve been playing around with ULMFiT (tokenized with MeCab + NEologd instead of sentencepiece as in your setup).

I got about

train_loss 	valid_loss 	accuracy
3.423046 	3.652719 	0.376947

after 10 epochs, but I’m not sure whether that’s good or bad.
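
One way to make these numbers easier to compare is to convert the losses to perplexity. This is only a sketch and assumes the reported loss is the mean cross-entropy per token in nats (natural log); the helper name `perplexity` is mine.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


# Losses reported above after 10 epochs.
train_loss = 3.423046
valid_loss = 3.652719

print(f"train perplexity: {perplexity(train_loss):.2f}")
print(f"valid perplexity: {perplexity(valid_loss):.2f}")
```

Keep in mind that perplexity is computed per token, so values from differently tokenized models (MeCab + NEologd vs. sentencepiece) are not directly comparable.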