I’ve pretrained a language model for Japanese at our company and would like to contribute it to the model zoo.
However, the official zoo isn’t open yet, and I haven’t been able to figure out the next step.
@piotr.czapla Can you kindly point me to what I should do next?
Here’s what I’ve done so far:
- Cloned @piotr.czapla’s ulmfit-multilingual repo (https://github.com/n-waves/ulmfit-multilingual/projects/1)
- Created a local branch
- Refactored the code to use the SentencePiece tokenizer instead of the Moses tokenizer (on Japanese Wikipedia) before LM pretraining
- Pretrained the language model on Japanese Wikipedia
- Fine-tuned the LM and ran classification on the MedWeb (medical tweets) and Aozora-bunko (license-free books) datasets