And ULMFiT has superior results on the CLS dataset: https://arxiv.org/pdf/1909.04761.pdf
Is anyone willing to share pretrained (English) ULMFiT or MultiFiT LM weights, with the SentencePiece tokenizer?
Update:
I trained it myself: https://www.kaggle.com/manyregression/fastai-en-wiki-500kk-pretrained-sp
There's also more in the versions of this notebook: a 100kk-token run and AWD-LSTM weights.
https://www.kaggle.com/manyregression/sp-wikitext-vocab-lm-ipynb?scriptVersionId=27995530
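For reference, a minimal fastai v1 sketch of the kind of setup those kernels describe: pretraining an AWD-LSTM LM on raw Wikipedia text with a SentencePiece vocab and saving the weights plus vocab for reuse. The paths, column name, vocab size, and hyperparameters are assumptions, not taken from the kernels.

```python
# Minimal sketch, assuming fastai v1 and a csv of raw Wikipedia text
# with a single 'text' column; all paths and sizes are placeholders.
from fastai.text import *
import pandas as pd
import pickle

path = Path('data/wiki_en')            # assumed data folder
df = pd.read_csv(path/'wiki.csv')      # assumed raw wiki text dump

# SPProcessor replaces the default spaCy tokenizer + numericalizer with a
# SentencePiece model trained on this corpus
data_lm = (TextList.from_df(df, path, cols='text',
                            processor=SPProcessor(max_vocab_sz=15000))
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=128))

learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.1)
learn.fit_one_cycle(10, 3e-3, moms=(0.8, 0.7))

# persist both the weights and the vocab: downstream fine-tuning/classification
# must reuse exactly this token-to-index mapping
learn.save('sp_wiki_lm')
learn.save_encoder('sp_wiki_enc')
pickle.dump(data_lm.vocab.itos, open(path/'models'/'sp_itos.pkl', 'wb'))
```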
Another question: there's no point in using the pretrained https://s3.amazonaws.com/fast-ai-modelzoo/wt103-fwd weights if I chose SentencePiece, right?
Correct. You need consistent indices and tokens for encoding (training) and decoding (inference).
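To make the "consistent indices and tokens" point concrete, here is a small fastai v1 sketch of the two pairings that stay consistent. `data_lm_spacy` and `data_lm_sp` are assumed databunches built with the default tokenizer and with SPProcessor respectively; the file names in the second pairing are placeholders.

```python
# Sketch, assuming fastai v1. The pretrained embedding rows are matched to your
# new vocab by token string, and wt103's word-level tokens barely overlap with
# SentencePiece subwords, so mixing them leaves you with mostly random embeddings.
from fastai.text import *

# consistent pairing (a): default spaCy tokenization + the wt103 weights fastai downloads
learn_a = language_model_learner(data_lm_spacy, AWD_LSTM, pretrained=True)

# consistent pairing (b): SentencePiece databunch + weights pretrained on the same
# SentencePiece vocab, passed as (weights, itos) file names under learn.path/models
learn_b = language_model_learner(data_lm_sp, AWD_LSTM,
                                 pretrained_fnames=('sp_wiki_lm', 'sp_itos'))
```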
Funny, but I got slightly worse results when I fine-tuned the pretrained spaCy-tokenized weights with SP and then trained a classifier: https://www.kaggle.com/manyregression/fastai-ulmfit-google-quest-classifier-spacy?scriptVersionId=27771121
Any ideas why the ULMFiT English regression model pretrained on 500kk wiki tokens failed, while the 100kk one just gave worse results?
Here's the 500kk version: https://www.kaggle.com/manyregression/fastai-ulmfit-google-quest-sp?scriptVersionId=28040078
For 100kk, the Spearman metric was 0.26 at best.
Hi, I built a Persian language model.
Here is the topic.
Hi, I'm interested in knowing about your work. I'm a PhD student at Tehran University.
Could someone guide me on how to implement MultiFiT for a new language (Persian)?
This is the notebook:
It reads a pretrained model for Japanese, but I guess there is no such model for Persian. Also, I don't know the format of the models. I found a pretrained model for Persian at the following link,
however, I don't know if that model fits the project above.
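For what it's worth, here is a hedged fastai v1 sketch of the format a ULMFiT/MultiFiT pretrained LM is usually distributed in and how to point a notebook at it for a new language. All file names and the `data_lm_fa` databunch are hypothetical; a downloaded Persian model only "fits" if it provides matching weights and vocab (and the SentencePiece model, if it was trained with one).

```python
# Sketch, assuming fastai v1; every file name below is a placeholder.
from fastai.text import *

path = Path('data/persian')
# expected layout under path/'models' (assumed):
#   fa_wt_lm.pth  - AWD-LSTM (or QRNN, for MultiFiT) weights
#   fa_itos.pkl   - pickled list mapping index -> token, in the same order
#                   the weights were trained with
learn = language_model_learner(data_lm_fa, AWD_LSTM,
                               pretrained_fnames=('fa_wt_lm', 'fa_itos'),
                               drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)  # fine-tune the LM on a Persian target corpus first
```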
I was so glad to have Ines and Matt presenting in person about the new features of spaCy v3.0. Highlights include the pipeline configuration system that stores all the settings and hyperparameters in one place, and integrations with other popular open-source tools (such as Weights & Biases and FastAPI). My favorite feature is the ability to define (i.e. hard-code) your own acronyms for specific domains or use cases. Enjoy!