Attention folks working on NLP: starting with 1.0.38 fastai now assumes batch is the first dimension everywhere (it helps us in terms of API behind the scenes) which means text too. So some scripts might need a bit of adjustment (everything in fastai has been updated for that change).
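For scripts written against the old sequence-first layout, the adjustment is usually just swapping the first two dimensions. Here is a minimal sketch in plain Python, with nested lists standing in for tensors. The helper name `to_batch_first` is made up for illustration; on a PyTorch tensor the equivalent would be `x.transpose(0, 1)`.

```python
def to_batch_first(seq_first):
    """Turn a [seq_len][batch_size] nested list into [batch_size][seq_len]."""
    return [list(column) for column in zip(*seq_first)]

# Old layout: one row per time step, one column per document in the batch.
seq_first = [
    [11, 21],
    [12, 22],
    [13, 23],
]

# New layout: one row per document, one column per time step.
batch_first = to_batch_first(seq_first)
print(batch_first)
```

After the swap, `batch_first[0]` holds every token id of the first document.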
This should be the last breaking change and you can expect a more stable API now.
I looked at the paper you cited and found no mention of the amount of English text in the dataset. What confuses me a bit are the reviews, like the ones in the photo below, that your model correctly classifies.
The model was trained on part of an Arabic Wikipedia dump. The articles contain some English (and smaller amounts of other languages). If you look at the text that has “are the only good things in this hotel”, this review is around 400 words long, of which 170 are Arabic and the rest English, so the class is most likely determined by the Arabic part.
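To check how much of a mixed review is actually Arabic, something like the sketch below could be used. It assumes whitespace tokenization and counts a word as Arabic if any character falls in the basic Arabic Unicode block (U+0600 to U+06FF). The function name and the sample review are made up for illustration and are not taken from the dataset.

```python
def arabic_word_share(text):
    """Return (arabic_words, total_words) for a whitespace-tokenized text."""
    words = text.split()
    # A word counts as Arabic if any character is in the basic Arabic block.
    arabic = sum(
        1 for word in words
        if any("\u0600" <= ch <= "\u06FF" for ch in word)
    )
    return arabic, len(words)

# Made-up mixed-language review, for illustration only.
review = "الفندق سيء جدا but the pool and breakfast are the only good things in this hotel"
arabic, total = arabic_word_share(review)
print(f"{arabic}/{total} words are Arabic ({arabic / total:.1%})")
```

For the review discussed above, the same ratio would be roughly 170/400, i.e. a bit over 40% Arabic.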
Progress on Thai.
There were many breaking changes while I was training, so I kept using 1.0.22 (I will now use 1.0.38 for a more stable API). In my case, LSTM still performs better than QRNN, even though the latter trains faster. There is no fine-tuned BERT performance published on the benchmark Wongnai dataset yet, but compared to pretrained BERT, ULMFiT still has state-of-the-art performance.
Thank you very much for the clarification. Nevertheless, I don’t know if this is a good move for people who want to bring fastai models into production, since ONNX export doesn’t seem to work for now with the batch_first option set to true. Several issues have been raised on the PyTorch GitHub.
@pjetro I hadn’t noticed your question before. Here are some answers:
Hi Tomek, I didn’t get your question. It is meant to be used on different languages, with or without spaCy; the point is that we are going to run some experiments to see what gives us the best results. Sebastian initially wanted to keep Moses tokenization and SentencePiece, as he thought they would get us better accuracy on local languages. I’ve added spaCy so we can compare the performance.
Yeah, we are playing around a bit to see how ULMFiT works on XNLI and whether it can be improved using tricks from papers like ELMo. I’ve started this repo so we can do that without disturbing the course, and once we have things that work we can contribute them to fastai (hence the fastai_contrib package).
What is important to me is that we can replicate the fantastic classification results on other languages using one repository and maybe we can tackle XNLI as it is essentially a more advanced classification.
What tweaks do you have in mind?
I think we are very close to the paper results, my models on imdb get the following error rates:
5.4% sentencepiece and 4 layer lstm
5.1% moses + fastai preprocessing
5.2% fastai tokenization (starting from wt103)
And the performance reported in the paper is 5.0 to 5.3 on a single model, see the quote below:
Impact of bidirectionality: At the cost of training a second model, ensembling the predictions of a forward and backwards LM-classifier brings a performance boost of around 0.5–0.7. On IMDb we lower the test error from 5.30 of a single model to 4.58 for the bidirectional model.
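As a rough illustration of what ensembling a forward and a backward classifier can look like, here is a minimal sketch that assumes the ensemble simply averages the per-class probabilities of the two models. The function names and the probabilities are made up; this is not the code used for the numbers above.

```python
def ensemble(forward_probs, backward_probs):
    """Average per-class probabilities from two classifiers, document by document."""
    return [
        [(f + b) / 2 for f, b in zip(f_doc, b_doc)]
        for f_doc, b_doc in zip(forward_probs, backward_probs)
    ]

def predict(probs):
    """Pick the index of the most probable class for each document."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

# Made-up probabilities for three documents and two classes.
forward = [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]]
backward = [[0.8, 0.2], [0.3, 0.7], [0.30, 0.70]]

print("forward only:", predict(forward))
print("ensemble:    ", predict(ensemble(forward, backward)))
```

In this toy data the third document is borderline for the forward model, and the backward model tips the averaged prediction to the other class.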
Sure, we keep this repo separate so that more people can contribute without making it harder for Sylvain and Jeremy to manage. That said, I was hoping it would be used mostly to get the multilingual part polished. How about we get on a chat and discuss?
Piotr, thank you for your answer. Sorry for the delay, just got back from a long Xmas break.
The first part of my post was not really a question, more of a sanity check about what that repo is meant to be.
I want to focus on tasks with a small dataset for the downstream task (+ short documents)
for a start, I wanted to get rid of the sequence padding in the classifier, by using PackedSequences. I know there was a failed attempt, but I think it could work and be beneficial. In progress now.
in LM fine-tuning, the current approach of just concatenating everything into 1 giant text might not be optimal for short sequences
when fine-tuning on a small dataset, we effectively throw away all embeddings that are not present in the downstream task’s training set. We could keep them, up to some number (the typical 30k/60k limit), and have fewer UNK tokens at test time. I think it would be interesting to see if the embedding values that were not fine-tuned stay somewhat relevant
I was also thinking about jointly training the classifier and the language model (for the downstream task) as a way of sort-of regularization. I guess this is a long shot, and not easy to implement.
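The idea of keeping pretrained embeddings for tokens that are missing from the downstream training set can be sketched as a vocabulary-building step. This is only an illustration under assumptions: `build_vocab` is a hypothetical helper, the pretrained vocabulary is assumed to be sorted by frequency, and the toy sentences are made up.

```python
from collections import Counter

def build_vocab(task_tokens, pretrained_vocab, max_size):
    """Task tokens first (most frequent first), then pretrained tokens up to max_size."""
    vocab = [tok for tok, _ in Counter(task_tokens).most_common(max_size)]
    seen = set(vocab)
    for tok in pretrained_vocab:  # assumed to be sorted by frequency
        if len(vocab) >= max_size:
            break
        if tok not in seen:
            vocab.append(tok)
            seen.add(tok)
    return vocab

# Toy downstream training set and toy pretrained vocabulary.
task_tokens = "the room was clean the staff was rude".split()
pretrained_vocab = ["the", "of", "was", "hotel", "dirty", "friendly"]

vocab = build_vocab(task_tokens, pretrained_vocab, max_size=9)
task_only = set(task_tokens)

test_tokens = "the hotel was dirty".split()
unk_task_only = [t for t in test_tokens if t not in task_only]
unk_extended = [t for t in test_tokens if t not in vocab]
print("UNK with task-only vocab:", unk_task_only)
print("UNK with extended vocab: ", unk_extended)
```

Rows of the embedding matrix for the extra tokens would simply be copied from the pretrained model; whether they stay useful without fine-tuning is exactly the open question.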
Any comments? E.g., do you already know that something from the list will not work?
What settings/hyperparameters do you use? Could you share the commands used? And have you uploaded the trained models somewhere?
Also, what is the status on bidirectional models and classifiers?
Sure, but what chat do you mean?
BTW the minor changes I wanted to make already were mostly in README, but they might have gotten outdated now. I will check tomorrow.
am I missing something, or is there currently no way to train a backwards LM (and classifier)? I can try to add it, just want a confirmation
you’ve mentioned starting from wt103. You meant this wt103_v1 version, right? Or the old one, without the _v1 suffix?
BTW, not sure if this is a common view or not, but I do not really like using those pretrained models, as it is not clear to me how exactly they were trained. Clear how something was trained == able to reproduce it (input data, software versions, all parameters, etc)
Maybe for ulmfit-multilingual, when pretrained models are ready to be published, it would be good to publish them along with some detailed instructions on how to reproduce? Or perhaps Dockerfiles, to have the environment better controlled?
Hi all, I did some searching in the forums and online but did not find any results; apologies if this has been asked and answered before.
Has anyone tried fine-tuning twice? Would it be logical to train a LM on a large corpus (for instance Spanish Wiki), then fine-tune that LM on a smaller corpus (Mexican-Spanish text scraped from local news articles), and finally fine-tune on your text classification dataset to predict the different classes? I’m thinking that this approach, conceptually, should help with under-resourced languages.
That is pretty much the intention of ULMFiT
A) train on a big corpus so the model understands the language
B) fine-tune on the domain-specific task
B.1) one part without classes but from the specific domain, e.g. IMDb reviews without classes
B.2) one part with classes, e.g. IMDb reviews with classes
The idea of B.1 is to reduce the need for text with classes, as these are often more sparse.
If, however, you think of B.1 as a second large generic corpus, then I would merge it with A and train both the tokenizer and the neural net on both upfront.
Your assumption of training a LM on different corpora holds as long as those corpora use the “same” vocab. In the text classification example from the lesson, the goal of fine-tuning the LM on the reviews is to let the LM better fit the language used in those reviews. That way, the signal (input) it provides to the classification “head” will be more accurate. So it is hard to see how this approach might help with tasks in different under-resourced languages.
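One quick way to judge whether two corpora share the “same” vocab is to measure how many token occurrences in the target corpus never appear in the source corpus. The sketch below assumes whitespace tokenization; `oov_rate` is a hypothetical helper and the two sentences are made up.

```python
def oov_rate(source_tokens, target_tokens):
    """Fraction of target token occurrences that never appear in the source corpus."""
    if not target_tokens:
        return 0.0
    source_vocab = set(source_tokens)
    missing = sum(1 for tok in target_tokens if tok not in source_vocab)
    return missing / len(target_tokens)

# Toy stand-ins for a large general corpus and a smaller regional one.
general = "el coche está en la calle y el autobús también".split()
regional = "el carro está en la calle y el camión también".split()

print(f"OOV rate: {oov_rate(general, regional):.0%}")
```

A low rate suggests most pretrained embeddings carry over to the second corpus; a high rate means much of the vocabulary would have to be learned from scratch during fine-tuning.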
Thanks for the response @fabris. My aim regarding under-resourced languages was to use a well-resourced language (e.g. Spanish) to train a LM which can be fine-tuned to a derivative or dialect (e.g. Mexican-Spanish) of this well-resourced language. The resulting LM would then hopefully have knowledge of the structure of Mexican-Spanish (which it would have learned from the Spanish model, plus the common vocab). This model could then be used to fine-tune a Mexican-Spanish classifier.
At least this is my logic. I was just curious whether anyone has tried a similar approach and got good/bad results. If this doesn’t sound like a terrible idea to the folks on the forum, then I’ll give it a bash and report on my results.