Great, that should improve val_loss! If you finish training, please upload the itos file as well, which allows fastai to map the pretrained vocabulary onto the vocabulary of the new fine-tuning dataset.
BTW: I am down to collaborate and do some of the training myself if helpful. I think having a great pretrained spanish language model is super important for a number of downstream tasks.
@imaginary @Andreas_Daiminger would you be interested in collaborating on training a SOTA TransformerXL for Spanish? I think our objective should be to improve on the impressive 18.5 perplexity Andreas achieved with an LSTM.
I would definitely be interested in training Transformer XL in Spanish. It is much more computationally intensive than ULMFiT though.
Does anybody know if there is a pre-trained Spanish BERT already?
I have USD 300 in Google Cloud credits; do you think this is enough? I would also need to train a backwards model, since I am trying to create a model that generates rap lyrics and a backwards model is necessary for rhymes.
This is about 200 hours on a GCP P100 instance. Hard to say how far you can get with this.
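For transparency, the arithmetic behind that estimate, assuming an on-demand P100 rate of roughly USD 1.50/hour (an assumed figure; actual GCP pricing varies by region and over time):

```python
# Rough budget check: how many GPU hours do the credits buy?
# The hourly rate is an assumption (~$1.50/h for an on-demand P100);
# check current GCP pricing before relying on this.
def gpu_hours(budget_usd, hourly_rate_usd):
    """Return how many GPU hours a budget covers at a given hourly rate."""
    return budget_usd / hourly_rate_usd

print(gpu_hours(300, 1.50))  # → 200.0
```

Preemptible instances are substantially cheaper, so the same budget could stretch further if the training script checkpoints regularly.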
I know that BERT was trained with an insane amount of computation and on a huge dataset, not only Spanish Wikipedia. It might be very hard to compete with that. I also read that they will release a pretrained Spanish BERT at some point. Have you tried multilingual BERT for the task? Some researchers at my job are getting really good results for question answering with it.
Thank you! I am very happy to collaborate! I was thinking we could expand the dataset by taking the Spanish pages from the URL list collected by jcpeterson’s OpenWebText. An easy filter could be the country domain, though I don’t know how well that would work.
Where is this available?
You can find it here, although I don’t think you can use this repo for a multilingual LM.
Interesting. I did not try multilingual BERT because I believe the model cannot be used for language modelling (I found APIs for other uses, such as question answering). BERT was pretrained on masked language modelling and next-sentence prediction, not on classic (causal) language modelling. Anyway, I think it would be cool to eventually have a state-of-the-art Spanish model in the fastai library. I don’t know if BERT or GPT-2 will be added to the library soon.
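To illustrate the difference between the two objectives: a causal LM predicts each next token from the left context only, while BERT’s masked objective hides tokens anywhere in the sequence and predicts them from both directions. A toy sketch of the two input/target layouts (just data preparation, no model involved):

```python
import random

def causal_lm_pair(tokens):
    """Classic (causal) LM: input is the sequence, target is the same sequence shifted left."""
    return tokens[:-1], tokens[1:]

def masked_lm_pair(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM: randomly replace tokens with [MASK]; targets are the hidden originals."""
    rng = random.Random(seed)
    inp, tgt = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inp.append("[MASK]")
            tgt.append(tok)    # loss is computed on the hidden token
        else:
            inp.append(tok)
            tgt.append(None)   # no loss on unmasked positions
    return inp, tgt

sentence = "el modelo genera letras de rap".split()
print(causal_lm_pair(sentence))
print(masked_lm_pair(sentence))
```

This is why a pretrained BERT can fill in blanks but cannot directly score or generate text left-to-right the way ULMFiT or TransformerXL can.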
This is a very interesting approach and I speculate that there would be much more data available with OpenWebText. We could even combine the datasets if necessary (Reddit + Wiki103).
If you want more data, I’ve used OpenSubtitles before to make a chatbot. It’s a dataset of movie subtitles and the alignments between them. The Spanish monolingual dataset has 1.3G tokens, more than 10× the size of WikiText-103.
Hi @imaginary , all,
Do you know if there is an updated version of what you shared on Drive?
It was a great start for me and now that I am taking NLP - transfer learning more seriously I was wondering if there is a more updated spanish baseline?
Thanks a lot!
I think they are trying to collect baseline models for all languages in one place.
I am taking a look, but I think it is a bit over my current level to contribute to/expand the project.
I was looking for more recent versions of the pretrained files to be used in the language_model_learner function. Below is an example of this call; I am looking for the files named “FILE_LM_ENCODER” and “FILE_ITOS”.
learn = language_model_learner(data_lm, AWD_LSTM, pretrained_fnames=[FILE_LM_ENCODER, FILE_ITOS], drop_mult=0.3)
but I have not been able to find a repository with updated ready-to-use baselines for different languages (Spanish in my case).
Thanks in advance for any hint!
I have a problem now using the encoder and itos files from @imaginary, which I had been using with no issues for the last few months.
Same code, same pretrained files, and now language_model_learner is not working… do you know if anything has changed in the library?
I posted this question in the link below but I thought it was relevant here.
Hi again, I couldn’t find the pretrained models on that GitHub account.
Maybe it only contains instructions on how to generate new ones and the actual models are elsewhere?
I’ve trained a Transformer XL in Spanish with fastai v1.0.57
on a wiki corpus of ~500M tokens.
It achieved 43% accuracy (18.8 perplexity).
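For anyone comparing numbers: word-level perplexity is just the exponential of the per-token cross-entropy loss, so 18.8 perplexity corresponds to a validation loss of about 2.93:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp of the average per-token cross-entropy loss."""
    return math.exp(cross_entropy_loss)

print(round(perplexity(2.93), 1))  # → 18.7
print(round(math.log(18.8), 2))    # loss implied by 18.8 perplexity → 2.93
```

This makes it easy to compare runs reported in loss (what fastai prints) against runs reported in perplexity.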
It’s all in this repo:
Maybe a couple of extra epochs could improve its performance.
I also trained a classifier. (my notebook is a bit messy but if it helps anyone I’ll be happy)
Hi Maria, how much time did it take for 5 epochs?
About 10 hours each (50 h total). It took a week, because I had to try different combinations of learning rates and momentum.
Hey Maria, great work! Why did you choose drop_mult=0.5? I’ve seen it around but don’t really know where that setting comes from!
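As I understand it, drop_mult is simply a global multiplier applied to every dropout probability in the model’s config, so 0.5 halves all the defaults at once. A minimal sketch of the idea; the baseline values below mirror what I believe are fastai v1’s AWD-LSTM LM defaults, but treat them as assumptions and check your version:

```python
# drop_mult scales every dropout probability in the model config at once.
# Baseline values are assumed from fastai v1's awd_lstm_lm_config defaults.
BASE_DROPOUTS = {
    "input_p": 0.25,
    "output_p": 0.1,
    "hidden_p": 0.15,
    "embed_p": 0.02,
    "weight_p": 0.2,
}

def apply_drop_mult(base, drop_mult):
    """Return a config with each dropout probability scaled by drop_mult."""
    return {k: v * drop_mult for k, v in base.items()}

print(apply_drop_mult(BASE_DROPOUTS, 0.5))
# e.g. input_p becomes 0.125, hidden_p 0.075, ...
```

The rule of thumb I’ve seen is lower drop_mult (less regularization) for large corpora like a full wiki dump, and higher values when fine-tuning on small datasets that overfit quickly.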