ULMFIT - Spanish


(Francisco Ingham) #1

The Spanish language model achieved SOTA on the TASS General Corpus dataset: a 0.57 F1 score versus the previous SOTA of 0.562 (see notebook).

My general approach was the same as Jeremy’s, the only difference being a tweet-specific pre-processing step that gives the model useful tokens which might improve performance.
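
For illustration, here is a minimal sketch of what such a tweet-specific step could look like. The exact rules and token names are my assumption, not taken from the notebook: replace mentions, URLs, and hashtags with a small set of consistent placeholder tokens so the tokenizer doesn’t see millions of unique strings.

    import re

    # Hypothetical placeholder tokens (xxurl, xxmention, xxhashtag),
    # in the spirit of fastai's xx-prefixed special tokens.
    def preprocess_tweet(text):
        text = re.sub(r'https?://\S+', ' xxurl ', text)    # URLs
        text = re.sub(r'@\w+', ' xxmention ', text)        # user mentions
        text = re.sub(r'#(\w+)', r' xxhashtag \1 ', text)  # hashtags: keep the word
        return re.sub(r'\s+', ' ', text).strip()

    preprocess_tweet('@user Mira esto https://t.co/abc #fastai')
    # -> 'xxmention Mira esto xxurl xxhashtag fastai'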


(German Goldszmidt) #2

What did you use as your backbone?


(Francisco Ingham) #3

The Wikipedia corpus in Spanish.


(German Goldszmidt) #4

I wrote the Spanish model using the fastai v1 interfaces, and only used the first part of the new data-preparation scripts.
How are people reporting the performance of their language models?
Current accuracy for my LM is 34%.

G


(Francisco Ingham) #5

We report perplexity, which is just the exponential of the cross-entropy loss the model is trained with.
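
Concretely, that means perplexity can be read straight off the validation loss fastai prints. Using the (rounded) validation losses reported later in this thread:

    import math

    # perplexity = exp(cross-entropy loss)
    math.exp(3.14)  # LSTM LM  -> ~23.10
    math.exp(3.19)  # QRNN LM  -> ~24.29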


(JDV) #6

I’m sadly kind of limited on resources. Could someone who has trained a Spanish LM with the fastai v1 interfaces share it? It seems the formats have changed and I can’t load the pretrained model from the OP in v1; it complains that the key 1.decoder.bias doesn’t exist.
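
In case anyone hits the same wall, the mismatch can at least be inspected by loading the checkpoint with plain PyTorch and printing its keys (a sketch; the filename is a placeholder for the downloaded weights):

    import torch

    # Load the old checkpoint on CPU and list its parameter names;
    # an error about a key like '1.decoder.bias' means that name exists
    # in one state_dict but not the other.
    old_state = torch.load('spanish_lm.pth', map_location='cpu')
    print(sorted(old_state.keys()))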

Anyway, does anyone have a pretrained model in the v1 format? @gsg mentioned having trained one with nice accuracy; maybe you could share yours? Thanks!


(JDV) #8

Results:

  • LSTM language model: 4 epochs, validation loss 3.140521, accuracy 0.376913; perplexity was thus 23.1038
  • QRNN language model: 7 epochs, validation loss 3.193912, accuracy 0.367543; perplexity was thus 24.2884

Pre-trained models can be found here along with the itos file: https://drive.google.com/open?id=1CZftqrMg-MRH9yXV7FRBv6J_NOtBiK-2

I decided to train the LM on fastai v1 myself. I ended up using Google Cloud and taking advantage of their 300 USD credits, which let me set up a V100 instance and just train there. QRNNs took ~30 minutes per epoch; LSTMs took around an hour per epoch. I used a Wikipedia dump and generated a 100M-token training set with a 30k vocab. All this to say there’s definitely room for improvement, and anyone could go ahead and improve on these results.
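
For anyone wanting to reproduce this, a rough sketch of the setup under later fastai v1 releases might look like the following. This assumes `data` is a TextLMDataBunch built from the Wikipedia dump, and the hyperparameters are illustrative, not the exact ones used:

    from fastai.text import *

    # Switch the AWD-LSTM architecture to QRNN layers, which is what
    # roughly halved the per-epoch time in the runs above.
    config = awd_lstm_lm_config.copy()
    config['qrnn'] = True

    # Training from scratch, so no pretrained weights are loaded.
    learn = language_model_learner(data, AWD_LSTM, config=config,
                                   drop_mult=0.3, pretrained=False)
    learn.fit_one_cycle(7, 1e-2, moms=(0.8, 0.7))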

Shoutout to @sgugger for guiding me along the way and fixing a bug just in time for me to train.

If someone could do some baseline testing with this LM, that’d be sweet.


#9

Great work!

I would love to try a Spanish language model out. Do you have any idea how I can use your saved models? In the notebook they load fastai’s pretrained model like this:

learn = language_model_learner(data, pretrained_model=URLs.WT103_1, drop_mult=0.3)

A Google Drive URL is not going to work there…
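
For what it’s worth, later fastai v1 releases added a pretrained_fnames argument for exactly this: download the weights (.pth) and itos (.pkl) files from the Drive link into your data path’s models folder and pass their names without extensions. The filenames below are placeholders for the Drive downloads:

    from fastai.text import *

    # `data` is your own TextLMDataBunch. The two downloaded files go in
    # data.path/'models' as e.g. spanish_wt.pth and itos.pkl.
    learn = language_model_learner(
        data, AWD_LSTM, drop_mult=0.3, pretrained=False,
        pretrained_fnames=['spanish_wt', 'itos'],
    )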