ULMFiT - Arabic

AbuFadl · December 1, 2018, 10:00am

I am trying to build an Arabic language model following the instructions at https://github.com/n-waves/ulmfit-multilingual/tree/master/ulmfit, running on Colab (!curl https://course-v3.fast.ai/setup/colab | bash). So far, all the steps have completed successfully. Then I tried to run pretrain_lm.py like so:
!python -m pretrain_lm 'data/wiki/ar-2-unk' 'ar' 0 True False 60000 70 70 'ar-2' 10 True 1.0
It runs with a few lines of output:

Batch size: 70
Max vocab: 60000
Using QRNNs...
Saving vocabulary as data/wiki/ar-2-unk/models/itos_ar-2.pkl
Size of vocabulary: 50723
First 10 words in vocab: <unk>, <pad>, <eos>, ., ،, في, من, @.@, على, &quot;
Cupy not found the code will work only on CPU!
true_wd:  False
Starting from random weights

then there is an error:

....
File "/usr/local/lib/python3.6/dist-packages/fastai/text/qrnn/forget_mult.py", line 184, in forward
    return GPUForgetMult()(f, x, hidden_init) if use_cuda else CPUForgetMult()(f, x, hidden_init)
  File "/usr/local/lib/python3.6/dist-packages/fastai/text/qrnn/forget_mult.py", line 127, in forward
    self.compile()
  File "/usr/local/lib/python3.6/dist-packages/fastai/text/qrnn/forget_mult.py", line 109, in compile
    program = _NVRTCProgram(kernel.encode(), 'recurrent_forget_mult.cu'.encode())
NameError: name '_NVRTCProgram' is not defined

Am I doing something wrong in this last step? I am also using this thread to ask whether there are standard Arabic benchmarks I could test the model on later.

bachir · December 1, 2018, 11:48am

@AbuFadl I was able to train a language model for Arabic using articles from Wikipedia. Hope the article will be helpful.

AbuFadl · December 1, 2018, 12:04pm

Thanks @bachir - that’s certainly helpful. Actually, I started that Kaggle dataset earlier. Now, I am trying to do it with piotr.czapla’s recent work.
Turned QRNN off and reduced the batch size (bs). Still grinding …
BTW: Arabic text needs preprocessing to avoid splitting words at the ‘shaddah’ diacritic.
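
For anyone hitting the same splitting problem, here is a minimal sketch of one possible preprocessing step. It assumes that dropping the diacritics (tashkeel, including the shaddah at U+0651) is acceptable for your task. The post does not say which fix was actually used, and the function name `strip_tashkeel` is hypothetical. The shaddah is a combining mark, which is one plausible reason a word-boundary tokenizer would break a word at it.

```python
import re
import unicodedata

# Arabic diacritics (tashkeel) occupy U+064B..U+0652; the shaddah is U+0651.
TASHKEEL = re.compile(r"[\u064B-\u0652]")

def strip_tashkeel(text: str) -> str:
    """Remove Arabic diacritics so they cannot act as split points (assumed fix)."""
    return TASHKEEL.sub("", text)

# The shaddah is a non-spacing combining mark, not a letter.
shaddah_category = unicodedata.category("\u0651")

# "Muhammad" written with damma, fatha, shaddah and fatha.
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
plain = strip_tashkeel(vocalized)
print(shaddah_category, plain)
```

Stripping loses information that the shaddah carries, so an alternative is to keep the marks and use a tokenizer that treats combining marks as part of the word.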

AbuFadl · December 2, 2018, 10:56am

Update: trained the model on the small dataset and am trying to validate on XNLI. Got an error and opened an issue: https://github.com/n-waves/ulmfit-multilingual/issues/19
Update 2: issue closed. Some more progress, but now there is a new issue: KeyError: '1.decoder.bias' in convert_weights_with_prefix
Update 3: weights issue resolved by the original author. Notebook with working code, reaching a perplexity of 24.7.

Update 4: classification test on machine-translated Yelp reviews. Notebook

AbuFadl · December 18, 2018, 8:16pm

A Kaggle kernel based on the model is now available. The model seems to perform better than traditional methods on sentiment classification.

gradstudentdescent · January 30, 2019, 6:51am

Hey @AbuFadl, I encountered the same NameError: name '_NVRTCProgram' is not defined problem as you. How did you end up fixing this?

AbuFadl · January 30, 2019, 7:00am

I turned QRNN off (qrnn=False). If you have enough GPU memory, try fastai 1.0.42 or later; QRNN may work there. But I don’t think you need QRNN for your language model training.
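
The "Cupy not found" line in the log above suggests the NameError shows up when CuPy is missing from the environment. A small sketch, under that assumption, for checking before you decide whether to enable QRNN. The helper `cupy_available` and the `use_qrnn` variable are hypothetical names and not part of fastai or the training script.

```python
import importlib.util

def cupy_available() -> bool:
    """Return True if the cupy package can be found in this environment."""
    return importlib.util.find_spec("cupy") is not None

# Hypothetical flag: pass this value wherever your training script takes its qrnn option.
use_qrnn = cupy_available()
print(f"CuPy installed: {use_qrnn} -> qrnn={use_qrnn}")
```

This only checks that the package is installed, not that it matches the CUDA version on the machine, so QRNN can still fail even when the check passes.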