MULTIFIT - Runtime error: Permission denied

I would like to try MULTIFIT in a different language, but first, to make sure everything works as it should, I decided to run MULTIFIT in French following Pierre Guillou's notebook:

link: https://github.com/piegu/language-models

but I encountered a problem when running:

##FORWARD
%%time
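# Build the LM data: read the raw text files, train a SentencePiece tokenizer
# (15k max vocab), hold out 10% for validation, and batch for language modeling.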
data = (TextList.from_folder(dest, processor=[OpenFileProcessor(),
                                              SPProcessor(max_vocab_sz=15000)])
        .split_by_rand_pct(0.1, seed=42)
        .label_for_lm()
        .databunch(bs=bs, num_workers=1))

After running this, I get the following error:

RuntimeError: Permission denied: ""/home/userz/.fastai/data/frwiki/corpus2_100/tmp/spm".model": 
No such file or directory Error #2

For context, I am running everything on Google Cloud.

Hello,
A few questions to help you:

  • which notebook did you use?
  • did you encounter the problem running my notebook in French, or only when using the code with your own data (the Polish Wikipedia, for example)?

I plan to recreate MULTIFIT for Polish: first train on Wikipedia, then on several other large corpora that may be closer to the target domain, and compare the results. The error occurred when I ran your lm3-french notebook with the French Wikipedia on Google Cloud.
I wanted to start in French to make sure everything worked before switching languages.

I am facing the same error with ULMFiT too. There seems to be a small issue in the way SentencePiece is called: the quote marks around the model-prefix parameter need to be removed. I was able to fix it by updating the SentencePiece call in site-packages/fastai/text/data.py to:

SentencePieceTrainer.Train(" ".join([
    f"--input={quotemark}{raw_text_path}{quotemark} --max_sentence_length={max_sentence_len}",
    f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
    f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
    f"--user_defined_symbols={','.join(spec_tokens)}",
    f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
  1. lm3-french
  2. French Wikipedia

The code that gives the error (the code you copied/pasted in your first post) in the notebook lm3.french.ipynb is SPProcessor(max_vocab_sz=15000) in the data block.

By searching for the class SPProcessor in the fastai v1 documentation, you get a link to its source code, where you can see that the object SPProcessor(max_vocab_sz=15000) imports SentencePieceTrainer and SentencePieceProcessor and creates a temporary folder tmp (to store the model and vocabulary produced by training the SentencePiece tokenizer).
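
As a quick sanity check, you can verify that the SentencePiece training actually produced these files. A sketch, with paths matching the French Wikipedia setup above:

from pathlib import Path

# `dest` is the corpus folder used in the data block of the first post.
dest = Path.home()/'.fastai'/'data'/'frwiki'/'corpus2_100'
print((dest/'tmp'/'spm.model').exists())  # True once training succeeded
print((dest/'tmp'/'spm.vocab').exists())  # True once training succeeded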

By searching the fastai v1 GitHub repository (or by reading the post of @nandakumar212 🙂), you'll find that SentencePieceTrainer is called in the file data.py in the folder https://github.com/fastai/fastai/tree/master/fastai/text.

Searching in the file data.py, you'll find the following code at line 431:

SentencePieceTrainer.Train(" ".join([
    f"--input={quotemark}{raw_text_path}{quotemark} --max_sentence_length={max_sentence_len}",
    f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
    f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
    f"--user_defined_symbols={','.join(spec_tokens)}",
    f"--model_prefix={quotemark}{cache_dir/'spm'}{quotemark} --vocab_size={vocab_sz} --model_type={model_type}"]))

And as @nandakumar212 said, you can try updating this code (i.e., editing the file data.py) by removing {quotemark} from the line f"--model_prefix={quotemark}{cache_dir/'spm'}{quotemark} --vocab_size={vocab_sz} --model_type={model_type}". You'll get:

SentencePieceTrainer.Train(" ".join([
    f"--input={quotemark}{raw_text_path}{quotemark} --max_sentence_length={max_sentence_len}",
    f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
    f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
    f"--user_defined_symbols={','.join(spec_tokens)}",
    f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))

I did not try it myself, but @nandakumar212 did and it worked. 🙂 (PS: when I created my notebook lm3.french.ipynb a year ago, I did not run into this problem. I guess the code of the file data.py was changed after that.)

@nandakumar212: you should open an issue on the fastai v1 GitHub with your solution; it could help a lot of people.


@pierreguillou I found a similar issue on GitHub and mentioned this fix in the comments.


I have a somewhat different question and didn't know whether it deserves its own thread, so I'll ask it here.
During training (after 3 epochs, about 9 hours), a message about a broken Internet connection popped up, so I had to open the notebook again. The connection must have dropped for literally a few seconds, because that usually doesn't happen. So I started training again, hence the question:

If that happens again, do I have to train the model from the beginning?

The 10 training epochs will take about 26 hours, so my Internet connection will in all likelihood be interrupted at some point. I wouldn't want to lose a dozen hours of training again.

Besides, the model trains for quite a long time: about 3 hours per epoch on a single Tesla V100. Is that normal? In @pierreguillou's notebook it was about 1 hour per epoch. My accuracy after the second epoch was 38%.

It happened again… I got the following message once more. The connection seems to be flawless, so I don't know where it came from. Maybe something is wrong with Google Cloud?

Connection failed: A connection to the notebook server could not be established. The notebook will continue trying to reconnect. Check your network connection or notebook server configuration.

Hello. In practice, you cannot train an NLP model for hours or days over an SSH connection from your own computer, because of the risk of disconnection.

To avoid this risk, the Jupyter Notebook server must be launched on the GPU platform, not on your laptop. Then, when an SSH disconnection occurs (and it will: Internet problems, a low laptop battery, your child tapping on the keyboard, etc.), your notebook keeps running on the GPU platform. Just log in again via SSH to see it.
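
As a complementary safeguard against losing hours of work, you can also checkpoint the model after every epoch. A sketch using fastai v1's SaveModelCallback (the learner setup and the checkpoint name lm are illustrative, not the exact code of the notebooks):

from fastai.text import language_model_learner, AWD_LSTM
from fastai.callbacks import SaveModelCallback

# Training from scratch on Wikipedia, as in the notebooks above.
learn = language_model_learner(data, AWD_LSTM, pretrained=False)

# Saves models/lm_0.pth, models/lm_1.pth, ... after each epoch.
learn.fit_one_cycle(10, 2e-3,
                    callbacks=[SaveModelCallback(learn, every='epoch', name='lm')])

# After an interruption, reload the last completed epoch and continue:
# learn.load('lm_3')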

Hope it helps.


Thank you! I trained the model and the results are promising. This is my first NLP project, so I'm really grateful for your help. Now it's time to move on.