SPProcessor Error

(Zachary Mueller) #1

Hi all, I am trying to use SentencePiece on the IMDB data, and I create my model as such (after installing sentencepiece):

data_lm = (TextList.from_folder(path, processor=SPProcessor())
           .filter_by_folder(include=['train', 'test', 'unsup'])
           .databunch(bs=bs, num_workers=2))

However, when I call label_for_lm() it fails, giving me the following:

/usr/local/lib/python3.6/dist-packages/fastai/text/data.py in train_sentencepiece(texts, path, pre_rules, post_rules, vocab_sz, max_vocab_sz, model_type, max_sentence_len, lang, char_coverage, tmp_dir)
    431         f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
    432         f"--user_defined_symbols={','.join(spec_tokens)}",
--> 433         f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
    434     raw_text_path.unlink()
    435     return cache_dir

RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(498) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] 

Any idea what may be happening? (Am I even using it correctly?)


(Zachary Mueller) #2

@sgugger the fix involves adding f"--hard_vocab_limit={False}" to the SentencePieceTrainer.Train call. I can put in a PR for this fix.
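For anyone hitting the same error: it occurs when the corpus cannot yield as many distinct pieces as the requested vocab_size, and `--hard_vocab_limit=false` tells SentencePiece to shrink the vocabulary instead of raising. A minimal sketch of how the flag fits into the argument string (the flag names are real SentencePiece options, but the helper function and paths here are illustrative, not fastai's actual source). Note that f"{False}" renders as the string "False"; SentencePiece's documentation shows the lowercase form, so writing the literal "false" is safer:

```python
# Sketch: assembling the space-separated flag string passed to
# SentencePieceTrainer.Train, with the vocab-limit fix included.
def build_sp_train_args(raw_text_path, cache_dir, vocab_sz, model_type="unigram"):
    """Build the argument string for SentencePieceTrainer.Train (illustrative)."""
    return " ".join([
        f"--input={raw_text_path}",
        f"--model_prefix={cache_dir}/spm",
        f"--vocab_size={vocab_sz}",
        f"--model_type={model_type}",
        # The fix: let SentencePiece fall back to a smaller vocab
        # rather than failing when the corpus is too small.
        "--hard_vocab_limit=false",
    ])

cmd = build_sp_train_args("all_text.out", "tmp", 30000)
print(cmd)
```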