SPProcessor Error

Hi all, I am trying to use SentencePiece on the IMDB data, and I create my data like this (after installing sentencepiece):

data_lm = (TextList.from_folder(path, processor=SPProcessor())
           .filter_by_folder(include=['train', 'test', 'unsup'])
           .databunch(bs=bs, num_workers=2))

However, when I go to label_for_lm() it fails, giving me the following:

/usr/local/lib/python3.6/dist-packages/fastai/text/data.py in train_sentencepiece(texts, path, pre_rules, post_rules, vocab_sz, max_vocab_sz, model_type, max_sentence_len, lang, char_coverage, tmp_dir)
    431         f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
    432         f"--user_defined_symbols={','.join(spec_tokens)}",
--> 433         f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
    434     raw_text_path.unlink()
    435     return cache_dir

RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(498) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] 

Any idea what may be happening? (Am I even using it correctly?)

@sgugger the fix involves adding f"--hard_vocab_limit={False}" to the SentencePieceTrainer.Train call. I can put in a PR for this fix.
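
For anyone who wants to try the flag outside fastai, here is a minimal sketch of passing it to SentencePiece directly. The helper names build_spm_args and train_spm are made up for illustration, and this is not the fastai code (which builds a longer argument string). As I understand it, turning hard_vocab_limit off makes vocab_size a soft limit, so training does not abort when the corpus yields fewer pieces than requested.

```python
def build_spm_args(input_path, model_prefix, vocab_sz, model_type="unigram",
                   hard_vocab_limit=False):
    """Assemble the space-separated flag string for SentencePieceTrainer.Train.

    Hypothetical helper for illustration; not part of fastai or sentencepiece.
    """
    return " ".join([
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_sz}",
        f"--model_type={model_type}",
        # Written in lowercase ("true"/"false") here.
        f"--hard_vocab_limit={str(hard_vocab_limit).lower()}",
    ])


def train_spm(input_path, model_prefix, vocab_sz):
    # Imported lazily so build_spm_args works without sentencepiece installed.
    from sentencepiece import SentencePieceTrainer
    SentencePieceTrainer.Train(build_spm_args(input_path, model_prefix, vocab_sz))
```

Calling train_spm("all_text.out", "spm", 15000) would then train with the soft limit; the file name and vocab size there are only placeholders.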

@muellerzr I don’t think that is the issue, or at least not the whole issue. I just tried SentencePiece as well:

processor = SPProcessor(lang="en", vocab_sz = 10000)
data_lm = (TextList.from_folder(path, processor=processor)

and it worked with a lower vocab size. However, when I show a batch, it turns out to have tokenized the file names rather than the file contents when using the “from_folder” method.

Any clue on a fix?

Along with SPProcessor you also need OpenFileProcessor(), which says you want the text inside the files rather than the file names. I wound up following the Turkish notebook to get it working.
Here is mine: I used SentencePiece and Spacy on the IMDb
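
A rough sketch of what that looks like with the fastai v1 data block API, in case the link goes away. The function name build_imdb_lm_data is made up, and the split fraction, seed, batch size, and vocab size are my own placeholder choices rather than anything from this thread, so adjust them to your setup.

```python
def build_imdb_lm_data(path, bs=64, vocab_sz=10000):
    """Build a language-model DataBunch from text files using SentencePiece.

    Hypothetical helper; requires fastai v1 and sentencepiece to be installed.
    """
    # Imported lazily so this definition loads even without fastai present.
    from fastai.text.data import TextList, OpenFileProcessor, SPProcessor

    # OpenFileProcessor runs first and replaces each file path with the
    # file's text; SPProcessor then tokenizes that text.
    processor = [OpenFileProcessor(), SPProcessor(lang="en", vocab_sz=vocab_sz)]
    return (TextList.from_folder(path, processor=processor)
            .filter_by_folder(include=['train', 'test', 'unsup'])
            .split_by_rand_pct(0.1, seed=42)   # assumed split, not from the thread
            .label_for_lm()
            .databunch(bs=bs, num_workers=2))
```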

Worked like a charm. Thanks!

I got the same error and I am wondering why it happens. My corpus is large, so the vocab size should definitely be bigger than the 15k I set.

One thing I see in the SP logs is warnings about lines that are too long.
Does that mean it skips almost all of the text during training? Partly it does, which explains the low vocab size and the same error you had. But I had already split the text into files of size 20480, which is the sentence limit.

Loading corpus: data/en-100/tmp/all_text.out
trainer_interface.cc(287) LOG(WARNING) Found too long line (23209 > 20480).
trainer_interface.cc(289) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(290) LOG(WARNING) The maximum length can be changed with --max_sentence_length=<size> flag.
trainer_interface.cc(315) LOG(INFO) Loaded all 73 sentences
trainer_interface.cc(321) LOG(INFO) Skipped 24216 too long sentences.

Limiting text fields to a length of 10240 helps. It seems something is added before SP runs that increases the length, although I don’t know what. I checked the all_text.out file and it looks fine.
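
In case it is useful, here is a small sketch for checking and enforcing the limit before training. It measures length in UTF-8 bytes rather than characters, on the assumption (as far as I know, but not verified in this thread) that SentencePiece counts the sentence length in bytes, which would make non-ASCII text longer than len() suggests. Both function names are made up for illustration.

```python
def count_long_lines(lines, max_bytes=20480):
    """Count lines that exceed max_bytes once encoded as UTF-8."""
    return sum(len(line.encode("utf-8")) > max_bytes for line in lines)


def split_long_text(text, max_bytes=10240):
    """Split text on whitespace into chunks of at most max_bytes UTF-8 bytes.

    A single word longer than max_bytes is kept whole as its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for word in text.split():
        word_len = len(word.encode("utf-8"))
        # +1 accounts for the joining space when the chunk is non-empty.
        extra = word_len + (1 if current else 0)
        if current and current_len + extra > max_bytes:
            chunks.append(" ".join(current))
            current, current_len = [word], word_len
        else:
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Running count_long_lines over the lines of all_text.out should roughly match the "Skipped ... too long sentences" number in the log if the byte assumption holds.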