SPProcessor Error

(Zachary Mueller) #1

Hi all, I am trying to use SentencePiece on the IMDb data, and I create my model like this (after installing sentencepiece):

data_lm = (TextList.from_folder(path, processor=SPProcessor())
           .filter_by_folder(include=['train', 'test', 'unsup'])
           .split_by_rand_pct()
           .label_for_lm()
           .databunch(bs=bs, num_workers=2))

However, when I go to label_for_lm() it fails, giving me the following:

/usr/local/lib/python3.6/dist-packages/fastai/text/data.py in train_sentencepiece(texts, path, pre_rules, post_rules, vocab_sz, max_vocab_sz, model_type, max_sentence_len, lang, char_coverage, tmp_dir)
    431         f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
    432         f"--user_defined_symbols={','.join(spec_tokens)}",
--> 433         f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
    434     raw_text_path.unlink()
    435     return cache_dir

RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(498) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] 

Any idea what may be happening? (Am I even using it correctly?)


(Zachary Mueller) #2

@sgugger the fix is to add f"--hard_vocab_limit={False}" to the SentencePieceTrainer.Train call. I can put in a PR for this fix.
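For context, here is a minimal sketch of where that flag would go when assembling the SentencePiece training options. The helper name is hypothetical (the actual fastai source builds the string inline in train_sentencepiece); note that SentencePiece expects the lowercase literals true/false for boolean flags.

```python
# Hypothetical helper, not the fastai source: sketches how the option string
# passed to SentencePieceTrainer.Train would gain the extra flag.

def build_sp_train_options(input_path, model_prefix, vocab_sz=30000,
                           model_type="unigram", hard_vocab_limit=False):
    """Assemble a SentencePiece training option string."""
    return " ".join([
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_sz}",
        f"--model_type={model_type}",
        # Relaxing the hard vocab limit lets training finish even when the
        # corpus cannot yield vocab_sz distinct pieces (the error above).
        f"--hard_vocab_limit={str(hard_vocab_limit).lower()}",
    ])

# Usage (assuming sentencepiece is installed):
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.Train(build_sp_train_options("texts.txt", "tmp/spm"))
```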


(Nathan) #3

@muellerzr I don’t think that is the issue, or at least not the whole issue. I just attempted to use SentencePiece as well:

processor = SPProcessor(lang="en", vocab_sz=10000)
data_lm = (TextList.from_folder(path, processor=processor)
           ...

and it managed to work with the lower vocab size. However, when I go to show a batch:

data_lm.show_batch()

it tokenizes the file names when using the from_folder method:

Any clue on a fix?


(Zachary Mueller) #4

Along with SPProcessor you need OpenFileProcessor() to say you want the text inside the files rather than the file names. I wound up following the Turkish notebook to get it working.
Here is mine: I used SentencePiece and Spacy on the IMDb
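To make the failure mode concrete, here is a toy sketch (hypothetical classes, not the fastai source) of a processor chain: without an open-file step, the tokenizer operates on the path strings themselves, which is exactly the "tokenized file names" symptom above.

```python
# Toy processor chain illustrating why an open-file step must come first.
from pathlib import Path

class OpenFileProc:
    """Replace each path item with the text the file contains."""
    def process(self, items):
        return [Path(p).read_text() for p in items]

class ToyTokenizer:
    """Stand-in for SentencePiece: just lowercases and splits on whitespace."""
    def process(self, items):
        return [t.lower().split() for t in items]

def run_pipeline(items, processors):
    """Run each processor over the items in order."""
    for proc in processors:
        items = proc.process(items)
    return items

# With [OpenFileProc(), ToyTokenizer()] the tokenizer sees the review text;
# with [ToyTokenizer()] alone it sees only the path string.
```

The same ordering applies in fastai v1: pass the open-file processor before SPProcessor so SentencePiece receives the file contents.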


(Nathan) #5

Worked like a charm. Thanks!
