Error when using SentencePiece for NLP preprocessing

I am trying to use SentencePiece to preprocess some sequence data, but I am getting an error message:

Permission denied: "tmp/spm.model": No such file or directory Error #2

Does anyone know why this might be, or how to fix it?
Code is below:

!pip install sentencepiece
from fastai import *
from fastai.text import *

url = 'https://www.uniprot.org/uniprot/?query=reviewed:yes&format=tab&limit=500&columns=sequence'
#url = 'https://www.uniprot.org/uniprot/?query=reviewed:yes&format=tab&columns=sequence'

seqs = pd.read_csv(url)

TextList.from_df??
bs=128
data = (TextList.from_df(seqs, processor=[OpenFileProcessor(), SPProcessor()])
        .split_by_rand_pct(0.1, seed=42)
        .label_for_lm()
        .databunch(bs=bs, num_workers=1))

This makes me believe that sentencepiece is not able to perform the subword tokenization. My understanding is that SPProcessor needs to create a tmp/spm.model and tmp/spm.vocab in the directory where this notebook is located.

Also, the code is using fastai v1, and I haven’t used sentencepiece with v1.

We need additional information to diagnose this further. Please review this post: How to debug your code and ask for help with fastai v2

Do you have any fixes?

I have recently used SubwordTokenizer, which according to the source code actually uses SentencePieceTokenizer; however, that was in fastai v2 (which I strongly recommend).

See this post: Tokenizer with pretrained vocab in fastai

When you create your dataloaders (i.e. databunch in fastai 1 terminology) with SubwordTokenizer, fastai trains a tokenizer on your corpus and saves it under 'tmp/spm.model' for later use.
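
For example, a minimal sketch of what that setup could look like in fastai v2 (the CSV path, the 'Sequence' column name, and vocab_sz are placeholders you would adapt to your own data):

from fastai.text.all import *
import pandas as pd

seqs = pd.read_csv('sequences.csv')  # placeholder: any DataFrame with one text column

dls = DataBlock(
    blocks=TextBlock.from_df('Sequence', is_lm=True,
                             tok=SubwordTokenizer(vocab_sz=1000)),
    get_x=ColReader('text'),   # TextBlock.from_df puts the tokenized text in a 'text' column
    splitter=RandomSplitter(0.1, seed=42)
).dataloaders(seqs, bs=128)

# after this runs, the SentencePiece model trained on the corpus should be cached under tmp/spm.model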


Ah, I see. Yes, I was going off of an NLP notebook Jeremy wrote last year, so I guess it was using the older version of fastai.

If I were training a language model from scratch, would I need to train the tokenizer separately, or could I just include tok=SentencePieceTokenizer() as an argument in my DataBlock?

I don't think you need to explicitly train the tokenizer as I did in the post I shared (there I just wanted to demonstrate how to load a previously trained tokenizer). Setting up the dataloaders with tok=SubwordTokenizer() should create and save the tokenizer automatically.

If I want to instantiate the dls object again without it retraining the subword tokenizer, how would I do that? Would it just be tok='tmp/some.model'?

Thanks!
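
For reference, reloading the saved tokenizer instead of retraining it could look roughly like this (a sketch, assuming the model was saved to tmp/spm.model as above; it relies on the sp_model argument of SentencePieceTokenizer / SubwordTokenizer rather than passing the path directly as tok):

tok = SubwordTokenizer(sp_model='tmp/spm.model')  # reuse the already-trained SentencePiece model
dls = DataBlock(
    blocks=TextBlock.from_df('Sequence', is_lm=True, tok=tok),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1, seed=42)
).dataloaders(seqs, bs=128)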