I am trying to use SentencePiece to preprocess some sequence data, but I am getting an error message:
Permission denied: "tmp/spm.model": No such file or directory Error #2
Does anyone know why this might be, or how to fix it?
!pip install sentencepiece
from fastai import *
from fastai.text import *
url = 'https://www.uniprot.org/uniprot/?query=reviewed:yes&format=tab&limit=500&columns=sequence'
#url = 'https://www.uniprot.org/uniprot/?query=reviewed:yes&format=tab&columns=sequence'
seqs = pd.read_csv(url)
data = (TextList.from_df(seqs, processor=[OpenFileProcessor(), SPProcessor()])
        .split_by_rand_pct()
        .label_for_lm()
        .databunch())
This makes me believe that SentencePiece is not able to perform the subword tokenization. My understanding is that SPProcessor needs to create a tmp/spm.model and a tmp/spm.vocab in the directory where this notebook is located.
Also, the code is using fastai v1, and I haven't used SentencePiece with the v1 version.
I need additional information to diagnose further. Please review this: How to debug your code and ask for help with fastai v2
I have recently used SubwordTokenizer, which according to the source code actually uses SentencePieceTokenizer, in fastai v2 (which I strongly recommend). See this post: Tokenizer with pretrained vocab in fastai
When you create your DataLoaders (i.e. the databunch in fastai v1 terminology) with SubwordTokenizer, fastai trains a tokenizer on your corpus and saves it under 'tmp/spm.model' for later use.
Ah, I see. Yes, I was going off of an NLP notebook Jeremy wrote last year, so I guess it was using the older version of fastai.
If I am training a language model from scratch, do I need to train the tokenizer separately, or can I just include tok=SentencePieceTokenizer() as an argument in my DataBlock?
I don't think you need to explicitly train the tokenizer as I did in the post I shared (there I just wanted to demonstrate how to load a previously trained tokenizer). Setting up the DataLoaders with
tok=SubwordTokenizer() should create and save the tokenizer automatically.
If I want to instantiate the dls object again without it retraining the subword tokenizer, how would I do that? Would it just be tok='tmp/some.model'?