I am trying to use SentencePiece to preprocess some sequence data, but I am getting an error message:
Permission denied: "tmp/spm.model": No such file or directory Error #2
Does anyone know why this might be, or how to fix it?
!pip install sentencepiece
from fastai import *
from fastai.text import *
url = 'https://www.uniprot.org/uniprot/?query=reviewed:yes&format=tab&limit=500&columns=sequence'
#url = 'https://www.uniprot.org/uniprot/?query=reviewed:yes&format=tab&columns=sequence'
seqs = pd.read_csv(url)
data = (TextList.from_df(seqs, processor=[OpenFileProcessor(), SPProcessor()])
        .split_by_rand_pct()
        .label_for_lm()
        .databunch())
This makes me believe that SentencePiece is not able to perform the subword tokenization. My understanding is that SPProcessor needs to create a tmp/spm.model and a tmp/spm.vocab in the directory where this notebook is located.
Also, the code is using fastai v1, and I haven't used SentencePiece with the v1 version.
I need additional information to diagnose further. Please review this: How to debug your code and ask for help with fastai v2
I have recently used SubwordTokenizer, which according to the source code actually uses SentencePieceTokenizer, in fastai v2 (which I strongly recommend). See this post: Tokenizer with pretrained vocab in fastai
When you create your DataLoaders (i.e. the databunch in fastai v1 terminology) with SubwordTokenizer, fastai trains a tokenizer on your corpus and saves it under 'tmp/spm.model' for later use.
Ah, I see. Yes, I was going off of an NLP notebook Jeremy wrote last year, so I guess it was using the older version of fastai.
If I am training a language model from scratch, do I need to train the tokenizer separately, or can I just include tok=SentencePieceTokenizer() as an argument in my DataBlock?
I don't think you need to explicitly train the tokenizer as I did in the post I shared (there I just wanted to demonstrate how to load a previously trained tokenizer). Setting up the DataLoaders with
tok=SubwordTokenizer() should create and save the tokenizer automatically.
If I want to instantiate the dls object again without it retraining the subword tokenizer, how would I do that? Would it just be tok='tmp/some.model'?