How to export a classifier with SubwordTokenizer?

Hi, I’m trying to train a text classifier based on a ULMFiT model with the latest fastai2. If I use a SubwordTokenizer, I’m unable to export the classifier. For the language model itself that is fine, but the classifier needs to be deployed elsewhere, so I need to export it. The export fails because the SubwordTokenizer contains a SwigPyObject, which cannot be pickled:

  File "/app/text_ml/model_implementations/fastai_ulmfit/fastai.py", line 287, in train_classifier
    self._classifier.export(PosixPath(f'models/{self.name}/classifier.pkl').absolute())
  File "/usr/local/lib/python3.8/site-packages/fastai/learner.py", line 375, in export
    torch.save(self, self.path/fname, pickle_module=pickle_module, pickle_protocol=pickle_protocol)
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 484, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'SwigPyObject' object
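
To narrow down which attribute is holding the unpicklable object, something along these lines can help. This is only a rough sketch: find_unpicklable is a hypothetical helper of my own, not part of fastai or PyTorch, it only walks dicts, lists, tuples and __dict__ attributes, and the demo uses a threading.Lock as a stand-in for the SwigPyObject (it fails to pickle in a similar way).

```python
import pickle
import threading
from types import SimpleNamespace


def find_unpicklable(obj, path="obj", _seen=None):
    """Return the deepest attribute paths under `obj` that fail to pickle."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:          # guard against reference cycles
        return []
    _seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return []                 # this whole subtree pickles fine
    except Exception:
        pass
    # Collect the children we know how to walk.
    if isinstance(obj, dict):
        children = [(f"{path}[{k!r}]", v) for k, v in obj.items()]
    elif isinstance(obj, (list, tuple)):
        children = [(f"{path}[{i}]", v) for i, v in enumerate(obj)]
    elif hasattr(obj, "__dict__"):
        children = [(f"{path}.{k}", v) for k, v in vars(obj).items()]
    else:
        children = []
    found = []
    for child_path, child in children:
        found.extend(find_unpicklable(child, child_path, _seen))
    # If no child is to blame, the object itself is the culprit.
    return found or [path]


# Stand-in for a learner whose tokenizer wraps a native object.
learner_like = SimpleNamespace(name="clf", dls={"tok": threading.Lock(), "bs": 64})
print(find_unpicklable(learner_like, "learn"))
```

On a real learner the walk can be slow, since every level is pickled again, so I would only use it for debugging.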

My definitions are the following:


tok = SubwordTokenizer(lang=self._language, sp_model=self._language_model.paths["tuned_model_path"]/'spm.model')
dblocks = DataBlock(
    blocks=(TextBlock.from_df('text', tok=tok, vocab=self._language_model.get_vocabulary()), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('cat'),
    splitter=ColSplitter("is_validation")
)
dls = dblocks.dataloaders(self.df, bs=batch_size, num_workers=2)
early_stopping_cb = partial(EarlyStoppingCallback, monitor='valid_loss', min_delta=0.01, patience=2)

classifier = text_classifier_learner(
    dls,
    AWD_LSTM,
    drop_mult=0.5,
    metrics=[accuracy, error_rate, Recall(average='macro')],
    cbs=[CSVLogger, early_stopping_cb()]
)

If I remove the tok=tok argument from the TextBlock definition, the export works, but then the classifier obviously doesn’t use the subword tokenization.
I’ve found that I need to remove the extra metrics and callbacks to be able to export, like this:

classifier.metrics = []
classifier.remove_cb(CSVLogger)
classifier.export(PosixPath(f'models/{self.name}/classifier.pkl').absolute())

Is there also a good way to remove the tokenizer before exporting, so that it doesn’t affect the rest of the model?

Never mind, the issue was actually an old version of SentencePiece (which I had downgraded to debug another issue). Upgrading back to 0.1.96 solved the problem.
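
In case it helps anyone else, a quick check of the installed SentencePiece version before exporting might look like the sketch below. The helper names parse_version and check_min_version are my own, and 0.1.96 is simply the version that worked for me; I have not checked which release first made the tokenizer picklable.

```python
from importlib import metadata
from itertools import takewhile


def parse_version(text):
    """Turn '0.1.96' into (0, 1, 96), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_min_version(package, minimum):
    """Return (installed_version_or_None, ok) for an installed distribution."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, parse_version(installed) >= parse_version(minimum)


# 0.1.96 is the version that worked in my setup, not a verified minimum.
installed, ok = check_min_version("sentencepiece", "0.1.96")
if not ok:
    print(f"sentencepiece is {installed!r}; try upgrading to 0.1.96 before calling export()")
```

This relies on importlib.metadata, which is available from Python 3.8 onwards.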