Hi, I’m trying to train a text classifier based on a ULMFiT model with the latest fastai2. When I use a SubwordTokenizer, I’m unable to export the classifier. For the language model itself that is fine, but the classifier has to be deployed elsewhere, so I need to export it. The export fails with a pickling error, apparently because the SubwordTokenizer holds a SwigPyObject:
  File "/app/text_ml/model_implementations/fastai_ulmfit/fastai.py", line 287, in train_classifier
    self._classifier.export(PosixPath(f'models/{self.name}/classifier.pkl').absolute())
  File "/usr/local/lib/python3.8/site-packages/fastai/learner.py", line 375, in export
    torch.save(self, self.path/fname, pickle_module=pickle_module, pickle_protocol=pickle_protocol)
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 484, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'SwigPyObject' object
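To illustrate what I think is going on, here is a minimal sketch without fastai that produces the same class of error. The class name `HoldsNativeHandle` and the helper `try_pickle` are made up for illustration, and a `threading.Lock` only stands in for the native SWIG object, on the assumption that pickle rejects both in the same way:

```python
import pickle
import threading


class HoldsNativeHandle:
    """Hypothetical stand-in for a tokenizer that wraps a native object."""

    def __init__(self):
        # A lock is used here only because pickle refuses to serialize it,
        # which is assumed to mirror how it treats a SWIG proxy object.
        self.handle = threading.Lock()


def try_pickle(obj):
    """Return the pickled bytes, or the TypeError message if pickling fails."""
    try:
        return pickle.dumps(obj)
    except TypeError as err:
        return f"TypeError: {err}"


result = try_pickle(HoldsNativeHandle())
print(result)
```

The object itself is an ordinary Python instance; pickling only fails once pickle reaches the attribute it cannot serialize, which would explain why the export breaks as soon as the tokenizer is part of the learner.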
My definitions are the following:
tok = SubwordTokenizer(lang=self._language, sp_model=self._language_model.paths["tuned_model_path"]/'spm.model')
dblocks = DataBlock(
    blocks=(TextBlock.from_df('text', tok=tok, vocab=self._language_model.get_vocabulary()), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('cat'),
    splitter=ColSplitter("is_validation")
)
dls = dblocks.dataloaders(self.df, bs=batch_size, num_workers=2)
early_stopping_cb = partial(EarlyStoppingCallback, monitor='valid_loss', min_delta=0.01, patience=2)
classifier = text_classifier_learner(
    dls,
    AWD_LSTM,
    drop_mult=0.5,
    metrics=[accuracy, error_rate, Recall(average='macro')],
    cbs=[CSVLogger, early_stopping_cb()]
)
If I remove the tok=tok argument from the TextBlock definition, the export works, but then the classifier no longer uses the subword tokenization.
I’ve found that I need to remove the extra metrics and callbacks to be able to export, like this:
classifier.metrics = []
classifier.remove_cb(CSVLogger)
classifier.export(PosixPath(f'models/{self.name}/classifier.pkl').absolute())
Is there also a good way to remove the tokenizer before exporting, so that it doesn’t affect the rest of the model?
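One direction I am considering, as an untested sketch rather than anything fastai provides: wrap the tokenizer so that pickling keeps only the model path and the native handle is rebuilt on load, using Python's `__getstate__` and `__setstate__` hooks. Everything named here (`FakeProcessor`, `PicklableTokenizer`, the path `"spm.model"`) is hypothetical, and `FakeProcessor` merely imitates an unpicklable handle. I have not checked whether `export` and `load_learner` would cooperate with such a wrapper around SubwordTokenizer.

```python
import pickle
import threading


class FakeProcessor:
    """Hypothetical stand-in for a native tokenizer handle; not picklable."""

    def __init__(self, model_path):
        self.model_path = model_path
        # The lock makes instances unpicklable, imitating a SWIG proxy.
        self._lock = threading.Lock()

    def encode(self, text):
        # Whitespace splitting stands in for real subword tokenization.
        return text.split()


class PicklableTokenizer:
    """Keeps only the model path when pickled and reloads the handle afterwards."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.processor = FakeProcessor(model_path)

    def __call__(self, text):
        return self.processor.encode(text)

    def __getstate__(self):
        # Copy the instance state and drop the unpicklable handle.
        state = self.__dict__.copy()
        state.pop("processor", None)
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then rebuild the handle from the path.
        self.__dict__.update(state)
        self.processor = FakeProcessor(self.model_path)


tok = PicklableTokenizer("spm.model")
restored = pickle.loads(pickle.dumps(tok))
tokens = restored("hello subword world")
print(tokens)
```

With this pattern the tokenizer stays part of the pipeline, but the model file would have to exist at the same path on the machine where the classifier is loaded, which may or may not suit the deployment.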