Just create an optimizer in the first case (with learn.create_opt()) to avoid the error; I’ll look at the bug when I have time.
Thanks!
After creating the optimizer, the error message changes to the same one I get when trying to export using fp16:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-88-c9fac12e1d4f> in <module>
----> 1 learn.export(fname='export_fwd_relatorio.pkl')
/media/hdd3tb/data/fastai2/fastai2/learner.py in export(self, fname)
609 #To avoid the warning that come from PyTorch about model not being checked
610 warnings.simplefilter("ignore")
--> 611 torch.save(self, self.path/fname)
612 self.create_opt()
613 self.opt.load_state_dict(state)
~/anaconda3/envs/fastai2/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
326
327 with _open_file_like(f, 'wb') as opened_file:
--> 328 _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
329
330
~/anaconda3/envs/fastai2/lib/python3.7/site-packages/torch/serialization.py in _legacy_save(obj, f, pickle_module, pickle_protocol)
399 pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
400 pickler.persistent_id = persistent_id
--> 401 pickler.dump(obj)
402
403 serialized_storage_keys = sorted(serialized_storages.keys())
TypeError: can't pickle SwigPyObject objects
Code:
learn.to_fp32()
learn.create_opt()
learn.export(fname='export_fwd_relatorio.pkl')
Here is something related to SwigPy that may help with debugging:
@fmobrj75 are you using TensorBoard? Or how are you generating your Learner?
Hi, @muellerzr. I am not. Just using the basic fastai2 features. I will test some things. Maybe it is the Sentencepiece tokenizer.
I can export a language model learner built with SentencePiece, so the bug does not come from there.
Thanks. I will dig further.
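(A rough sketch, not from the thread: one way to dig is to try pickling each Learner attribute individually to see which one holds the SwigPyObject.)
import pickle
# Hypothetical debugging helper: report which Learner attributes fail to pickle.
for name, val in learn.__dict__.items():
    try:
        pickle.dumps(val)
    except Exception as e:
        print(name, type(e).__name__, e)  # the offending attribute(s) show up here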
So it looks like mark_fields and rules can be passed in now … but it doesn’t like add_bos or add_eos. Is there a way to pass these in?
custom_tok_rules = defaults.text_proc_rules + [make_replacements, fix_ampm, fix_sentence_ends, fix_hyphenated_words]
%%time
tfms = [
attrgetter('text'),
Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules,
mark_fields=include_fld_tok, add_bos=include_bos_tok, add_eos=include_eos_tok),
Numericalize()
]
lm_dsets = Datasets(items=df,
tfms=[tfms],
splits=ColSplitter(col='is_valid')(df),
dl_type=LMDataLoader)
len(lm_dsets.train), len(lm_dsets.valid)
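(Not from the thread, but for completeness: a plausible next step is to turn these Datasets into language-model DataLoaders; the bs and seq_len values below are just placeholders.)
# Sketch: build LM DataLoaders from the Datasets above and peek at one batch.
lm_dls = lm_dsets.dataloaders(bs=64, seq_len=72)
xb, yb = lm_dls.one_batch()
xb.shape, yb.shape  # expect something like (64, 72) for both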
add_eos and add_bos are arguments of the lowercase rule, so you should adjust that one to your needs.
So I’m assuming this would be the most straightforward way of overriding the behavior of lowercase?
defaults.text_proc_rules[:-1] + [partial(lowercase, add_bos=True, add_eos=True), make_replacements, fix_ampm, fix_sentence_ends, fix_hyphenated_words]
Any updates/thoughts on the below for batch-inference?
Just following up to see if there is a better way to create a single dataloader for batch-inference against a future test dataset.
Maybe it’s already there and I just don’t know. But the goal here is to construct a dataloader that can then be run through an LM to get document vectors for each example. The above works, but it’s kinda janky.
Any chance we can have the __init__ of TextBlock modified to accept an override for numericalization?
From this …
class TextBlock(TransformBlock):
    def __init__(self, tok_tfm, vocab=None, is_lm=False, seq_len=72):
        return super().__init__(type_tfms=[tok_tfm, Numericalize(vocab)],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})
...
… to something like this …
class TextBlock(TransformBlock):
    def __init__(self, tok_tfm, num_tfm=None, vocab=None, is_lm=False, seq_len=72):
        if num_tfm is None: num_tfm = Numericalize(vocab)
        return super().__init__(type_tfms=[tok_tfm, num_tfm],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})
This would make the TextBlock more flexible should one decide to override the default max_vocab and min_freq arguments in Numericalize.
We can probably add them as kwargs using delegates, yes.
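(Not the actual implementation, just a rough sketch of what a delegates-based version might look like, assuming fastcore’s @delegates and the same defaults as the snippet above.)
from fastcore.meta import delegates
class TextBlock(TransformBlock):
    @delegates(Numericalize)
    def __init__(self, tok_tfm, vocab=None, is_lm=False, seq_len=72, **kwargs):
        # extra kwargs such as min_freq or max_vocab are forwarded to Numericalize
        super().__init__(type_tfms=[tok_tfm, Numericalize(vocab, **kwargs)],
                         dl_type=LMDataLoader if is_lm else SortedDL,
                         dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})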
For your other example, your learner is exported since you are talking about inference, so learn.dls.test_dl should work. I’m reluctant to add a DataBlock.test_dl method as it is a bit dangerous: to build the test dl, you need the state from the training dl (for vocab, classes and such things) and the DataBlock does not know them.
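(For reference, a minimal sketch of the learn.dls.test_dl route, not from the thread; the export file name and inf_df are carried over from earlier posts as assumptions.)
# Load the exported learner, build a test DataLoader that reuses the training vocab/transforms,
# then run batch inference on it.
learn = load_learner('export_fwd_relatorio.pkl')   # file name assumed from the export above
test_dl = learn.dls.test_dl(inf_df)
preds, _ = learn.get_preds(dl=test_dl)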
Cool … that was going to be my next suggestion.
I think the use case is different here. Here, we want to get the document vectors produced by an LM for each example in the test set … as such, we can’t use the saved dataloaders used by the LM, since the task there is to predict the next token. What we need is a single dataloader that looks like something built for text classification and that can be iterated through. Something like:
inf_blocks = (TextBlock.from_df(corpus_cols, is_lm=False, vocab=lm_vocab, seq_len=bptt))
inf_dblock = DataBlock(blocks=inf_blocks, get_x=ColReader('text'), dl_type=SortedDL, splitter=None)
inf_dl = inf_dblock.dataloaders(inf_df) # <-- just for inference; to get doc. vecs produced by LM
Lmk if what I’m saying makes sense … if not, I’ll try to rephrase.
You want something that is very specific and a bit wacky (I hadn’t caught on to the change of DataLoader class), and you managed to do it in five lines of code. I call that pretty good. There is no better way to do this for now; I will discuss the DataBlock.test_dl method with Jeremy, but it seems a bit dangerous for the reasons I expressed earlier.
Can we do that for pad_input_chunk as well? It would be useful in cases where we need to do the padding at the end (the default is in front and can’t be changed in the current implementation).
btw, love this in v.2 for ensuring you get your predictions back in the right order, should you by chance be using some kind of sorting in your DataLoader:
preds = preds[np.argsort(test_dl.get_idxs())]
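(Spelled out end to end as a sketch, with the names above assumed:)
import numpy as np
# Run inference on a possibly re-ordered DataLoader, then restore the original item order.
preds, _ = learn.get_preds(dl=test_dl)
preds = preds[np.argsort(test_dl.get_idxs())]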
Love it!
This worked very well for me:
test_ds = Datasets(df_test[df_test['is_test']==True], tfms=[x_tfms])
test_dl = test_ds.dataloaders(bs=32, before_batch=pad, seq_len=72)
preds = learn.get_preds(dl=test_dl.train)
Yah that looks nice (basically what I did in v1).
Yup, works great … beautiful:
tfms = [
attrgetter('text'),
Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, mark_fields=include_fld_tok),
Numericalize(vocab=lm_vocab, min_freq=min_freq, max_vocab=max_vocab)
]
test_ds = Datasets(items=inf_df, tfms=[tfms], dl_type=SortedDL)
test_dls = test_ds.dataloaders(bs=bsz, seq_len=bptt, before_batch=partial(pad_input_chunk, pad_first=False))
# use the test_dls.train dataloader for batch inference!
len(inf_df), test_dls.n, len(test_dls.train), len(test_dls.valid)
# (11612, 11612, 90, 0)
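(Not from the thread: one possible way to turn the inference DataLoader above into per-document vectors, assuming an AWD-LSTM language_model_learner where learn.model[0] is the encoder and returns activations of shape (bs, seq_len, emb_sz); padding and device handling are glossed over.)
import torch
learn.model.eval()
doc_vecs = []
with torch.no_grad():
    for (xb,) in test_dls.train:          # batches of numericalized token ids
        learn.model.reset()               # reset the RNN hidden state between batches
        enc = learn.model[0](xb)          # encoder activations, assumed (bs, seq_len, emb_sz)
        doc_vecs.append(enc.mean(dim=1))  # naive mean-pool over the sequence dimension
doc_vecs = torch.cat(doc_vecs)
# Remember to undo any SortedDL re-ordering via test_dls.train.get_idxs() (see above).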
For the LMLearner … is beam search (and, for that matter, other text generation techniques like top-k sampling and top-p/nucleus sampling) built in?
Back in v.1 we were able to do something like this:
learn.beam_search('The worse thing about parking is ', n_words=40, beam_sz=200)
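(Not an answer from the thread: for plain sampled generation, something along these lines may work in v2, assuming LMLearner.predict accepts n_words and temperature; beam search itself would need a separate implementation.)
# Sketch: sampled text generation with a v2 language-model learner (not beam search).
txt = learn.predict('The worse thing about parking is ', n_words=40, temperature=0.75)
print(txt)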