Fastai v2 text

Just create an optimizer in the first case (with learn.create_opt()) to avoid the error, I’ll look at the bug when I have time.
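
For reference, the suggested workaround is just this (a minimal sketch, assuming a trained Learner is in scope; the filename is only an example):

learn.create_opt()                  # create the optimizer first (the export step expects one to exist)
learn.export(fname='export.pkl')    # example filename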

Thanks!

After creating the optimizer, the error message changes to the same message I get when trying to export using fp16:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-88-c9fac12e1d4f> in <module>
----> 1 learn.export(fname='export_fwd_relatorio.pkl')

/media/hdd3tb/data/fastai2/fastai2/learner.py in export(self, fname)
    609         #To avoid the warning that come from PyTorch about model not being checked
    610         warnings.simplefilter("ignore")
--> 611         torch.save(self, self.path/fname)
    612     self.create_opt()
    613     self.opt.load_state_dict(state)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    326 
    327     with _open_file_like(f, 'wb') as opened_file:
--> 328         _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
    329 
    330 

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/torch/serialization.py in _legacy_save(obj, f, pickle_module, pickle_protocol)
    399     pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
    400     pickler.persistent_id = persistent_id
--> 401     pickler.dump(obj)
    402 
    403     serialized_storage_keys = sorted(serialized_storages.keys())

TypeError: can't pickle SwigPyObject objects 

Code:

learn.to_fp32()
learn.create_opt()
learn.export(fname='export_fwd_relatorio.pkl')

Related to SwigPy, this may help with debugging:

@fmobrj75 are you using TensorBoard? Or how are you generating your Learner?

https://github.com/pytorch/pytorch/issues/32046

Hi, @muellerzr. I am not. Just using the basic fastai2 features. I will test some things. Maybe it is the SentencePiece tokenizer.

I can export a language model learner built with SentencePiece, so the bug does not come from here.

Thanks. I will dig further.
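
One way to narrow down which attribute of the Learner holds the unpicklable SwigPyObject (a hedged sketch in plain Python, assuming the Learner is in scope as learn):

import pickle

# try to pickle each attribute on its own and report the ones that fail
for name, attr in learn.__dict__.items():
    try:
        pickle.dumps(attr)
    except Exception as e:
        print(name, '->', e)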

So it looks like mark_fields and rules can be passed in now … but it doesn’t like add_bos or add_eos. Is there a way to pass these in?

custom_tok_rules = defaults.text_proc_rules + [make_replacements, fix_ampm, fix_sentence_ends, fix_hyphenated_words]

%%time

tfms = [
    attrgetter('text'), 
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, 
                      mark_fields=include_fld_tok, add_bos=include_bos_tok, add_eos=include_eos_tok), 
    Numericalize()
]

lm_dsets = Datasets(items=df,
                    tfms=[tfms], 
                    splits=ColSplitter(col='is_valid')(df), 
                    dl_type=LMDataLoader)

len(lm_dsets.train), len(lm_dsets.valid)

add_eos and add_bos are arguments of the lowercase rule, so you should adjust that rule to your needs.

So I’m assuming this would be the most straightforward way of overriding the behavior of lowercase?

defaults.text_proc_rules[:-1] + [partial(lowercase, add_bos=True, add_eos=True), make_replacements, fix_ampm, fix_sentence_ends, fix_hyphenated_words]

Any updates/thoughts on the below for batch-inference?

Just following up to see if there is a better way to create a single dataloader for batch-inference against a future test dataset.

Maybe it’s already there and I just don’t know. But the goal here is to construct a dataloader that can then be run through an LM to get document vectors for each example. The above works, but it’s kinda janky.

Any chance we can have the __init__ for TextBlock modified to accept an override for numericalization?

From this …

class TextBlock(TransformBlock):
    def __init__(self, tok_tfm, vocab=None, is_lm=False, seq_len=72):
        return super().__init__(type_tfms=[tok_tfm, Numericalize(vocab)],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})
...

… to something like this …

class TextBlock(TransformBlock):
    def __init__(self, tok_tfm, num_tfm=None, vocab=None, is_lm=False, seq_len=72):
        if num_tfm is None: num_tfm = Numericalize(vocab)
        return super().__init__(type_tfms=[tok_tfm, num_tfm],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})

This would make the TextBlock more flexible should one decide to override the default max_vocab and min_freq arguments in Numericalize.

We can probably add them as kwargs using delegates, yes.
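
A hedged sketch of what that might look like with fastcore's delegates (just an illustration of the idea, not the actual fastai2 change):

from fastcore.meta import delegates

class TextBlock(TransformBlock):
    @delegates(Numericalize)
    def __init__(self, tok_tfm, vocab=None, is_lm=False, seq_len=72, **kwargs):
        # extra kwargs such as min_freq or max_vocab are forwarded to Numericalize
        return super().__init__(type_tfms=[tok_tfm, Numericalize(vocab, **kwargs)],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})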

For your other example, your learner is exported since you are talking about inference, so learn.dls.test_dl should work. I’m reluctant to add a DataBlock.test_dl method as it is a bit dangerous: to build the test dl, you need the state from the training dl (for vocab, classes and such things) and the DataBlock does not know them.
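
For reference, that path would look roughly like this (a hedged sketch, assuming a loaded learner and a dataframe inf_df of new rows):

test_dl = learn.dls.test_dl(inf_df)   # reuses the vocab and transforms from the training DataLoaders
preds, _ = learn.get_preds(dl=test_dl)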

Cool … that was going to be my next suggestion.

I think the use case is different here. Here, we want to get the document vectors produced by a LM for each example in the test set … as such, we can’t use the saved dataloaders used by the LM since the task there is to predict the next token. What we need is a single dataloader that looks like something built for text classification that can be iterated through. Something like:

inf_blocks = (TextBlock.from_df(corpus_cols, is_lm=False, vocab=lm_vocab, seq_len=bptt))
inf_dblock = DataBlock(blocks=inf_blocks, get_x=ColReader('text'), dl_type=SortedDL, splitter=None)

inf_dl = inf_dblock.dataloaders(inf_df) # <-- just for inference; to get doc. vecs produced by LM

Lmk if what I’m saying makes sense … if not, I’ll try to rephrase.

You want something that is very specific and a bit wacky (I hadn’t caught on to the change of DataLoader class), and you managed to do it in five lines of code. I call that pretty good :wink: There is no better way to do this for now; I will discuss with Jeremy about the DataBlock.test_dl method, but it seems a bit dangerous for the reasons I expressed earlier.

Can we do that for pad_input_chunk as well? Useful in cases where we need to do the padding at the end (the default is in front and can’t be changed in the current implementation).

btw, love this in v.2 for ensuring you get your predictions back in the right order should you by chance be using some kind of sorting in your DataLoader:

preds = preds[np.argsort(test_dl.get_idxs())]

Love it!

This worked very well for me:

test_ds = Datasets(df_test[df_test['is_test']==True], tfms=[x_tfms])
test_dl = test_ds.dataloaders(bs=32, before_batch=pad, seq_len=72)
preds = learn.get_preds(dl=test_dl.train)

Yah that looks nice (basically what I did in v1).

Yup, works great … beautiful:

tfms = [
    attrgetter('text'), 
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, mark_fields=include_fld_tok), 
    Numericalize(vocab=lm_vocab, min_freq=min_freq, max_vocab=max_vocab)
]

test_ds = Datasets(items=inf_df, tfms=[tfms], dl_type=SortedDL)
test_dls = test_ds.dataloaders(bs=bsz, seq_len=bptt, before_batch=partial(pad_input_chunk, pad_first=False))

# use the test_dls.train dataloader for batch inference!
len(inf_df), test_dls.n, len(test_dls.train), len(test_dls.valid)
# (11612, 11612, 90, 0)

For the LMLearner … is beam search (and for that matter other text generation techniques like top-k sampling and top-p/nucleus sampling) built in?

Back in v.1 we were able to do something like this:

learn.beam_search('The worse thing about parking is ', n_words=40, beam_sz=200)