Fastai v2 text

Just create an optimizer in the first case (with learn.create_opt()) to avoid the error, I’ll look at the bug when I have time.
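
For reference, the suggested workaround is just this (a minimal sketch, assuming a trained Learner is in scope; the filename is only an example):

learn.create_opt()                  # create the optimizer first (the export step expects one to exist)
learn.export(fname='export.pkl')    # example filename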

Thanks!

After creating the optimizer, the error message changes to the same message I get when trying to export using fp16:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-88-c9fac12e1d4f> in <module>
----> 1 learn.export(fname='export_fwd_relatorio.pkl')

/media/hdd3tb/data/fastai2/fastai2/learner.py in export(self, fname)
    609         #To avoid the warning that come from PyTorch about model not being checked
    610         warnings.simplefilter("ignore")
--> 611         torch.save(self, self.path/fname)
    612     self.create_opt()
    613     self.opt.load_state_dict(state)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    326 
    327     with _open_file_like(f, 'wb') as opened_file:
--> 328         _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
    329 
    330 

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/torch/serialization.py in _legacy_save(obj, f, pickle_module, pickle_protocol)
    399     pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
    400     pickler.persistent_id = persistent_id
--> 401     pickler.dump(obj)
    402 
    403     serialized_storage_keys = sorted(serialized_storages.keys())

TypeError: can't pickle SwigPyObject objects 

Code:

learn.to_fp32()
learn.create_opt()
learn.export(fname='export_fwd_relatorio.pkl')

Related to SwigPy, this may help with debugging:

@fmobrj75 are you using TensorBoard? Or how are you generating your Learner?

https://github.com/pytorch/pytorch/issues/32046

Hi, @muellerzr. I am not. Just using the basic fastai2 features. I will test some things. Maybe it is the SentencePiece tokenizer.

I can export a language model learner built with SentencePiece, so the bug does not come from here.

Thanks. I will dig further.
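
One way to narrow down which attribute of the Learner holds the unpicklable SwigPyObject (a hedged sketch in plain Python, assuming the Learner is in scope as learn):

import pickle

# try to pickle each attribute on its own and report the ones that fail
for name, attr in learn.__dict__.items():
    try:
        pickle.dumps(attr)
    except Exception as e:
        print(name, '->', e)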

So it looks like mark_fields and rules can be passed in now … but it doesn’t like add_bos or add_eos. Is there a way to pass these in?

custom_tok_rules = defaults.text_proc_rules + [make_replacements, fix_ampm, fix_sentence_ends, fix_hyphenated_words]

%%time

tfms = [
    attrgetter('text'), 
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, 
                      mark_fields=include_fld_tok, add_bos=include_bos_tok, add_eos=include_eos_tok), 
    Numericalize()
]

lm_dsets = Datasets(items=df,
                    tfms=[tfms], 
                    splits=ColSplitter(col='is_valid')(df), 
                    dl_type=LMDataLoader)

len(lm_dsets.train), len(lm_dsets.valid)

add_eos and add_bos are arguments of the lowercase rule, so you should adjust that rule to your needs.

So I’m assuming this would be the most straightforward way of overriding the behavior of lowercase?

defaults.text_proc_rules[:-1] + [partial(lowercase, add_bos=True, add_eos=True), make_replacements, fix_ampm, fix_sentence_ends, fix_hyphenated_words]

Any updates/thoughts on the below for batch-inference?

Just following up to see if there is a better way to create a single dataloader for batch-inference against a future test dataset.

Maybe it’s already there and I just don’t know. But the goal here is to construct a dataloader that can then be run through an LM to get document vectors for each example. The above works, but it’s kinda janky.

Any chance we can have the __init__ for TextBlock modified to accept an override for numericalization?

From this …

class TextBlock(TransformBlock):
    def __init__(self, tok_tfm, vocab=None, is_lm=False, seq_len=72):
        return super().__init__(type_tfms=[tok_tfm, Numericalize(vocab)],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})
...

… to something like this …

class TextBlock(TransformBlock):
    def __init__(self, tok_tfm, num_tfm=None, vocab=None, is_lm=False, seq_len=72):
        if num_tfm is None: num_tfm = Numericalize(vocab)
        return super().__init__(type_tfms=[tok_tfm, num_tfm],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})

This would make the TextBlock more flexible should one decide to override the default max_vocab and min_freq arguments in Numericalize.

We can probably add them as kwargs using delegates, yes.
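
A hedged sketch of what that might look like with fastcore's delegates (just an illustration of the idea, not the actual fastai2 change):

from fastcore.meta import delegates

class TextBlock(TransformBlock):
    @delegates(Numericalize)
    def __init__(self, tok_tfm, vocab=None, is_lm=False, seq_len=72, **kwargs):
        # extra kwargs such as min_freq or max_vocab are forwarded to Numericalize
        return super().__init__(type_tfms=[tok_tfm, Numericalize(vocab, **kwargs)],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})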

For your other example, your learner is exported since you are talking about inference, so learn.dls.test_dl should work. I’m reluctant to add a DataBlock.test_dl method as it is a bit dangerous: to build the test dl, you need the state from the training dl (for vocab, classes and such things) and the DataBlock does not know them.
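
For reference, that path would look roughly like this (a hedged sketch, assuming a loaded learner and a dataframe inf_df of new rows):

test_dl = learn.dls.test_dl(inf_df)   # reuses the vocab and transforms from the training DataLoaders
preds, _ = learn.get_preds(dl=test_dl)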

Cool … that was going to be my next suggestion.

I think the use case is different here. Here, we want to get the document vectors produced by a LM for each example in the test set … as such, we can’t use the saved dataloaders used by the LM since the task there is to predict the next token. What we need is a single dataloader that looks like something built for text classification that can be iterated through. Something like:

inf_blocks = (TextBlock.from_df(corpus_cols, is_lm=False, vocab=lm_vocab, seq_len=bptt))
inf_dblock = DataBlock(blocks=inf_blocks, get_x=ColReader('text'), dl_type=SortedDL, splitter=None)

inf_dl = inf_dblock.dataloaders(inf_df) # <-- just for inference; to get doc. vecs produced by LM

Lmk if what I’m saying makes sense … if not, I’ll try to rephrase.

You want something that is very specific and a bit wacky (I hadn’t caught on to the change of DataLoader class), and you managed to do it in five lines of code. I call that pretty good :wink: There is no better way to do this for now; I will discuss with Jeremy about the DataBlock.test_dl method, but it seems a bit dangerous for the reasons I expressed earlier.

Can we do that for pad_input_chunk as well? Useful in cases where we need to do the padding at the end (the default is in front and can’t be changed in the current implementation).

btw, love this in v.2 for ensuring you get your predictions back in the right order should you by chance be using some kind of sorting in your DataLoader:

preds = preds[np.argsort(test_dl.get_idxs())]

Love it!

This worked very well for me:

test_ds = Datasets(df_test[df_test['is_test']==True], tfms=[x_tfms])
test_dl = test_ds.dataloaders(bs=32, before_batch=pad, seq_len=72)
preds = learn.get_preds(dl=test_dl.train)

Yah that looks nice (basically what I did in v1).

Yup, works great … beautiful:

tfms = [
    attrgetter('text'), 
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, mark_fields=include_fld_tok), 
    Numericalize(vocab=lm_vocab, min_freq=min_freq, max_vocab=max_vocab)
]

test_ds = Datasets(items=inf_df, tfms=[tfms], dl_type=SortedDL)
test_dls = test_ds.dataloaders(bs=bsz, seq_len=bptt, before_batch=partial(pad_input_chunk, pad_first=False))

# use the test_dls.train dataloader for batch inference!
len(inf_df), test_dls.n, len(test_dls.train), len(test_dls.valid)
# (11612, 11612, 90, 0)

For the LMLearner … is beam search (and for that matter other text generation techniques like top-k sampling and top-p/nucleus sampling) built in?

Back in v.1 we were able to do something like this:

learn.beam_search('The worse thing about parking is ', n_words=40, beam_sz=200)