Fastai v2 text

chess · April 2, 2020, 6:32pm

When doing a categorical prediction:

imdb_clas = DataBlock(blocks=(TextBlock.from_df(['days','comments'], vocab=dbunch.vocab), CategoryBlock),
                      get_x=attrgetter('text'),
                      get_y=attrgetter('liked'),
                      splitter=TrainTestSplitter(test_size = 0.2, stratify=df_scores['liked'], random_state = 12))

dbunch_class = imdb_clas.dataloaders(df, bs=32, seq_len=80)

learn = text_classifier_learner(dbunch_class, AWD_LSTM, drop_mult=0.5, metrics=[accuracy, Perplexity()], wd=0.1).to_fp16()
learn = learn.load_encoder('finetuned6_208.pkl')

I’m able to iterate through and get a mapping of all the categories along with their indices. For example, with index 0:

learn.dls.categorize.decode(0)
'happy'

When I change this to a multicategory datablock, and target a different column which has comma-separated values:

imdb_clas = DataBlock(blocks=(TextBlock.from_df(['days','comments'], vocab=dbunch.vocab), MultiCategoryBlock),
                      get_x=attrgetter('text'),
                      splitter=RandomSplitter(seed = 42),
                      get_y=ColReader(3, label_delim=','))

dbunch_class = imdb_clas.dataloaders(df, bs=32, seq_len=80)


learn = text_classifier_learner(dbunch_class, AWD_LSTM, drop_mult=0.5, metrics=[accuracy_multi, Perplexity()], wd = 0.1).to_fp16()
learn = learn.load_encoder('finetuned6_208.pkl')

The same command gives me an error:

learn.dls.categorize.decode(0)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-213-cc679bf817bd> in <module>
----> 1 learn.dls.categorize.decode(0)

~/git_packages/fastcore/fastcore/foundation.py in __getattr__(self, k)
    228         if self._component_attr_filter(k):
    229             attr = getattr(self,self._default,None)
--> 230             if attr is not None: return getattr(attr,k)
    231         raise AttributeError(k)
    232     def __dir__(self): return custom_dir(self,self._dir())

~/git_packages/fastcore/fastcore/foundation.py in __getattr__(self, k)
    228         if self._component_attr_filter(k):
    229             attr = getattr(self,self._default,None)
--> 230             if attr is not None: return getattr(attr,k)
    231         raise AttributeError(k)
    232     def __dir__(self): return custom_dir(self,self._dir())

~/git_packages/fastai2/fastai2/data/core.py in __getattr__(self, k)
    285         return res if is_indexer(it) else list(zip(*res))
    286 
--> 287     def __getattr__(self,k): return gather_attrs(self, k, 'tls')
    288     def __dir__(self): return super().__dir__() + gather_attr_names(self, 'tls')
    289     def __len__(self): return len(self.tls[0])

~/git_packages/fastcore/fastcore/transform.py in gather_attrs(o, k, nm)
    151     att = getattr(o,nm)
    152     res = [t for t in att.attrgot(k) if t is not None]
--> 153     if not res: raise AttributeError(k)
    154     return res[0] if len(res)==1 else L(res)
    155 

AttributeError: categorize

So when I do a get_preds(), I can’t tell which predictions are for which categories:

a, _ = learn.get_preds()
a[0]
tensor([0.8300, 0.1622, 0.4473, 0.2733, 0.0688, 0.0237, 0.0163, 0.0094, 0.0798,
        0.4658, 0.0957, 0.0152, 0.0296, 0.1708])

In fastai version 1, I can do this with c2i:

learn.data.c2i
{'team1': 0, 'team2': 1, 'team3': 2}

Is there an equivalent for the MultiCategoryBlock in fastai version 2?

sgugger · April 2, 2020, 7:08pm

There is no categorize attribute because you don’t have a Categorize transform. In this case, it’s multi_categorize you are looking at, or the encoded version.
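
For the original question (which prediction belongs to which category), here is a minimal sketch in plain Python, assuming the label vocabulary can be read as an ordered list. In fastai v2 that list should be reachable as the `vocab` attribute of the transform (something like `learn.dls.multi_categorize.vocab`), but treat that as an assumption to verify. The `team1` to `team14` names, their order, and the 0.5 threshold are placeholders.

```python
# Hypothetical label vocabulary, in the order the model outputs them.
# With fastai v2 this would presumably come from learn.dls.multi_categorize.vocab.
vocab = [f"team{i}" for i in range(1, 15)]

# One row of predictions (values copied from the get_preds() output above).
probs = [0.8300, 0.1622, 0.4473, 0.2733, 0.0688, 0.0237, 0.0163, 0.0094,
         0.0798, 0.4658, 0.0957, 0.0152, 0.0296, 0.1708]

# Equivalent of fastai v1's c2i: label -> index.
c2i = {label: i for i, label in enumerate(vocab)}

# Pair every probability with its label.
label_probs = dict(zip(vocab, probs))

# Keep the labels whose probability clears the (assumed) threshold.
thresh = 0.5
predicted = [label for label, p in label_probs.items() if p > thresh]
print(predicted)
```

With a mapping like this, each position in a prediction row can be read off by label instead of by bare index.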

chess · April 2, 2020, 7:46pm

Perfect! I checked the code and tried multicategorize, but not multi_categorize!

I’m not sure how I could’ve found this on my own. I searched fastai2, and then the entire fastai org’s GitHub repos, and found no hits.

In any case, I got it to work, but the results aren’t what I expected:

learn.dls.multi_categorize.decode("0")
(#1) [(#1) ['team1']]

Works as expected.

learn.dls.multi_categorize.decode("10")
(#2) [(#1) ['team2'],(#1) ['team1']]

Instead of showing team11 as expected, it shows a combination of the two teams. It also goes up much higher than my 14 categories.

learn.dls.multi_categorize.decode("5000")
(#4) [(#1) ['team6'],(#1) ['team1'],(#1) ['team1'],(#1) ['team1']]

I’ve confirmed that team11 does indeed show up in the dataset. Am I using it wrong?

sgugger · April 2, 2020, 7:48pm

I’m not sure how you found categorize on your own either then. All transforms are accessible as attributes with camel2snake names. Since you have a MultiCategorize, it becomes a multi_categorize attribute.
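
To illustrate the naming rule, here is a small sketch of a CamelCase to snake_case conversion. It is an illustrative re-implementation written for this thread, not code copied from fastcore, and `camel_to_snake` is a made-up name.

```python
import re

def camel_to_snake(name):
    "Illustrative CamelCase -> snake_case conversion (not fastcore's own code)."
    # Put "_" before a capitalised word that follows another character:
    # "MultiCategorize" -> "Multi_Categorize"
    s = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Put "_" between a lowercase letter or digit and a following capital.
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    return s.lower()

for cls_name in ["Categorize", "MultiCategorize", "EncodedMultiCategorize"]:
    print(cls_name, "->", camel_to_snake(cls_name))
```

So a transform's attribute name can be guessed from its class name by lowercasing it and adding an underscore at each word boundary.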

Otherwise, that transform takes lists of labels, so your “10” is interpreted as “1” and “0”. You should pass [“10”].
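
The string-versus-list behaviour is plain Python iteration, and can be reproduced without fastai. In the sketch below, `vocab` and `decode_labels` are toy stand-ins chosen to mirror the outputs quoted in this thread; they are not the library's implementation.

```python
# Toy vocabulary keyed by string, mirroring the outputs quoted in this thread
# ("0" -> team1, "10" -> team11). Purely illustrative.
vocab = {str(i): f"team{i + 1}" for i in range(14)}

def decode_labels(keys):
    "Stand-in for a decoder that expects a *list* of keys."
    return [vocab[k] for k in keys]

# A bare string is itself iterable, so it is walked character by character.
print(decode_labels("10"))    # looks up "1" and then "0"

# Wrapping it in a list makes "10" a single key.
print(decode_labels(["10"]))  # looks up "10"
```

The same effect explains why "5000" decoded to four labels: it was read as the four keys "5", "0", "0", "0".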

chess · April 2, 2020, 8:19pm

Thanks again!

I found the code:

https://github.com/fastai/fastai2/blob/d0ad07bf77e43126df1f19a01946e8a100955f46/fastai2/data/transforms.py

And was able to get it working with:

learn.dls.multi_categorize.decodes([["10"]])
(#1) [(#1) ['team11']]

Thanks!!

zerotosingularity · April 2, 2020, 8:42pm

I’m still facing this problem:

PyTorch: 1.4.0
Fastai2: 0.0.16
Cuda: 10.2

I’m wondering whether this might be caused by my data, but from what I can tell, the inputs are the same as the IMDB notebook.

Any ideas for debugging would be welcome…

waydegg · April 3, 2020, 12:33am

Cuda 10.2. Using latest torch/fastai2/fastcore

hello34 · April 5, 2020, 1:47pm

Can we do entity level sentiment analysis with fastaiv2 text library?

waydegg · April 6, 2020, 12:50am

I’ve tried using both the Datasets and Datablock API to create a Dataloader, and both times I get RuntimeError: Could not infer dtype of method.

Here’s the code I’m trying to run (using the Datasets api):

x_tfms = [
    attrgetter('text'),
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, mark_fields=include_fld_tok),
    Numericalize(vocab=vocab)
]

y_tfms = [
    ColReader(LABELS),
    EncodedMultiCategorize(vocab=LABELS)
]

dsets = Datasets(items=df,
                 tfms=[x_tfms, y_tfms],
                 splits=ColSplitter(col='is_valid')(df),
                 dl_type=SortedDL)

And here’s the stack trace:


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-45-742d546a771e> in <module>
     13                  tfms=[x_tfms, y_tfms],
     14                  splits=ColSplitter(col='is_valid')(df),
---> 15                  dl_type=SortedDL)

~/development/_training/fastai2/fastai2/data/core.py in __init__(self, items, tfms, tls, n_inp, dl_type, **kwargs)
    278     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
    279         super().__init__(dl_type=dl_type)
--> 280         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    281         self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))
    282 

~/development/_training/fastai2/fastai2/data/core.py in <listcomp>(.0)
    278     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
    279         super().__init__(dl_type=dl_type)
--> 280         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    281         self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))
    282 

~/anaconda3/envs/patternai/lib/python3.7/site-packages/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)

~/development/_training/fastai2/fastai2/data/core.py in __init__(self, items, tfms, use_list, do_setup, split_idx, train_setup, splits, types, verbose)
    216         if do_setup:
    217             pv(f"Setting up {self.tfms}", verbose)
--> 218             self.setup(train_setup=train_setup)
    219 
    220     def _new(self, items, split_idx=None, **kwargs):

~/development/_training/fastai2/fastai2/data/core.py in setup(self, train_setup)
    238             for f in self.tfms.fs:
    239                 self.types.append(getattr(f, 'input_types', type(x)))
--> 240                 x = f(x)
    241             self.types.append(type(x))
    242         types = L(t if is_listy(t) else [t] for t in self.types).concat().unique()

~/anaconda3/envs/patternai/lib/python3.7/site-packages/fastcore/transform.py in __call__(self, x, **kwargs)

~/anaconda3/envs/patternai/lib/python3.7/site-packages/fastcore/transform.py in _call(self, fn, x, split_idx, **kwargs)

~/anaconda3/envs/patternai/lib/python3.7/site-packages/fastcore/transform.py in _do_call(self, f, x, **kwargs)

~/anaconda3/envs/patternai/lib/python3.7/site-packages/fastcore/dispatch.py in __call__(self, *args, **kwargs)

~/development/_training/fastai2/fastai2/data/transforms.py in encodes(self, o)
    264     loss_func,order=BCEWithLogitsLossFlat(),1
    265     def __init__(self, vocab): self.vocab,self.c = vocab,len(vocab)
--> 266     def encodes(self, o): return TensorMultiCategory(tensor(o).float())
    267     def decodes(self, o): return MultiCategory (one_hot_decode(o, self.vocab))
    268 

~/development/_training/fastai2/fastai2/torch_core.py in tensor(x, *rest, **kwargs)
    111            else _array2tensor(x) if isinstance(x, ndarray)
    112            else as_tensor(x.values, **kwargs) if isinstance(x, (pd.Series, pd.DataFrame))
--> 113            else as_tensor(x, **kwargs) if hasattr(x, '__array__') or is_iter(x)
    114            else _array2tensor(array(x), **kwargs))
    115     if res.dtype is torch.float64: return res.float()

RuntimeError: Could not infer dtype of method

At first I thought it maybe had something to do with a couple of custom methods I have in custom_tok_rules, however they all worked completely fine when I was making another Dataset to train a Language Model earlier.

Anyone have any ideas?

sgugger · April 6, 2020, 1:05pm

You should use the debugger to check what the encodes method receives for o.
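
In a notebook, running `%debug` right after the exception opens a post-mortem debugger in the failing frame. As a lighter alternative, here is a sketch of a wrapper that reports what a function receives before it runs. `show_input` and `encode` are hypothetical names made up for this example, and `encode` only stands in for the real `encodes` method.

```python
import functools

def show_input(fn):
    "Wrap `fn` so it reports what it receives before running."
    @functools.wraps(fn)
    def wrapper(o):
        kind = "callable (method/function)" if callable(o) else "value"
        print(f"{fn.__name__} received a {kind} of type {type(o).__name__}: {o!r}")
        return fn(o)
    return wrapper

@show_input
def encode(o):
    # Stand-in for the real `encodes`; it just converts the items to floats.
    return [float(v) for v in o]

encode([0, 1, 1])        # fine: a list of ints

row = {"a": 1}
try:
    encode(row.keys)     # a bound method was passed instead of data (no parentheses)
except TypeError as e:
    print("failed:", e)
```

An error mentioning "method" usually means that an attribute lookup upstream returned a method object rather than the data, which is the kind of thing this report makes visible.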

lgvaz · April 6, 2020, 9:36pm

I’m starting to get familiar with NLP and I have a question about what is happening inside .predict.

The first thing we do is reset the model:

self.model.reset()

In order to predict n_words we do a for loop and feed forward the model for each prediction:

        for _ in (range(n_words) if no_bar else progress_bar(range(n_words), leave=False)):
            with self.no_bar(): preds,_ = self.get_preds(dl=[(idxs[None],)])

In the first iteration, idxs will be the starting text we provide, let’s say it’s:
The answer to the Universe and

We then predict the next word (idx) (let’s say it was everything) and do:

idxs = torch.cat([idxs, idxs.new([idx])])

We then feed idxs to our model again. And this is the confusing part: since we are not resetting our model, the model is effectively seeing the following:

The answer to the Universe and The answer to the Universe and everything

And so on…

Shouldn’t we just feed the new generated word to our model at this point? The hidden state should take care of the previous words.

sgugger · April 6, 2020, 9:48pm

Good question!
You can feed it just the new index (when dealing with LSTMs or other stateful models), or you can feed it everything from the beginning again. Surprisingly, the latter gives better results if your prompt is small (it also works with any model, not just LSTMs). Don’t forget that the BOS token is fed each time, so the model knows to reset itself.

So it’s kind of a hack/mess, but I left it that way. You can certainly add the model reset at each loop and tell us what you see (or adapt the function for predicting just based on the new index). I guess we could add an argument use_state to trigger this behavior.
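
To make the two strategies concrete, here is a toy sketch in plain Python. `ToyStatefulModel` is a hypothetical stand-in whose "hidden state" is just the list of tokens it has been fed; it is not AWD_LSTM and it ignores the BOS token mentioned above. It only shows what a stateful model that is reset once is fed under each loop.

```python
class ToyStatefulModel:
    "Hypothetical stateful model: remembers every token it has been fed."
    def __init__(self):
        self.seen = []

    def reset(self):
        self.seen = []

    def next_token(self, tokens):
        # Feeding tokens updates the state; the 'prediction' is a fake, deterministic token.
        self.seen.extend(tokens)
        return f"w{len(self.seen)}"

def generate_refeed(model, prompt, n_words):
    "Reset once, then feed the whole growing sequence at every step."
    model.reset()
    idxs = list(prompt)
    for _ in range(n_words):
        idxs.append(model.next_token(idxs))
    return idxs

def generate_last_only(model, prompt, n_words):
    "Reset once, feed the prompt, then feed only the newest token."
    model.reset()
    idxs = list(prompt)
    new = list(prompt)
    for _ in range(n_words):
        tok = model.next_token(new)
        idxs.append(tok)
        new = [tok]
    return idxs

model = ToyStatefulModel()
generate_refeed(model, ["The", "answer"], 2)
print(model.seen)   # the prompt shows up twice in what the model was fed

model = ToyStatefulModel()
generate_last_only(model, ["The", "answer"], 2)
print(model.seen)   # each token was fed exactly once
```

The second loop does constant work per generated word, while the first re-processes the whole sequence each time, which matches the speed difference one would expect between the two.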

waydegg · April 6, 2020, 9:48pm

So I found a couple things:

1.) My columns had slashes ("/") in them; removing them made that error stop. (Edit: ‘&’ characters cause this too.)

2.) I was then getting a problem with ColSplitter where my training and validation datasets were the same size (and my validation values were weirdly all the same item just repeated a bunch of times). I then realized that is_valid needed to be all Boolean values, not int, which is what I had before.

I’m sure other people may have run into these problems so I hope this helps!
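
A possible explanation for the second symptom, sketched in plain Python: if the splitter treats boolean values as a mask but anything else as explicit row indices, then a column of 0/1 integers selects rows 0 and 1 over and over. `to_indices` below is a toy helper written under that assumption, not the library's code.

```python
def to_indices(mask):
    "Toy helper: treat booleans as a mask, anything else as explicit indices."
    mask = list(mask)
    if mask and isinstance(mask[0], bool):
        return [i for i, m in enumerate(mask) if m]
    return [int(i) for i in mask]

is_valid_bool = [False, True, False, True, True]
is_valid_int = [0, 1, 0, 1, 1]

print(to_indices(is_valid_bool))  # positions of the True entries
print(to_indices(is_valid_int))   # the 0/1 values themselves, used as row indices
```

If that is what is happening, converting the column first (for example `df['is_valid'] = df['is_valid'].astype(bool)`) avoids the problem.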

lgvaz · April 6, 2020, 10:06pm

And it gives way better results, try to spot the one that uses only the last word and the one that uses the complete sentence:

this movie , from Steven Spielberg 's dreamworks Pictures , is a great achievement in animated movies . It tells a story of biblical proportions ( literally ! ) about slave Xxunk ’ life as thanks nevermore teleporter and Krause examined by a very small subsequent times locals warned that the Hitchcock heaven mystic . Peter Lesbians , although the commanders and party , fresh air pai . Dillane 's first -

this movie , from Steven Spielberg 's dreamworks Pictures , is a great achievement in animated movies . It tells a story of biblical proportions ( literally ! ) about slave Xxunk ’ life as a mother of children and mussolini in the aftermath of the Project , black capital , oil transportation , church imaging and red Sunlight guitar . The film sequel , Lucius J. Cohen 's Meet

The first one uses only that last word, the second one “sounds better”? lol
The first one is much faster to compute though so I’ll have to go with that for my Electra style experiment

I could make it a PR
use_state might be a confusing name: does it mean we only rely on the hidden state, or that we feed the sentence over and over again? Maybe we can try something like repeat_sentence or only_last_word?

sgugger · April 6, 2020, 10:24pm

A PR would be welcome yes. only_last_word is a bit of a mouthful but I can’t think of anything better right now. Me saying use_state was to highlight the fact it only works for stateful models but it’s not ideal, I agree.

lgvaz · April 7, 2020, 3:06am

I need to get the numericalized results from .predict without the removal of BOS and PAD tokens.

Is it a good idea to add that to the lib or is it too specific to my use case?

In a vision learner, predict also returns the preds without being decoded, right? So I think this might be analogous to that.

sgugger · April 7, 2020, 11:16am

I would avoid too much complexity in the prediction methods of the LMLearner right now: the source code is easy to grab and modify and we made it really easy for the user to patch their own predict methods. The idea is to have something basic that can tell us how good (or bad) our language model is, but for more advanced stuff (there’s also beam search for instance), I’d recommend the huggingface library.

We might decide to invest more in this in the future but for now, I’d rather keep it simple since we would have no time to properly maintain it

muellerzr · April 7, 2020, 7:24pm

Is there a quick way to get the current rules applied on the Spacy tokenizer? IE in the old version we could do:

tok = data_lm.train_ds.x.processor[0]

Then we could further:

tok.tokenizer

And it would tell us the rules applied like so:


Tokenizer SpacyTokenizer in en with the following rules:
 - fix_html
 - replace_rep
 - replace_wrep
 - spec_add_spaces
 - rm_useless_spaces
 - replace_all_caps
 - deal_caps

I see that we have a data_lm.rules however it’s not very english

[<function fastai2.text.core.fix_html>,
 <function fastai2.text.core.replace_rep>,
 <function fastai2.text.core.replace_wrep>,
 <function fastai2.text.core.spec_add_spaces>,
 <function fastai2.text.core.rm_useless_spaces>,
 <function fastai2.text.core.replace_all_caps>,
 <function fastai2.text.core.replace_maj>,
 <function fastai2.text.core.lowercase>]

(But if that’s intentional then it’s okay!)

sgugger · April 7, 2020, 9:26pm

There is no nice repr anywhere no (though [r.__name__ for r in dls_lm.rules] will give you something more English).
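
A quick illustration of that list comprehension, using two stand-in functions in place of the real rules (which live in `fastai2.text.core`). The function bodies here are invented for the example.

```python
def fix_html(x):
    "Stand-in rule for illustration only"
    return x.replace("<br />", "\n")

def lowercase(x):
    "Stand-in rule for illustration only"
    return x.lower()

# Plays the role of dls_lm.rules: a plain list of functions.
rules = [fix_html, lowercase]

# Functions print as <function ...> objects, but each carries its name.
names = [r.__name__ for r in rules]
print(names)
```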

muellerzr · April 8, 2020, 4:06pm

I wrote a quick function to make it readable, though I’m unsure how it could be made a bit better (or whether people would prefer it formatted a bit differently):

def print_rules(dls):
  "Prints out current rules of `Tokenizer`"
  print(f"{dls.tokenizer[0].__doc__} with the following rules\n")
  [print(f"{r.__name__, r.__doc__}") for r in dls.rules]

Here’s the output:

Spacy tokenizer for `lang` with the following rules

('fix_html', "Various messy things we've seen in documents")
('replace_rep', 'Replace repetitions at the character level: cccc -- TK_REP 4 c')
('replace_wrep', 'Replace word repetitions: word word word word -- TK_WREP 4 word')
('spec_add_spaces', 'Add spaces around / and #')
('rm_useless_spaces', 'Remove multiple spaces')
('replace_all_caps', 'Replace tokens in ALL CAPS by their lower version and add `TK_UP` before.')
('replace_maj', 'Replace tokens in ALL CAPS by their lower version and add `TK_UP` before.')
('lowercase', 'Converts `t` to lowercase')

Let me know if this is interesting and I’ll test it on SentencePiece. (It also looks like there may be a few mixups in the docstrings that need to get fixed; see replace_all_caps vs replace_maj. I also couldn’t find an easy way to get the tokenizer’s language.)