How to add tfms to an existing Pipeline (working on Mixed Tabular+Text)

So …

I’m working on updating my article/code on building mixed tabular+text datasets to v.2 and I’m trying to add the text tfms to the existing Pipeline built by the Tabular class.

So far I have this:

cat_names = ['business_id', 'user_id', 'business_stars', 'business_postal_code', 'business_state']
cont_names = ['useful', 'user_average_stars', 'user_review_count', 'business_review_count']
text_names = ['text', 'business_name']

dep_var = 'stars'

lm_vocab = pickle.load(open("vocab.pkl", "rb"))
text_tfms = [Tokenizer.from_df(text_cols=text_names), Numericalize(vocab=lm_vocab)]
tab_procs = [FillMissing, Categorify, Normalize]

@delegates(Tabular)
class MixedTabularPandas(TabularPandas):
    def __init__(self, df, text_names=None, text_tfms=None, vocab=None, **kwargs):
        super().__init__(df, **kwargs)
        self.text_names, self.text_tfms, self.vocab = L(text_names), text_tfms, vocab
        
        self.procs += L(self.text_tfms)
        
    @property
    def all_col_names (self): 
        return self.cat_names + self.cont_names + self.text_names + self.y_names

But this line self.procs += L(self.text_tfms) doesn’t seem to accomplish that (I tried to use pipeline.add() and that didn’t work either).

Any ideas?

A Tabular object is designed to deal with tabular data only. For mixed tabular + text, I think you will need to use the item transforms that are defined at the end of the tabular.core notebook (we have not kept that part up to date for now).

Are you talking about the section labeled Not being used now - for multi-modal?

Yup, I am.

What’s interesting if I do this it kinda works …

lm_vocab = pickle.load(open("vocab.pkl", "rb"))
text_tfms = [Tokenizer.from_df(text_cols=text_names), Numericalize(vocab=lm_vocab)]
tab_procs = [FillMissing, Categorify, Normalize]

procs = tab_procs + text_tfms

mtp = MixedTabularPandas(joined_df, text_names, text_tfms, lm_vocab, procs=procs,
                         cat_names=cat_names, cont_names=cont_names, 
                         y_names=dep_var, block_y=CategoryBlock,
                         splits=RandomSplitter()(range_of(joined_df)))

I can see the tokenized text when I do mtp.show(max_n=2) which looks like this for a given row in the DataFrame:

[xxbos, xxfld, 1, i, liked, my, xxmaj, lamb, burger.my, dad, ordwred, fish, and, chips, and, they, were, very, ordinary, ., xxmaj, nothing, special.i, guess, i, was, expecting, outstanding, ,, when, it, comes, to, xxmaj, gordon, xxmaj, ramsey, xxfld, 2, xxmaj, gordon, xxmaj, ramsay, xxmaj, burger]

What I still can’t figure out is how to ensure my Numericalize transform runs against the text field in my dataframe. Based on the above, it looks like the tokenization transform runs fine … but it doesn’t look like Numericalize is doing anything (else I’d expect to see a list of vocab indices).

Any ideas how I can fix this?
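For reference, this is roughly the kind of output I’d expect once numericalization actually runs. It is a toy sketch in plain Python, not fastai’s Numericalize: the numericalize helper, the tiny vocab and the unk_idx default are all made up for illustration.

```python
def numericalize(tokens, vocab, unk_idx=0):
    """Toy helper (not fastai's Numericalize): map tokens to their vocab positions."""
    stoi = {tok: i for i, tok in enumerate(vocab)}
    # Tokens missing from the vocab fall back to the unknown index.
    return [stoi.get(tok, unk_idx) for tok in tokens]


# Made-up vocab for illustration only; the real one comes from vocab.pkl.
vocab = ["xxunk", "xxpad", "xxbos", "i", "liked", "my"]
ids = numericalize(["xxbos", "i", "liked", "my", "burger"], vocab)
print(ids)  # [2, 3, 4, 5, 0]
```

So instead of a list of tokens, each text cell would show up as a list of integers.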

I’ll take a look at your code. I feel like I have an approach that is close to working but still trying to come up to speed with all the fastai2 bits after watching the walk-thrus.

-wg

Like I said, you can’t use tabular objects for this. They create the batches directly from the dataframe, so your transforms are never called (only their setup method is called, which is why you see the tokenization happening).
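A toy illustration of that behavior in plain Python, not the actual Tabular code: ToyLowercase and ToyTable are hypothetical stand-ins. The container calls setup once when it is built, then slices batches straight out of its rows, so the per-item encodes hook is never reached.

```python
class ToyLowercase:
    """Hypothetical transform that records which of its methods were called."""
    def __init__(self):
        self.calls = []

    def setup(self, rows):
        # Runs once over the whole table when the container is built.
        self.calls.append("setup")
        for row in rows:
            row["text"] = row["text"].lower()

    def encodes(self, row):
        # Per-item hook; a container that batches from its table never calls it.
        self.calls.append("encodes")
        return row


class ToyTable:
    """Hypothetical stand-in for a container that batches directly from its rows."""
    def __init__(self, rows, procs):
        self.rows = [dict(r) for r in rows]
        for proc in procs:
            proc.setup(self.rows)

    def one_batch(self, bs):
        # Rows are sliced out as stored; no per-item transform is applied.
        return self.rows[:bs]


tfm = ToyLowercase()
table = ToyTable([{"text": "Great Burger"}, {"text": "Very Ordinary"}], procs=[tfm])
batch = table.one_batch(bs=2)
print(tfm.calls)         # ['setup']
print(batch[0]["text"])  # great burger
```

The effect of setup is visible in the batch, while anything that only happens in encodes is not.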

I got things to work for just the tabular bits with minimal modifications to your code …

class TensorTabular(Tuple):
    
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]

    def display(self, ctxs): 
        display_df(pd.DataFrame(ctxs))

        
class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): 
        return self if ctx is None else ctx.append(self)

    
class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc

    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names, self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())
    
    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        
        return TabularLine(pd.Series({ c: v for v,c in zip(to.items[0]+to.items[1], 
                                                          self.proc.cat_names+self.proc.cont_names) }))

    
class ReadTabTarget(ItemTransform):
    def __init__(self, proc): 
        self.proc = proc
        
    def encodes(self, row): 
        return tensor(row[self.proc.y_names].astype(np.int64))
    
    def decodes(self, o): 
        return Category(self.proc.classes[self.proc.y_names][o])

and then …

ds = DataSource(tp.items, tfms=[[ReadTabLine(tp)], ReadTabTarget(tp)])
dbunch = ds.databunch(bs=4)
b = dbunch.one_batch()

print(b[0], b[1])

returns

((tensor([[   0,    0,    4,    0,   12,    1,    1,    1,    1],
          [2279, 6938,    8,   96,    7,    1,    1,    1,    1],
          [1723, 6945,    6,  234,    2,    1,    1,    1,    1],
          [2829, 4598,    6,  182,   11,    1,    1,    1,    1]]),
  tensor([[-0.3584,  1.5739, -0.3446, -0.4667],
          [ 0.9329, -0.0846, -0.0343, -0.0608],
          [-0.3584, -0.1719, -0.2837, -0.1837],
          [-0.1001,  0.0775, -0.3359, -0.2951]])),
 tensor([[4],
         [4],
         [4],
         [3]]))

What do I need to do in order to add “text” in addition to the tabular bits?

I’m thinking I may have to use the DataBlock API, convert your ReadTabLine and ReadTabTarget to TransformBlocks, and then have get_items return 4 things: cats, conts, text, and dep_var.

Does that sound right or am I still missing something?

Thanks much!

To add text, just add another list of type transforms, the same as in the IMDb sample (probably an attrgetter(col_name), Tokenizer.from_df(...), Numericalize()). You will have several inputs, but that’s okay, as long as you specify the value of n_inp properly.
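To make the n_inp part concrete, here is a minimal sketch in plain Python. The assumption is that the first n_inp elements of each sample are treated as inputs and the rest as targets; split_sample is a hypothetical helper, not a fastai function.

```python
def split_sample(sample, n_inp):
    """Hypothetical helper: split one sample tuple into (inputs, targets) at n_inp."""
    return tuple(sample[:n_inp]), tuple(sample[n_inp:])


# One sample with two inputs (tabular part, numericalized text) and one target.
sample = (("cats", "conts"), [2, 4, 25], 3)
inputs, targets = split_sample(sample, n_inp=2)
print(inputs)   # (('cats', 'conts'), [2, 4, 25])
print(targets)  # (3,)
```

With three transform lists (tabular, text, target), n_inp would be 2 under that assumption.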

Ok the below is working minus the pad_input transform.

How does one get the transforms to apply to just one of the inputs?

ds = DataSource(tp.items, 
                tfms=[[ReadTabLine(tp)], 
                      [attrgetter('text'), Tokenizer.from_df(txt_cols), Numericalize(vocab=lm_vocab)],
                      ReadTabTarget(tp)])

dbunch = ds.databunch(bs=4, before_batch=[pad_input])

throws exception: AttributeError: 'TensorTabular' object has no attribute 'new_zeros'

SOLVED:

ds = DataSource(tp.items, 
                tfms=[[ReadTabLine(tp)], 
                      [attrgetter('text'), Tokenizer.from_df(txt_cols), Numericalize(vocab=lm_vocab)],
                      ReadTabTarget(tp)])

dbunch = ds.databunch(bs=4, before_batch=[partial(pad_input,pad_fields=1)])
b = dbunch.one_batch()
print(b[0], b[1], b[2])

returns …

((tensor([[4903, 4930,    7, 1179,   10,    1,    1,    1,    1],
          [ 720, 2241,    8, 1371,    1,    1,    1,    1,    1],
          [6305,    0,    7,  368,    7,    1,    1,    1,    1],
          [4784, 4777,    8,  286,    2,    1,    1,    1,    1]]),
  tensor([[-9.4521e-02,  2.7762e-01, -1.0790e-01,  3.7826e-01],
          [ 1.4566e+00,  2.2791e-01,  1.7847e+00, -1.0984e-03],
          [-9.4521e-02, -9.5198e-02,  6.1761e-01,  4.5793e-01],
          [ 1.6400e-01, -1.9841e+00, -3.0863e-01, -4.2851e-01]])),
 tensor([[ 2,  4, 25,  ...,  1,  1,  1],
         [ 2,  4, 25,  ...,  1,  1,  1],
         [ 2,  4, 25,  ..., 96,  8,  0],
         [ 2,  4, 25,  ...,  1,  1,  1]]),
 tensor([[3],
         [3],
         [4],
         [4]]))

The datasource .show() and the databunch .show_batch() don’t work yet (working on that now) … but the above looks good for modeling.
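For anyone wondering what padding a single field amounts to, here is a rough sketch in plain Python. It is not fastai’s pad_input: pad_field is a hypothetical helper, and pad_idx=1 is an assumption chosen to match the trailing 1s in the text tensor above.

```python
def pad_field(samples, field_idx, pad_idx=1):
    """Hypothetical helper: right-pad the sequence at field_idx to a common length."""
    max_len = max(len(s[field_idx]) for s in samples)
    padded = []
    for s in samples:
        seq = list(s[field_idx])
        seq = seq + [pad_idx] * (max_len - len(seq))
        # Every other field of the sample is passed through untouched.
        padded.append(tuple(s[:field_idx]) + (seq,) + tuple(s[field_idx + 1:]))
    return padded


samples = [
    (("cats0", "conts0"), [2, 4, 25, 96, 8], 3),
    (("cats1", "conts1"), [2, 4, 25], 4),
]
padded = pad_field(samples, field_idx=1)
print(padded[1][1])  # [2, 4, 25, 1, 1]
```

Only the text field (index 1) is padded, so the tabular tuple at index 0 never has to support tensor methods such as new_zeros.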

And almost there …

class TensorTabular(Tuple):
    
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]

    def show(self, ctxs): 
        pdb.set_trace()
        display_df(pd.DataFrame(ctxs))

        
class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): 
#         return self if ctx is None else ctx.append(self)
        return display_df(pd.DataFrame.from_records([s.to_dict() for s in [self]]))

    
class ReadTabLine(ItemTransform):
    def __init__(self, tab_obj): self.tab_obj = tab_obj

    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.tab_obj.cat_names, self.tab_obj.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())
    
    def decodes(self, o):
        cats_d = { col_name: v.item() for v, col_name in zip(o[0], self.tab_obj.cat_names) }
        conts_d = { col_name: v.item() for v, col_name in zip(o[1], self.tab_obj.cont_names) }
        return TabularLine(pd.Series({**cats_d, **conts_d}))

    
class ReadTabTarget(ItemTransform):
    def __init__(self, tab_obj): 
        self.tab_obj = tab_obj
        
    def encodes(self, row): 
        return tensor(row[self.tab_obj.y_names].astype(np.int64))
    
    def decodes(self, o): 
        return self.tab_obj.procs[2].decodes(o)

ds = DataSource(tp.items, 
                tfms=[[ReadTabLine(tp)], 
                      [attrgetter('text'), Tokenizer.from_df(txt_cols), Numericalize(vocab=lm_vocab)],
                      ReadTabTarget(tp)])

dbunch = ds.databunch(bs=4, before_batch=[partial(pad_input,pad_fields=1)])
dbunch.show_batch(b)

and it looks like this currently:

BUT I’d rather it look like what I was able to produce in v.1 (see below) … any ideas on how/if I can do the same with v.2?

It’s weird that it looks like this as text should be inserted in the same dataframe (not necessarily what you want but still…)
To customize the show_batch behavior, use the type-dispatch system to write a new version of show_batch with x and y annotated with the types of your input/target tensors (see vision.data or text.data for examples of this).

Is there a way to overwrite/clean the @typedispatch dictionary???

I tried adding this to my notebook just to see if I could get the @typedispatch mechanism working:

@typedispatch
def show_batch(x: TensorText, y, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):
    pdb.set_trace()
    if ctxs is None: ctxs = get_empty_df(min(len(samples), max_n))
    samples = L((s[0].truncate(trunc_at),*s[1:]) for s in samples)
    ctxs = show_batch[object](x, y, samples, max_n=max_n, ctxs=ctxs, **kwargs)
    display_df(pd.DataFrame(ctxs))
    return ctxs

… but this code doesn’t get used (it uses the show_batch version in the framework).

Any ideas on how to fix (or maybe I’m just missing something with the typedispatch mechanism)?

x in this case is a tuple of (TensorTabular, TensorText) … but I can’t seem to formulate the type hint so that show_batch works. I tried this as well:

@typedispatch
def show_batch(x:Tuple[TensorTabular, TensorText], y, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):

… which returns the following error:
TypeError: 'type' object is not subscriptable

Sadly (and insanely!) the Python type system doesn’t support run-time analysis of generic types. So you’d either need to have some combination of types for x and y that matches, or create a subtype of Tuple for your x.

You can manually look up a type using []. Here’s an example:

Makes me love C# even more :slight_smile:

So I tried this just to see what is getting passed in as the x and y

@typedispatch
def show_batch(x, y, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):
    ...

This is what I get:

(Pdb) type(x)
<class 'tuple'>
(Pdb) type(y)
<class 'torch.Tensor'>
(Pdb) type(x[0])
<class '__main__.TensorTabular'>
(Pdb) type(x[1])
<class 'fastai2.text.data.TensorText'>

I tried to type my x as x:Tuple[TensorTabular, TensorText] which returns an error of TypeError: 'type' object is not subscriptable.

I think this might have something to do with how Tuple is defined in fastcore … which overrides, I think, Tuple as defined in the typing library.


from typing import Tuple as Tuple2
mytype = Tuple2[TensorTabular, TensorText]

@typedispatch
def show_batch(x:mytype, y, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):
    pdb.set_trace()

returns this error: Subscripted generics cannot be used with class and instance checks

Right - that’s why I’m saying you can’t do that.

You’ll need to create your own subclass of Tuple.
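A minimal illustration in plain Python of why the subclass route works where the generic annotation doesn’t. TextPair is a hypothetical name, and it subclasses the built-in tuple rather than fastcore’s Tuple.

```python
from typing import Tuple


class TextPair(tuple):
    """Hypothetical tuple subclass that gives a pair of inputs its own type."""


pair = TextPair(("tabular part", "text part"))

# A subscripted generic cannot be used in a run-time isinstance check.
try:
    isinstance(pair, Tuple[str, str])
    generic_check_works = True
except TypeError:
    generic_check_works = False

print(generic_check_works)         # False
print(isinstance(pair, TextPair))  # True
```

Because TextPair is a real class, anything that dispatches on the run-time type of x can match it, while it still behaves like an ordinary tuple.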

Ah … I understand a bit more now.

Is there any way I can take the below and essentially tell Datasets that "Hey, for show purposes, treat the first two things as my x represented by the custom Tuple"?

dsrc = Datasets(tp.items, 
                tfms=[[ReadTabLine(tp)],
                      [attrgetter('text'), Tokenizer.from_df(txt_cols), Numericalize(vocab=lm_vocab)],
                      [attrgetter('stars'), Categorize()]
                     ])

Or perhaps, is there a way I can just define my own show_batch, show, show_results methods for a specific Datasets and/or Dataloaders instance? One that wouldn’t attempt to use the type system to figure out which showX() method to use?

Hoping there is some simple way to accomplish this so as to take advantage of the DataBlock API. I like how the code above looks and how it works … seems like everything works nicely except for the various showX() methods.

Thoughts?

Actually, what I think should happen (or at least be an option), is for the show methods to be called on each of the items.

In this case, it would call the show_batch(x:TensorTabular...) method followed by the show_batch(x:TensorText...) method.

I kind of figured this out at least for show_batch (see below):

splits = RandomSplitter()(tp.items)

dsrc = Datasets(tp.items, 
                splits=splits,
                tfms=[[ReadTabLine(tp)],
                      [attrgetter('text'), Tokenizer.from_df(txt_cols), Numericalize(vocab=lm_vocab)],
                      [attrgetter('stars'), Categorize()]
                     ])

@delegates(TfmdDL)
class MixedTfmdDL(TfmdDL):
    def __init__(self, dataset, **kwargs):
        super().__init__(dataset, **kwargs)
        
    def show_batch(self, b=None, max_n=9, ctxs=None, show=True, **kwargs):
        
        if b is None: b = self.one_batch()
        if not show: return self._pre_show_batch(b, max_n=max_n)
        
        x, y, samples = self._pre_show_batch(b, max_n=max_n)

        for i in range(self.n_inp):
            show_batch(x[i], y, 
                       samples=L([samples.itemgot(i), samples.itemgot(2)]).zip(), 
                       ctxs=ctxs, max_n=max_n, **kwargs)

@typedispatch
def show_batch(x: TensorTabular, y, samples, ctxs=None, max_n=10, **kwargs):
    df = pd.DataFrame(samples.itemgot(0))[:max_n]
    df['stars'] = samples.itemgot(1)
    display_df(pd.DataFrame(df))
    
    return ctxs

dbunch = dsrc.dataloaders(bs=4, before_batch=[partial(pad_input,pad_fields=1)], dl_type=MixedTfmdDL)
dbunch.show_batch()

… displays:

There are some hacky and hard-coded assumptions in here, but it works. The tabular show bits aren’t exactly what you get if you use the tabular stuff on its own (show_batch of a Tabular object shows the decoded values).
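One of those hard-coded assumptions is the samples.itemgot(2) index for the target. A minimal sketch of slicing at n_inp instead, in plain Python with made-up sample values (per_input_samples is a hypothetical helper, not part of fastai):

```python
def per_input_samples(samples, n_inp):
    """Hypothetical helper: for each input position, pair that input with the targets."""
    views = []
    for i in range(n_inp):
        # Everything from n_inp onward is treated as the target part of the sample.
        views.append([(s[i],) + tuple(s[n_inp:]) for s in samples])
    return views


# Made-up decoded samples: (tabular row, text, target).
samples = [("tab0", "text0", 3), ("tab1", "text1", 4)]
views = per_input_samples(samples, n_inp=2)
print(views[0])  # [('tab0', 3), ('tab1', 4)]
print(views[1])  # [('text0', 3), ('text1', 4)]
```

Each entry of views could then be handed to the show_batch version matching that input’s type.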

Is there a better/more elegant way of doing this?

Open to improvements, of which I’m sure there are many.

In fact, I want to do it via data blocks. Something like: FloatBlock, StringBlock, DateBlock, TimeBlock, IntBlock, etc., and predefined block groups like TabularBlocks, ClassificationBlocks, RegressionBlocks and TimeSeriesBlocks. But currently I’m stuck on Windows … Maybe if I can get it working I’ll share it in the future…