How does "attrgetter" work in a pipeline?

Trying to wrap my head around this line of code …

[attrgetter('text'), Tokenizer.from_df(txt_cols), Numericalize(vocab=lm_vocab)]

Typically, when I read the transforms pipeline, I read it as, “apply this transform, and then this transform on the result of the previous transform, etc…”

But, attrgetter('text') operates on the result of Tokenizer.from_df(txt_cols) … rather than vice-versa. It feels funny/odd to me.

Is there a better way to read the above pipeline?

You should look at the ‘order’ attribute for each. This dictates when each transform is applied: something with an order of 1 is done first, 99 is last. This should help some. Tell us what you find out by looking there :wink:
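
For example, something like this (just a sketch, assuming Tokenizer, Numericalize, and Transform are already in scope, e.g. via from fastai2.text.all import *) will print the order of each step:

from operator import attrgetter

# Plain callables like attrgetter get wrapped in a Transform when the
# Pipeline is built, so check the order of the wrapped version as well
print(Tokenizer.order)
print(Numericalize.order)
print(Transform(attrgetter('text')).order)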

Tokenizer order = 0
Numericalize order = 0
Transform(attrgetter('text')).order = 0

Now that raises a particular question, one I almost wanted to ask myself: if we have ‘n’ transforms with the same order, how is their execution order decided? Is it the order in which they are declared? Or how does the library handle such a case? (Excellent question @wgpubs :slight_smile: )

I know that with images, transforms sharing the same order are handled by TypeDispatch (i.e. if I had transforms that went Image -> Points -> X and they all had the same order, whichever one matches the current type is applied before moving on to the next).
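
Something like this toy transform (just a sketch; assumes Transform is in scope, e.g. via from fastai2.text.all import *) shows the type-dispatch mechanism I mean: each encodes only fires when the input matches its annotated type.

class Scale(Transform):
    # Two encodes on one Transform: TypeDispatch picks which one runs
    # based on the type of the incoming item.
    def encodes(self, x:int): return x * 2
    def encodes(self, x:str): return x.upper()

t = Scale()
print(t(3))      # -> 6   (int encodes)
print(t('hi'))   # -> HI  (str encodes)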

Oh! @wgpubs. I was playing with this feature today actually. Run dblock.summary() and provide an input to use. This will show exactly what is being called and when! If you need help setting it up, tell me; I used it earlier for points.
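
Roughly like this (just a sketch; dblock stands for whatever DataBlock you've defined and source for whatever you'd normally pass to dblock.dataloaders(...)):

# summary() narrates one item as it passes through every transform in the block
dblock.summary(source)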

Ah cool … I didn’t know that existed!

I’m not using DataBlocks though … building the Datasets straight-up …

splits = RandomSplitter()(df)
x_tfms = [attrgetter("text"), Tokenizer.from_df(text_cols), Numericalize(vocab=lm_vocab)]
dsrc = Datasets(df, 
    splits=splits, 
    tfms=[x_tfms, [attrgetter("label"), Categorize()]], 
    dl_type=SortedDL)

@wgpubs yes, it’s meant for the DataBlock only. You could most likely use parts of its code though to help guide you through debugging. Specifically here:

  x = dsets.train[0]
  for f in dls.train.after_item:
    name = f.name
    x = f(x)
    print(x, name)

What this will essentially do is grab an item from your dataset and apply the transforms from your dataloaders to it. This could also easily be refactored to work with one item :slight_smile: (and your own dataset)
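
For instance, a rough one-item version could look like this (a sketch; trace_item, raw_item, and pipeline are made-up names, and pipeline is any list of transforms you want to step through):

def trace_item(raw_item, pipeline):
    # Walk one raw item through a list of transforms by hand, printing the
    # transform and the type of its output at each step.
    x = raw_item
    for f in pipeline:
        x = f(x)
        print(type(f).__name__, '->', type(x).__name__)
    return x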

What exactly do I need to pass as a source?

splits = RandomSplitter()(joined_df)
x_tfms = [attrgetter("text"), Tokenizer.from_df(txt_cols), Numericalize(vocab=lm_vocab)]
dsrc = Datasets(joined_df, splits=splits, tfms=[x_tfms, [attrgetter("stars"), Categorize()]], dl_type=SortedDL)

dbunch = dsrc.dataloaders(bs=8)

x = dsrc.train[0]
for f in dbunch.train.tfms:
    print(f)

returns

Pipeline: (#3) [
Transform: True (object,object) -> attrgetter ,
Tokenizer: True (str,object) -> encodes (Path,object) -> encodes ,
Numericalize: True (object,object) -> encodes (object,object) -> decodes
]

Pipeline: (#2) [
Transform: True (object,object) -> attrgetter ,
Categorize: True (Tabular,object) -> encodes
                 (object,object) -> encodes 
                 (Tabular,object) -> decodes
                 (object,object) -> decodes
]

@wgpubs now apply it to an item in your datasource, i.e.:

x = dsrc.train[0]
for f in dbunch.train.after_item:
  name = f.name
  x = f(x)

(you can do dbunch.train.after_item to see what those actually are)

It would depend on how you’re using it (and what your dataloaders expect to get). Since mine work on image files, I passed in a source folder, similar to how I make my .databunch

x = dsrc.train[0]
for f in dbunch.train.after_item:
    #name = f.name
    x = f(x)
    print(x)

f.name throws an exception, and there is only one thing in after_item, with print(x) returning:

(TensorText([    2,     4,    25,     8,   124,    21,    35,    14,  2518,   611,
           41, 11360,   161,    11,    17,    16,   519,   292,     9,     8,
           17,    16,    37,   421,    81,     0,   161,    75,    92,    18,
           10,   122,   609,    18,   487,   464,   980,    12,    98,    89,
           66,    10,   312,    76,    19,  1055,   217,     9,     8,   122,
          259,    32,   298,    29,    19,    89,    18,    23,  1393,     8,
          595,  3673, 10605,    29,  1168,     9,     8,    10,   474,   210,
           19,   238,    12,    11,   118,    15,   215,  3809,    43,   513,
            9,     8,    74,  6446,   161,    15,  5534,   161,    11,    94,
           74,  9126,   573,    15,  7798,   573,    12,   631,   116,   784,
          116,  2200,    19,    93, 11711,     9,     8,   349,    19,    93,
            0,   928,   382,     9,     8,    45,   254,    19,    47,     9,
            8,   580,   547,    19,   582,     9,     8,    84,   537,    12,
           33,   313,   135,  1611,    21,    10,   611,     9,     4,    23,
            8,  1145,     8,   595,     8,   889]), TensorCategory(2))

Hmmmm. Would it be possible to try doing this in the mid-level API just so we can see what this will give us? (I can do this on IMDB sample later tonight) :slight_smile: (I’m figuring this out as you do :slight_smile: )

Sounds good. I’ll keep playing myself and report back if I figure out what is going on first.

I think the reason we’re not seeing anything in the for loop is that all the item transforms have already run when creating the Datasets object. Thus …

dsrc.train[0]

returns

(TensorText([    2,     4,    25,     8,   124,    21,    35,    14,  2518,   611,
            41, 11360,   161,    11,    17,    16,   519,   292,     9,     8,
            17,    16,    37,   421,    81,     0,   161,    75,    92,    18,
            10,   122,   609,    18,   487,   464,   980,    12,    98,    89,
            66,    10,   312,    76,    19,  1055,   217,     9,     8,   122,
           259,    32,   298,    29,    19,    89,    18,    23,  1393,     8,
           595,  3673, 10605,    29,  1168,     9,     8,    10,   474,   210,
            19,   238,    12,    11,   118,    15,   215,  3809,    43,   513,
             9,     8,    74,  6446,   161,    15,  5534,   161,    11,    94,
            74,  9126,   573,    15,  7798,   573,    12,   631,   116,   784,
           116,  2200,    19,    93, 11711,     9,     8,   349,    19,    93,
             0,   928,   382,     9,     8,    45,   254,    19,    47,     9,
             8,   580,   547,    19,   582,     9,     8,    84,   537,    12,
            33,   313,   135,  1611,    21,    10,   611,     9,     4,    23,
             8,  1145,     8,   595,     8,   889]), TensorCategory(2))

I think so, unless it’s resolved by another strategy like the one @muellerzr described

You can see from the definition of Pipeline here that it calls L.sorted(key='order'):

self.fs = L(ifnone(funcs,[noop])).map(mk_transform).sorted(key='order')

And here you can see that L just calls the built-in Python sorted function:

return self._new(sorted(self.items, key=k, reverse=reverse))

sorted is stable, so transforms with the same order keep the order in which they were declared.
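
A quick stdlib-only illustration of that stability (the names and orders below just mimic the pipeline above):

# All three share order 0, so sorting by order leaves the declared sequence untouched
tfms = [('attrgetter', 0), ('Tokenizer', 0), ('Numericalize', 0)]
print(sorted(tfms, key=lambda t: t[1]))
# -> [('attrgetter', 0), ('Tokenizer', 0), ('Numericalize', 0)]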

I think @lgvaz is right … it is executed in order.

The problem seems to be with attrgetter when multiple attributes are used. Consider this:

my_df = joined_df.head(1)
tfms=[[attrgetter('text'), Tokenizer.from_df(['business_name', 'text']), Numericalize(vocab=lm_vocab)], Categorize()]                          
for idx, t in enumerate(L(tfms[0])):
    print(type(t))
    x = t(my_df)
    print(type(x))
    print('')

# returns
# <class 'operator.attrgetter'>
# <class 'pandas.core.series.Series'>

# <class 'fastai2.text.core.Tokenizer'>
# <class 'pandas.core.frame.DataFrame'>

# <class 'fastai2.text.data.Numericalize'>
# <class 'fastai2.text.data.TensorText'>

and this

my_df = joined_df.head(1)
tfms=[[attrgetter('business_name', 'text'), Tokenizer.from_df(['business_name', 'text']), Numericalize(vocab=lm_vocab)], Categorize()]                          
for idx, t in enumerate(L(tfms[0])):
    print(type(t))
    x = t(my_df)
    print(type(x))
    print('')

# returns
# <class 'operator.attrgetter'>
# <class 'tuple'>

# <class 'fastai2.text.core.Tokenizer'>
# <class 'pandas.core.frame.DataFrame'>

# <class 'fastai2.text.data.Numericalize'>
# <class 'fastai2.text.data.TensorText'>

Thus, you get an error when attempting to use multiple columns for your text …

splits = RandomSplitter()(joined_df)
x_tfms = [attrgetter("business_name","text"), Tokenizer.from_df(["business_name","text"]), Numericalize(vocab=lm_vocab)]
dsrc = Datasets(joined_df, splits=splits, tfms=[x_tfms, [attrgetter("stars"), Categorize()]], dl_type=SortedDL)

Because attrgetter in this case returns a tuple rather than a Series that Tokenizer can work with.

SOLVED: (I think)

attrgetter is a callable that appears to act on the results from the following Transform.

Consider this …

sample_item_df = joined_df.head(1)
f = attrgetter('text', 'business_name')
f(sample_item_df)

… returns a tuple of Series objects after acting on sample_item_df

Now consider this …

splits = RandomSplitter()(joined_df)
x_tfms = [attrgetter("text"), Tokenizer.from_df(txt_cols), Numericalize(vocab=lm_vocab)]
dsrc = Datasets(joined_df, splits=splits, tfms=[x_tfms, [attrgetter("stars"), Categorize()]], dl_type=SortedDL)

… here attrgetter acts on the results from the next Transform, which in this case is a Tokenizer transform that takes the txt_cols in the DataFrame and tokenizes them into the attribute text. Thus the pipeline can be read like this:

"Take the ‘text’ attribute created in the process of tokenizing the ‘txt_cols’ columns and numericalize it"

Just in case it wasn’t clear: attrgetter is from the Python stdlib
https://docs.python.org/2/library/operator.html
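
A minimal stdlib-only example (the Review class here is made up for illustration):

from operator import attrgetter

class Review:
    text = 'great tacos'
    stars = 5

f = attrgetter('text')            # f(obj) is equivalent to obj.text
print(f(Review()))                # -> great tacos

g = attrgetter('text', 'stars')   # multiple names return a tuple
print(g(Review()))                # -> ('great tacos', 5)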

So is my explanation of how it works in the tfms pipeline accurate?

If so, then how should I read what happens with our targets here: [attrgetter("stars"), Categorize()]?

Since I’m on a book deadline, that’s all I’m saying for now :wink:
