Is it expected for MultiCategorize to encode str/tuple?

riven314 · May 13, 2021, 10:36am

I'm not sure whether this is a bug, so I'm posting it here rather than on GitHub. I ran into a situation where MultiCategorize returns unexpected output when the input is a str or a tuple.

My labels are strings of ids. Feeding in a list of these string ids works perfectly fine. But when I feed in a single string id, it treats the str as an iterable, iterates over each character, and tries to encode each character. The same happens when I feed in a tuple containing a string id. As a result, I get wrong outputs when I input a str/tuple.

The example below illustrates my point:

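from fastai.data.transforms import MultiCategorize

_vocab = list(map(str, range(24)))
_tfms = MultiCategorize(vocab=_vocab)

# expected behavior
_tfms(['21'])
>> TensorMultiCategory([21])

# unexpected behavior
_tfms('21')
>> TensorMultiCategory([2, 1])

# unexpected behavior
# (note: in Python ('21') is just the str '21'; a one-element tuple is written ('21',))
_tfms(('21'))
>> TensorMultiCategory([2, 1])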

Looking through the docs, MultiCategorize typically expects a list, or one of its subclasses, as input. I think it may cause confusion if it silently accepts a str/tuple, because it may return wildly different outputs, as shown above.
In this case, would it be better to raise a warning/error when the user feeds in a tuple/str, rather than silently passing it through?
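
To make the suggestion concrete, here is a minimal sketch of the kind of check I have in mind. `ensure_label_list` is a hypothetical helper, not part of fastai, and it only covers the bare-str case:

```python
def ensure_label_list(labels):
    """Hypothetical guard (not part of fastai): reject a bare str before encoding."""
    if isinstance(labels, str):
        # A str is iterable, so it would otherwise be split into characters.
        raise TypeError(
            f"Expected a list of labels, got a single str {labels!r}; "
            f"wrap it in a list, e.g. [{labels!r}]"
        )
    return list(labels)
```

With this, `ensure_label_list(['21'])` passes the list through, while `ensure_label_list('21')` fails loudly instead of being encoded character by character.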

marii · May 13, 2021, 12:21pm

https://github.com/fastai/fastcore/blob/2f9f31feaae7d3c687711f115f1372dd43e05747/fastcore/transform.py#L85


    def __repr__(self): return f'{self.name}:\nencodes: {self.encodes}decodes: {self.decodes}'


    def setup(self, items=None, train_setup=False):
        train_setup = train_setup if self.train_setup is None else self.train_setup
        return self.setups(getattr(items, 'train', items) if train_setup else items)


    def _call(self, fn, x, split_idx=None, **kwargs):
        if split_idx!=self.split_idx and self.split_idx is not None: return x
        return self._do_call(getattr(self, fn), x, **kwargs)


    def _do_call(self, f, x, **kwargs):
        if not _is_tuple(x):
            if f is None: return x
            ret = f.returns(x) if hasattr(f,'returns') else None
            return retain_type(f(x, **kwargs), x, ret)
        res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
        return retain_type(res, x)


add_docs(Transform, decode="Delegate to <code>decodes</code> to undo transform", setup="Delegate to <code>setups</code> to set up transform")



The code above is what determines this behavior. fastai treats tuples as a “special” class: _do_call recursively looks through a tuple and applies the transform to each element, which is how it finds ‘21’. It then treats ‘21’ as a list of ‘2’ and ‘1’.
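
The same effect can be reproduced without fastai. This is only a rough sketch: `encode_labels` and `apply_recursively` are made-up stand-ins, the real `_is_tuple` and `retain_type` do more than this, and plain lists replace `TensorMultiCategory`.

```python
def encode_labels(labels, vocab):
    # Stand-in for the encoding step: look up each element of `labels` in `vocab`.
    # A str is iterable, so '21' is visited as '2' and then '1'.
    return [vocab.index(label) for label in labels]

def apply_recursively(f, x):
    # Simplified imitation of the tuple branch in `_do_call`:
    # a tuple is unpacked and each element is processed separately.
    if isinstance(x, tuple):
        return tuple(apply_recursively(f, x_) for x_ in x)
    return f(x)

vocab = list(map(str, range(24)))

def encode(labels):
    return encode_labels(labels, vocab)

print(apply_recursively(encode, ['21']))                # [21]
print(apply_recursively(encode, '21'))                  # [2, 1]
print(apply_recursively(encode, ('21',)))               # ([2, 1],)
print(apply_recursively(encode, (['21', '3'], ['8'])))  # ([21, 3], [8])
```

The list is encoded as a whole, the str is split into characters, and the tuple is unpacked first so that each of its elements goes through the same encoding.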

This functionality is actually undocumented here, but it shows up throughout the core of fastai. It would be good to add it to the docs. Below is an example of applying multiple categories at once, such as for a model that outputs more than one prediction.

_vocab=list(map(str, range(24)))
_tfms = MultiCategorize(vocab=_vocab)
_tfms((['21','3','9'],['21','11','8']))
#>>> (TensorMultiCategory([21,  3,  9]), TensorMultiCategory([21, 11,  8]))

Or if we want to model more complex labels:

_vocab=list(map(str, range(34)))
_tfms = MultiCategorize(vocab=_vocab)
_tfms((['11'],(['21'],['22'],((['31'],),))))

#>>> (TensorMultiCategory([11]),
#     (TensorMultiCategory([21]),
#      TensorMultiCategory([22]),
#      ((TensorMultiCategory([31]),),)))