Unable to decode items

dangraf · June 10, 2020, 2:07pm

I’m trying to create a Transform for my y-block in the data loader. The encodes function works fine and is called correctly, but the decodes function is never called.

It decodes correctly when I call it separately, but not when it’s used with the data loader.

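from fastai.vision.all import *  # Transform, DataBlock, ImageBlock, Resize, show_title, np
from fastai.text.all import *    # TensorText

class CustomText(str):
    """Helper class so that the label of the data can be shown"""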
    def show(self, ctx=None, **kwargs): 
        return show_title(self, ctx=ctx, **kwargs)


class CustomTokenizer(Transform):
    """Converts characters to numbers and vice versa"""
    def __init__(self, df, char_limit=100, str_max_len=12):
        self.df = df
        self.str_max_len = str_max_len+2 # need to add start and stop
        self.tokenstats = dict()   
        df['text'].apply(self.count_letters)
        

        chars = [c for c in sorted(self.tokenstats.keys()) if self.tokenstats[c] > char_limit]
        all_chars = ['#pad#','#stop#','#unk#'] + chars
        self.o2i = {c:i for i, c in enumerate(all_chars)}
        self.vocab = {self.o2i[c]:c for c in self.o2i.keys()}

    def count_letters(self, st):
        for c in st:
            n = self.tokenstats.get(c, 0) +1
            self.tokenstats[c]=n
            
    def encodes(self, x:CustomText):
        print('encodes')
        tokens = np.array([self.o2i.get(c, self.o2i['#unk#']) for c in x])
        # one leading slot, then pad on the right up to the fixed length
        tokens = np.pad(tokens, pad_width=(1, self.str_max_len-len(tokens)-1), constant_values=self.o2i['#pad#'])
        tokens[len(x)+1] = self.o2i['#stop#']
        # how to add endchar?
        return TensorText(tokens)

    def decodes(self, x):
        print('decodes')
        encoded = [self.vocab.get(n, '#unk#') for n in x.cpu().detach().numpy() if n != self.o2i['#pad#'] and n!= self.o2i["#stop#"]]
        return CustomText(''.join(encoded))

class Yblock(): 
    def __init__(self,df):
        self.df= df
    def __call__(self):   
        ltok = CustomTokenizer(self.df)
        return TransformBlock(item_tfms=[ltok])

yblock = Yblock(df)
data = DataBlock(blocks=(ImageBlock, yblock),
                 get_items=get_items,
                 get_x=get_x, get_y=get_y,
                 item_tfms=[Resize((80,224), method='pad', pad_mode='border')],
                 splitter=RandomSplitter())

dls = data.dataloaders(df,bs=4)
batch = dls.train.one_batch()
decoded = dls.train.decode(batch)

The decoded batch is of type TensorText, but I expect it to be of type “str”, and the decodes function never prints “decodes”.
The data.summary() function does not throw any errors.

How do I get decodes to be called correctly? Or does anyone have ideas on how to debug this problem?
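
One way to debug this is to check the encode/decode round trip in isolation, with no data loader involved. The sketch below reproduces the scheme from the post (one leading pad slot, the characters, a stop token, then padding) in plain Python. The function names build_vocab, encode and decode are made up for illustration, and char_limit defaults to 0 here so that the tiny example vocabulary is not filtered out.

```python
PAD, STOP, UNK = "#pad#", "#stop#", "#unk#"

def build_vocab(texts, char_limit=0):
    """Count characters and keep those seen more than char_limit times."""
    counts = {}
    for text in texts:
        for ch in text:
            counts[ch] = counts.get(ch, 0) + 1
    chars = [c for c in sorted(counts) if counts[c] > char_limit]
    o2i = {c: i for i, c in enumerate([PAD, STOP, UNK] + chars)}
    i2o = {i: c for c, i in o2i.items()}
    return o2i, i2o

def encode(text, o2i, str_max_len=12):
    """Map characters to ids; assumes len(text) <= str_max_len."""
    total = str_max_len + 2  # room for the leading slot and the stop token
    ids = [o2i.get(c, o2i[UNK]) for c in text]
    out = [o2i[PAD]] + ids + [o2i[STOP]]
    return out + [o2i[PAD]] * (total - len(out))

def decode(ids, o2i, i2o):
    """Map ids back to characters, dropping pad and stop tokens."""
    skip = {o2i[PAD], o2i[STOP]}
    return "".join(i2o.get(i, UNK) for i in ids if i not in skip)

o2i, i2o = build_vocab(["ABC123", "XYZ789"])
ids = encode("ABC789", o2i)
print(len(ids), decode(ids, o2i, i2o))
```

If the round trip works here but not through the data loader, the problem is in how decoding gets invoked rather than in the tokenizer logic.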

muellerzr · June 10, 2020, 2:23pm

To decode a batch you should use decode_batch

Its format is:

learn.dls.decode_batch((*tuplify(batch[0]), *tuplify(batch[1])))

(where batch[0] is input and batch[1] is output)
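
A likely reason the original call never reached decodes: as far as I understand fastai, dls.decode only reverses the batch-level transforms, while decode_batch also splits the batch into samples and runs the item-level decodes on each one. The toy below mimics that split in plain Python. ToyLoader, item_decode and batch_decode are hypothetical names for illustration, not fastai code.

```python
class ToyLoader:
    """Hypothetical stand-in for a data loader; not fastai's implementation."""
    def __init__(self, item_decoders, batch_decoders):
        self.item_decoders = item_decoders
        self.batch_decoders = batch_decoders

    def decode(self, batch):
        # Undo batch-level transforms only; item-level decoders are not called.
        for dec in reversed(self.batch_decoders):
            batch = dec(batch)
        return batch

    def decode_batch(self, batch):
        # Undo batch-level transforms, split into samples, then undo item-level ones.
        xs, ys = self.decode(batch)
        samples = list(zip(xs, ys))
        for dec in reversed(self.item_decoders):
            samples = [dec(s) for s in samples]
        return samples

calls = []  # records which decoders ran, like the print statements in the post

def item_decode(sample):
    calls.append("item")
    x, y = sample
    return x, "".join(chr(c) for c in y)  # token ids back to a string

def batch_decode(batch):
    calls.append("batch")
    xs, ys = batch
    return [x * 255 for x in xs], ys  # e.g. undo a scaling step on the inputs

loader = ToyLoader([item_decode], [batch_decode])
batch = ([0.0, 1.0], [[72, 105], [79, 107]])

partial_out = loader.decode(batch)
calls_after_decode = list(calls)   # only the batch-level decoder has run
full = loader.decode_batch(batch)  # labels are now strings
print(calls_after_decode, full)
```

In this toy, decode leaves the labels as lists of ids, which matches the symptom in the question: the output was still a TensorText and "decodes" was never printed.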

dangraf · June 10, 2020, 2:49pm

Great, it works now.
One more question popped up to gain better understanding.
If I use Categorize as the y-block, no type annotations seem to be used in either the encodes or the decodes function, and only the y-part is encoded/decoded. But in my case I need to use type annotations in both encodes and decodes, or it will also try to decode the images. My question is: how do I know when to use type annotations and when not to? Is it possible to solve this problem without the helper class TensorText?
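
A possible explanation, as far as I can tell: Categorize is passed to its block as a type transform, which runs only on the y pipeline, so it never sees the image. A transform passed as item_tfms runs on the whole (x, y) tuple, so the type annotation is what keeps it away from the image. The plain-Python sketch below imitates that dispatch. ToyTransform, Image and Label are hypothetical stand-ins, not fastai's Transform or its types.

```python
class ToyTransform:
    """Hypothetical type-dispatched transform; not fastai's Transform."""
    def encodes(self, x):
        return x

    def __call__(self, item):
        # On a tuple, try every element; otherwise act on the single item.
        if isinstance(item, tuple):
            return tuple(self._apply(o) for o in item)
        return self._apply(item)

    def _apply(self, o):
        # Skip elements that do not match the annotation on `x`, if there is one.
        wanted = type(self).encodes.__annotations__.get("x")
        if wanted is not None and not isinstance(o, wanted):
            return o
        return self.encodes(o)

class Image(list): pass
class Label(str): pass

class Untyped(ToyTransform):
    def encodes(self, x):
        return len(x)

class Typed(ToyTransform):
    def encodes(self, x: Label):
        return len(x)

sample = (Image([1, 2, 3]), Label("ABC123"))

untyped_on_tuple = Untyped()(sample)     # touches the image as well
typed_on_tuple = Typed()(sample)         # leaves the image alone
untyped_on_label = Untyped()(sample[1])  # fine when it only ever sees y
print(untyped_on_tuple, typed_on_tuple, untyped_on_label)
```

Under this reading, annotations are needed whenever a transform is applied to the full tuple, and can be omitted when it is applied to one element's pipeline only.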

Tendo · June 10, 2020, 4:16pm

If I’m not mistaken, the CustomText and TensorText you added are to add the .show method to your output. IIRC encodes expects inputs with a show method. Out of curiosity, does using the TensorText type in the encodes declaration instead of CustomText work?

dangraf · June 10, 2020, 5:26pm

yes, it works perfectly

Tendo · June 10, 2020, 5:37pm

Awesome…it shows that encodes needs a type with a show method