Separator in textify, should a 'sep' attribute be added to Vocab class

The Vocab Class has a textify method that is defined as follows :

def textify(self, nums:Collection[int], sep=' ') -> List[str]:
    "Convert a list of `nums` to their tokens."
    return sep.join([self.itos[i] for i in nums]) if sep is not None else [self.itos[i] for i in nums]

this allow the user to specify a separator that is different than a space for the reconstruction of the text from the numericalized version. For instance, in a character LM, where space is just a character among the others the empty separator makes more sense. Is is then possible to call vocab.textify(text, sep='') to get the desired output. However, when calling the get method from the TextList, the sep keyword is not passed to textify :

def get(self, i):
        o = super().get(i)
        return Text(o, self.vocab.textify(o)) 

As a result, the texts is printed out wrong if you don’t want a space separator. One way around this is to subclass the Vocab Class, or to override the textify method to choose your default separator.

What could be done as well is to add a sep attribute to a Vocab class which will be called by default by textify. I feel like this is the good place to put it since the separator will mostly depend on the vocab(at least I think so). The Vocab source would look like this :

class Vocab():
    "Contain the correspondence between numbers and tokens and numericalize."
    def __init__(self, itos:Collection[str], sep:str=' '):
        self.itos = itos
        self.stoi = collections.defaultdict(int,{v:k for k,v in enumerate(self.itos)})
        # New arguments sep here :
        self.sep = sep

    def numericalize(self, t:Collection[str]) -> List[int]:
        "Convert a list of tokens `t` to their ids."
        return [self.stoi[w] for w in t]

    def textify(self, nums:Collection[int], sep=None) -> List[str]:
        "Convert a list of `nums` to their tokens."
        # Choose the sep attribute if sep is not specified
        sep = ifnone(sep, self.sep)
        return sep.join([self.itos[i] for i in nums]) if sep is not None else [self.itos[i] for i in nums]

I have the following questions regarding this :

  • Can you think of a smarter way to set a new sep argument in the get call of the TextList in the current fastai library?
  • Is adding the sep argument to the Vocab Class a good idea that will not break anything in the library?
    If so, I’ll gladly make a PR regarding this.

I think having a sep attribute is fine but I would put it in TextList, and pass it along in get. Any PR with this would be welcome, thanks for the offer to help!

The PR has been pushed and is currently facing the checks :slightly_smiling_face: