The Vocab
Class has a textify
method that is defined as follows :
def textify(self, nums:Collection[int], sep=' ') -> List[str]:
"Convert a list of `nums` to their tokens."
return sep.join([self.itos[i] for i in nums]) if sep is not None else [self.itos[i] for i in nums]
this allow the user to specify a separator that is different than a space for the reconstruction of the text from the numericalized version. For instance, in a character LM, where space is just a character among the others the empty separator makes more sense. Is is then possible to call vocab.textify(text, sep='')
to get the desired output. However, when calling the get method from the TextList
, the sep
keyword is not passed to textify :
def get(self, i):
o = super().get(i)
return Text(o, self.vocab.textify(o))
As a result, the texts is printed out wrong if you don’t want a space separator. One way around this is to subclass the Vocab
Class, or to override the textify
method to choose your default separator.
What could be done as well is to add a sep
attribute to a Vocab class which will be called by default by textify. I feel like this is the good place to put it since the separator will mostly depend on the vocab(at least I think so). The Vocab source would look like this :
class Vocab():
"Contain the correspondence between numbers and tokens and numericalize."
def __init__(self, itos:Collection[str], sep:str=' '):
self.itos = itos
self.stoi = collections.defaultdict(int,{v:k for k,v in enumerate(self.itos)})
# New arguments sep here :
self.sep = sep
def numericalize(self, t:Collection[str]) -> List[int]:
"Convert a list of tokens `t` to their ids."
return [self.stoi[w] for w in t]
def textify(self, nums:Collection[int], sep=None) -> List[str]:
"Convert a list of `nums` to their tokens."
# Choose the sep attribute if sep is not specified
sep = ifnone(sep, self.sep)
return sep.join([self.itos[i] for i in nums]) if sep is not None else [self.itos[i] for i in nums]
I have the following questions regarding this :
- Can you think of a smarter way to set a new
sep
argument in theget
call of theTextList
in the current fastai library? - Is adding the
sep
argument to theVocab
Class a good idea that will not break anything in the library?
If so, I’ll gladly make a PR regarding this.