In the Tokenizer class in fastai/text/transform.py, we have the following code:
class Tokenizer():
    "Put together rules and a tokenizer function to tokenize text with multiprocessing."
    def __init__(self, tok_func:Callable=SpacyTokenizer, lang:str='en', pre_rules:ListRules=None,
                 post_rules:ListRules=None, special_cases:Collection[str]=None, n_cpus:int=None):
        self.tok_func,self.lang,self.special_cases = tok_func,lang,special_cases
        self.pre_rules = ifnone(pre_rules, defaults.text_pre_rules )
        self.post_rules = ifnone(post_rules, defaults.text_post_rules)
        self.special_cases = special_cases if special_cases else defaults.text_spec_tok
        self.n_cpus = ifnone(n_cpus, defaults.cpus)

    ...

    def _process_all_1(self, texts:Collection[str]) -> List[List[str]]:
        "Process a list of `texts` in one process."
        tok = self.tok_func(self.lang)
        if self.special_cases: tok.add_special_cases(self.special_cases)
        return [self.process_text(str(t), tok) for t in texts]
As a result, if I want to create a Tokenizer with no special cases and I pass special_cases=[] to the constructor, self.special_cases still falls back to defaults.text_spec_tok, because an empty list is falsy. The tokenizer then ends up with special cases anyway. This seems like unintended behavior to me.
I would suggest replacing

    self.special_cases = special_cases if special_cases else defaults.text_spec_tok

with

    self.special_cases = special_cases if special_cases is not None else defaults.text_spec_tok
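The difference is easy to reproduce outside fastai. Here is a minimal standalone sketch of the two behaviors; DEFAULT_SPECIAL_TOKENS is a hypothetical stand-in for defaults.text_spec_tok:

```python
# Hypothetical stand-in for defaults.text_spec_tok
DEFAULT_SPECIAL_TOKENS = ['xxunk', 'xxpad']

def current_behavior(special_cases=None):
    # Truthiness check: an explicit empty list is falsy,
    # so it is silently replaced by the defaults.
    return special_cases if special_cases else DEFAULT_SPECIAL_TOKENS

def proposed_behavior(special_cases=None):
    # Identity check against None: an explicit empty list
    # is kept; only None falls back to the defaults.
    return special_cases if special_cases is not None else DEFAULT_SPECIAL_TOKENS

print(current_behavior([]))    # → ['xxunk', 'xxpad']  (defaults leak in)
print(proposed_behavior([]))   # → []                  (empty list respected)
print(proposed_behavior(None)) # → ['xxunk', 'xxpad']  (default still applies)
```

Both versions behave identically when special_cases is None or a non-empty list; they only differ for the empty-list case.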
I don’t think this change has unexpected consequences, but I might be wrong. If you think this change should be made, I can open a PR (by adding 9 characters).