I’m trying to implement a character tokenizer. What I do is:
class CharTokenizer(BaseTokenizer):
    def tokenizer(self, t):
        return list(t.lower())

tok = Tokenizer(tok_func=CharTokenizer, pre_rules=[], post_rules=[], special_cases=[])
data_clas = TextClasDataBunch.from_csv(path, 'classified_messages.csv', bs=4, tokenizer=tok)
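As a quick sanity check, calling the tokenizer directly does what I expect (BaseTokenizer’s constructor takes a lang argument in the fastai v1 source I’m looking at, so I pass a placeholder):

# Standalone sanity check of the character tokenizer
# ('en' is just a placeholder lang; BaseTokenizer stores it and nothing more here)
ct = CharTokenizer('en')
print(ct.tokenizer("Hi!"))   # ['h', 'i', '!']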
But I ran into a problem: every message is prefixed with Text xxbos xxfld 1. As far as I can see from the code, the place where these tokens are inserted is the _get_processor function:
def _get_processor(tokenizer:Tokenizer=None, vocab:Vocab=None, chunksize:int=10000, max_vocab:int=60000, min_freq:int=2, mark_fields:bool=True, **kwargs)
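If I read the body of that function correctly, it builds the processing pipeline roughly like this (my paraphrase of the fastai v1 source, so the details may differ between versions):

# Paraphrased from fastai.text.data._get_processor (version-dependent):
# a tokenization step followed by a numericalization step, where
# mark_fields controls whether the xxfld markers get inserted.
def _get_processor(tokenizer=None, vocab=None, chunksize=10000, max_vocab=60000,
                   min_freq=2, mark_fields=True, **kwargs):
    return [TokenizeProcessor(tokenizer=tokenizer, chunksize=chunksize, mark_fields=mark_fields),
            NumericalizeProcessor(vocab=vocab, max_vocab=max_vocab, min_freq=min_freq)]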
As far as I understand, the only way to avoid inserting them is to pass mark_fields=False. So what I tried was to provide mark_fields=False to the TextClasDataBunch.from_csv method:
data_clas = TextClasDataBunch.from_csv(path, 'classified_messages.csv', bs=4, tokenizer=tok, mark_fields=False)
But as a result I got an error:
~/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/fastai/text/data.py in create(cls, train_ds, valid_ds, test_ds, path, bs, pad_idx, pad_first, **kwargs)
    241         collate_fn = partial(pad_collate, pad_idx=pad_idx, pad_first=pad_first)
    242         train_sampler = SortishSampler(datasets[0].x, key=lambda t: len(datasets[0][t][0].data), bs=bs//2)
--> 243         train_dl = DataLoader(datasets[0], batch_size=bs//2, sampler=train_sampler, **kwargs)
    244         dataloaders = [train_dl]
    245         for ds in datasets[1:]:

TypeError: __init__() got an unexpected keyword argument 'mark_fields'
Judging by the traceback, from_csv forwards unrecognized keyword arguments down to DataBunch.create, which passes them straight through to DataLoader, hence the TypeError. So maybe TextClasDataBunch.from_csv should have an explicit mark_fields argument? Or maybe there is another way to process the data without any prefixes?
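In the meantime, the workaround I’m considering is to build the processors myself and go through the data block API, something like this (an untested sketch: I’m assuming TokenizeProcessor accepts mark_fields directly, that TextList.from_csv forwards a processor argument, and that my CSV has columns named 'text' and 'label', which you'd adjust to your file):

from fastai.text import *

# Build the processors explicitly so mark_fields is never defaulted to True.
processor = [TokenizeProcessor(tokenizer=tok, mark_fields=False),
             NumericalizeProcessor(max_vocab=60000, min_freq=2)]

data_clas = (TextList.from_csv(path, 'classified_messages.csv', cols='text',
                               processor=processor)
             .split_by_rand_pct(0.2)       # method name may differ by fastai version
             .label_from_df(cols='label')
             .databunch(bs=4))

Does that look like a reasonable way around it, or am I missing a simpler option?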