Per-character tokenizer

I’m trying to implement a character-level tokenizer. What I do is:

class CharTokenizer(BaseTokenizer):
    def tokenizer(self, t):
        return list(t.lower())
tok = Tokenizer(tok_func=CharTokenizer, pre_rules=[], post_rules=[], special_cases=[])
data_clas = TextClasDataBunch.from_csv(path, 'classified_messages.csv', bs=4, tokenizer=tok)
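For reference, the tok_func above just lowercases the string and splits it into one token per character; the same logic in plain Python (without the fastai base class) looks like this:

```python
# Standalone sketch of what CharTokenizer.tokenizer does:
# lowercase the text, then split it into single-character tokens.
def char_tokenize(t):
    return list(t.lower())

print(char_tokenize("Hi!"))  # ['h', 'i', '!']
```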

I then ran into a problem: all messages are prefixed with Text xxbos xxfld 1. As far as I can see from the code, the place where these tokens are inserted is the _get_processor method:

def _get_processor(tokenizer:Tokenizer=None, vocab:Vocab=None, chunksize:int=10000, max_vocab:int=60000, min_freq:int=2, mark_fields:bool=True, **kwargs)
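For context, here is a rough standalone sketch of what that field-marking step does. This mimics, but is not, fastai’s actual joining code; BOS and FLD stand for the xxbos / xxfld markers:

```python
BOS, FLD = 'xxbos', 'xxfld'

def join_fields(fields, mark_fields=True):
    # Sketch of how the text columns of a row are joined into one string:
    # always prefix BOS; when mark_fields is True, number each field with
    # a FLD marker -- which is where the "xxfld 1" prefix comes from.
    if mark_fields:
        parts = [f'{FLD} {i + 1} {f}' for i, f in enumerate(fields)]
        return f'{BOS} ' + ' '.join(parts)
    return f'{BOS} ' + ' '.join(fields)

print(join_fields(['hello world']))                    # xxbos xxfld 1 hello world
print(join_fields(['hello world'], mark_fields=False)) # xxbos hello world
```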

As far as I understood, the only way to avoid inserting them is to provide mark_fields=False. So I tried passing mark_fields=False to the TextClasDataBunch.from_csv method:

data_clas = TextClasDataBunch.from_csv(path, 'classified_messages.csv', bs=4, tokenizer=tok, mark_fields=False)

But as a result I got an error:

~/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/fastai/text/ in create(cls, train_ds, valid_ds, test_ds, path, bs, pad_idx, pad_first, **kwargs)
241 collate_fn = partial(pad_collate, pad_idx=pad_idx, pad_first=pad_first)
242 train_sampler = SortishSampler(datasets[0].x, key=lambda t: len(datasets[0][t][0].data), bs=bs//2)
--> 243 train_dl = DataLoader(datasets[0], batch_size=bs//2, sampler=train_sampler, **kwargs)
244 dataloaders = [train_dl]
245 for ds in datasets[1:]:

TypeError: __init__() got an unexpected keyword argument 'mark_fields'

So maybe TextClasDataBunch.from_csv should have an explicit mark_fields argument? Or maybe there is another way to process the data without any prefixes?


If I understand correctly, we want a beginning-of-string token so we can process many strings at once (this token tells the model where separate strings begin). But the tokenizer needs to handle this token specially, so it isn’t split into characters:

class LetterTokenizer(BaseTokenizer):
    "Character-level tokenizer that keeps the BOS marker as a single token."
    def __init__(self, lang): pass
    def tokenizer(self, t:str) -> List[str]:
        out = []
        i = 0
        while i < len(t):
            if t[i:].startswith(BOS):
                # keep the whole BOS marker as one token
                out.append(BOS)
                i += len(BOS)
            else:
                out.append(t[i])
                i += 1
        return out
    def add_special_cases(self, toks:Collection[str]): pass
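To see the intent of that loop, here is a standalone sketch of the same idea in plain Python (BOS stands in for fastai’s xxbos marker):

```python
BOS = 'xxbos'  # fastai's beginning-of-string marker

def letter_tokenize(t):
    # Emit BOS as a single token; split everything else into characters.
    out, i = [], 0
    while i < len(t):
        if t[i:].startswith(BOS):
            out.append(BOS)
            i += len(BOS)
        else:
            out.append(t[i])
            i += 1
    return out

print(letter_tokenize('xxbos hi'))
# ['xxbos', ' ', 'h', 'i']
```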

But we still don’t want to mark the fields when there is only one. As you say, TextClasDataBunch.from_csv doesn’t pass mark_fields on to TokenizeProcessor. But TextClasDataBunch.from_df does; so if you first read the data in with pd.read_csv, you can do something like:

import string
all_letters = string.ascii_letters + " .,;'"
vocab=Vocab.create(all_letters, max_vocab=1000, min_freq=0)

tokenizer=Tokenizer(LetterTokenizer, pre_rules=[], post_rules=[])

data = TextClasDataBunch.from_df(path='.', train_df=train_df, valid_df=valid_df,
                         tokenizer=tokenizer, vocab=vocab,
                         mark_fields=False)

Another option is to use the flexible data block interface:

import string
all_letters = string.ascii_letters + " .,;'"
tokenizer=Tokenizer(LetterTokenizer, pre_rules=[], post_rules=[])
processors = [TokenizeProcessor(tokenizer=tokenizer, mark_fields=False),
            NumericalizeProcessor(vocab=Vocab.create(all_letters, max_vocab=1000, min_freq=0))]

data = (TextList.from_df(train_df, path='.', processor=processors)
        .split_by_rand_pct(0.2)   # illustrative split/label steps --
        .label_from_df(cols=0)    # adapt to your own dataframe
        .databunch(bs=64))

Hi, I have been trying to make a character-level language model using the fastai library. I’ve used your tokenizer, but it seems to almost run out of memory and gives the error “BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.”
Any help would be highly appreciated, or a link to someone who has made a character-level LM in fastai. Thank you.

Sorry - I’m not too sure what could be happening here.

A broken process pool sounds like it could be the dataloader. I’d try with a single worker (num_workers=0) and see if you can get that working - it’ll make it much easier to debug at least.

Here’s an example of a Character-Level RNN in fastai - but the version of fastai used is likely a bit outdated (1.12ish?)

Yes, I had followed that tutorial of yours; I got it from another thread in the forum. For now I have managed to get it working with only half the dataset; loading the entire dataset at once gives the error I mentioned. A character-level databunch seems to take up more memory than a word-level one. Not sure why.

Right - when you say run out of memory is that RAM or GPU memory?

Is it happening when training the language model or the classifier?

A character-level RNN will have fewer weights than a word-level one (the vocabulary of possible characters is thousands of times smaller than common word vocabulary sizes, so the embedding layer is much smaller), but will have more tokens (around 5x for English text).
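A quick way to see that token-count blow-up, using a whitespace split as a crude stand-in for a word tokenizer:

```python
# Compare character-level vs word-level token counts for a sample sentence.
text = "the quick brown fox jumps over the lazy dog"
char_tokens = list(text)    # one token per character (spaces included)
word_tokens = text.split()  # crude word tokenization

print(len(char_tokens), len(word_tokens))  # 43 9
print(len(char_tokens) / len(word_tokens))
```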

I’d expect RAM usage to be higher and GPU usage to be lower for language modelling. For classification you may need to truncate to a maximum number of characters to conserve GPU memory.
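One simple place to do that truncation is a pre-rule, since fastai pre-rules are just str -> str functions applied before tokenization. MAX_CHARS and trunc_rule are made-up names for this sketch:

```python
MAX_CHARS = 1000  # hypothetical cap -- tune to your GPU memory

def trunc_rule(t):
    # Pre-rule sketch: cut each document to at most MAX_CHARS characters
    # before tokenization, e.g. passed as
    # Tokenizer(LetterTokenizer, pre_rules=[trunc_rule], post_rules=[]).
    return t[:MAX_CHARS]

print(len(trunc_rule('a' * 5000)))  # 1000
```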

I run out of RAM. I used Colab. It happens when training the language model; the entire dataset fits fine for the classifier.