Language model databunch is adding tokens to my data

I trained a language model following the tutorials, but when I try to get predictions, the same two characters appear in the output each time.
For instance:

TEXT = ''
learn.predict(TEXT, 6, temperature=0.75, sep='')

learn.predict(TEXT, 6, temperature=0.75, sep='')

I realized the problem is in my databunch: the string 'os' is being added to the beginning of every batch. I do have the tokens 'o' and 's' in my vocabulary, but they never appear like this in my data. In addition, when I call my Tokenizer directly, the tokens are perfect and match what I expect.


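from fastai.text import *  # fastai v1, as in the tutorials

# MolTokenizer is my custom tokenizer class (definition not shown here)
sample = 'CC(C)C(CO)NCc1cc(C(F)(F)F)cc(-c2ccc(C(F)(F)F)nc2)n1'
tokenizer = Tokenizer(MolTokenizer, pre_rules=[], post_rules=[])

tok = MolTokenizer()

# Join the tokens back together to compare against the input string
print(''.join(tokenizer.process_text(sample, tok)))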


Result of data.show_batch():


The 'os' is added right after the first character ('!').
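One hypothesis that could be tested here, stated as an assumption rather than a confirmed diagnosis: fastai's text pipeline uses special tokens such as `xxbos`, and if a marker like that is prepended to each text before the custom tokenizer runs, a character-level tokenizer that drops unknown characters would reduce it to 'os'. The sketch below reproduces that effect without fastai. `toy_tokenize`, `VOCAB`, `SPECIAL` and `marker_aware_tokenize` are hypothetical stand-ins for illustration, not the real MolTokenizer or any library API.

```python
# Hypothetical stand-in for a character-level SMILES tokenizer:
# it keeps only characters in a small vocabulary and drops the rest.
VOCAB = set("CNOFcnos()=#-123456789")  # assumed vocabulary, includes 'o' and 's'

def toy_tokenize(text):
    """Split text into single-character tokens, discarding unknown characters."""
    return [ch for ch in text if ch in VOCAB]

# Assumed marker that the data pipeline may prepend before tokenization.
MARKER = "xxbos"

# Hypothetical set of special tokens that should pass through untouched.
SPECIAL = {"xxbos", "xxpad", "xxunk"}

def marker_aware_tokenize(text):
    """Tokenize piece by piece, keeping special tokens whole."""
    tokens = []
    for piece in text.split(" "):
        if piece in SPECIAL:
            tokens.append(piece)
        else:
            tokens.extend(toy_tokenize(piece))
    return tokens

sample = "CCO"
print(toy_tokenize(sample))                          # ['C', 'C', 'O']
print(toy_tokenize(f"{MARKER} {sample}"))            # ['o', 's', 'C', 'C', 'O']
print(marker_aware_tokenize(f"{MARKER} {sample}"))   # ['xxbos', 'C', 'C', 'O']
```

If the real tokenizer behaves like `toy_tokenize` on a string such as `'xxbos ' + sample`, that would be consistent with the stray 'os'. Calling the tokenizer directly on the raw sample would not show it, because no marker is present in that case.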

Has anybody seen this before? How can I solve it?