I trained a language model following the tutorials, but when I try to get predictions, the same 2 characters appear in the output.
For instance:
TEXT = ''
learn.predict(TEXT, 6, temperature=0.75, sep='')
'GosCCC'
TEXT = 'GN'
learn.predict(TEXT, 6, temperature=0.75, sep='')
'GNGosCCC'
I realized the problem was my databunch: the string 'os' is being added to the beginning of every batch. I do have tokens 'o' and 's', but they never appear together like this in my data. In addition, when I call my Tokenizer directly, the tokens are perfect and match what I expected.
Tokenizer:
sample = 'CC(C)C(CO)NCc1cc(C(F)(F)F)cc(-c2ccc(C(F)(F)F)nc2)n1'
tokenizer = Tokenizer(MolTokenizer, pre_rules=[], post_rules=[])
tok = MolTokenizer()
print(''.join(tokenizer.process_text(sample, tok)))
'!CC(C)C(CO)NCc1cc(C(F)(F)F)cc(-c2ccc(C(F)(F)F)nc2)n1EAAAAAAAAAAAAAAAAAAAAAAAA'
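For context, my MolTokenizer behaves essentially like the character-level sketch below (a minimal standalone approximation, not the actual fastai class; the '!'/'E' begin/end markers, the 'A' padding token, and the fixed length of 76 are assumptions based on the output above):

```python
# Minimal sketch of a character-level SMILES tokenizer.
# '!' = beginning-of-sequence, 'E' = end-of-sequence, 'A' = padding;
# these markers and the pad length are assumptions for illustration.

def char_tokenize(smiles, pad_to=76):
    """Split a SMILES string into single-character tokens,
    wrap it in BOS/EOS markers, and right-pad with 'A'."""
    tokens = ['!'] + list(smiles) + ['E']
    tokens += ['A'] * (pad_to - len(tokens))
    return tokens

sample = 'CC(C)C(CO)NCc1cc(C(F)(F)F)cc(-c2ccc(C(F)(F)F)nc2)n1'
print(''.join(char_tokenize(sample)))
```

Note that nothing here ever emits an 'o' next to an 's', which is why the 'os' in the batches surprised me.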
Result of data.show_batch():
!osCC(C)C(CO)NCc1cc(C(F)(F)F)cc(-c2ccc(C(F)(F)F)nc2)n1EAAAAAAAAAAAAAAAAAAAAAA
The 'os' is added right after the first character ('!').
Has anybody seen this before? How can I solve it?