Create Language Model for Chemical Structures

How do we improve an existing drug using deep learning? How can we redesign a drug so that it interferes with the interaction between two proteins? How can we use deep learning to design peptides that disrupt protein/protein interactions? Can you please point me to some good deep learning papers in this space.

… completely agree; I was looking into molecular graphs a while ago.

Can anyone in the group help me with this?

I adopted the ULMFiT method and applied it to chemical property prediction tasks. I pre-trained a ‘language model’ on 1 million molecules (SMILES) from ChEMBL and then fine-tuned the pre-trained model on other chemical property prediction tasks.

Inductive Transfer Learning for Molecular Activity Prediction: Next-Gen QSAR Models with MolPMoFiT
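For anyone who wants to try something similar, a minimal sketch of that two-stage pipeline in fastai v1 might look like this. The file names, column names and hyperparameters are illustrative rather than the actual MolPMoFiT code, and MolTokenizer stands for a character-level SMILES tokenizer like the ones shown later in this thread:

from fastai.text import *

# Stage 1: pre-train a SMILES 'language model' on the large unlabeled set
tok = Tokenizer(tok_func=MolTokenizer, pre_rules=[], post_rules=[])
data_lm = TextLMDataBunch.from_csv('.', 'chembl_1M.csv', text_cols='smiles',
                                   tokenizer=tok, min_freq=1)
lm_learner = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.3)
lm_learner.fit_one_cycle(10, 3e-3)
lm_learner.save_encoder('smiles_encoder')

# Stage 2: fine-tune the pre-trained encoder on a labeled property-prediction set
data_clas = TextClasDataBunch.from_csv('.', 'property_task.csv', text_cols='smiles',
                                       label_cols='label', vocab=data_lm.train_ds.vocab,
                                       tokenizer=tok)
clas_learner = text_classifier_learner(data_clas, AWD_LSTM, pretrained=False, drop_mult=0.5)
clas_learner.load_encoder('smiles_encoder')
clas_learner.fit_one_cycle(5, 2e-3)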

4 Likes

@jeremy, I am working extensively on this as a part of my post doc. If you would like any information, I would be happy to share.

@aparente, do you have any example code implementing graph convs for chemical structures in fastai I could take a look at?

@Xinhao, I noticed that in your work you enumerated the SMILES before training and merged all the files into one training set. It is more customary to apply data augmentation in an online setting, so I am currently working on implementing SMILES enumeration and sampling as custom callbacks. Here is my current code:

# Imports assumed by these callbacks (MolTokenizer is the custom SMILES tokenizer shown later in this thread)
import random
import numpy as np
import torch
import torch.nn.functional as F
from rdkit import Chem
from fastai.text import *

class SampleSMILES(LearnerCallback):
    def __init__(self, learn:Learner, path, vocab, debug):
        super().__init__(learn)
        self.path, self.vocab, self.debug = path, vocab,debug
        self.encode_dict = MolTokenizer(lang='en').encode_dict
        self.max_seq_length = 150
    def confirm_vocab(self, epoch):
        if( self.learn.data.train_ds.x.vocab != self.vocab):
            print('non equal vocabs in sample smiles on epoch:', epoch )
        else:
            print('we have passed vocab check in sample smiles on epoch:', epoch )
        print('print vocab for epoch end:', epoch, self.vocab.stoi)
    def log_sampler_results(self, smiles, batch_sample, epoch):
        #----log number of valid compounds made on this epoch
        valid = 0
        for smi in smiles:
            mol = Chem.MolFromSmiles(smi)
            if( smi != '' and mol is not None and mol.GetNumAtoms() > 0 ):
                valid+=1
        f1 = open(self.path + 'valid_smiles.txt','a')
        if( epoch == 0):
            f1.write('number of valid smiles, batch sample size, max seq length, epoch' + '\n')
        f1.write( str(valid) + ',' + str(batch_sample) + ',' + str(self.max_seq_length) + ',' + str( epoch ) + '\n')
        f1.close()
        if( self.debug==True):
            print('number of valid compounds:', valid)
    def decode_smi(self, smiles ):
        #---replace encoded tokens with chemicals
        temp_smiles = smiles
        for symbol, token in self.encode_dict.items():
            temp_smiles = temp_smiles.replace(token,symbol)
        return temp_smiles
    def action_to_smiles(self, array, epoch):
        #---convert action tensor to smiles
        smiles_strings = []
        for row in array:
            predicted_chars = []
            for j in row:
                next_char = self.vocab.itos[j.item()]
                if next_char == 'END':
                    break
                predicted_chars.append(next_char)
            smi = ''.join(predicted_chars)
            smi = self.decode_smi(smi)
            smiles_strings.append(smi)
        if( self.debug == True):
            print('we are now writing to file')
            f1 = open(self.path + 'a2s_'+str(epoch) + '.txt','w')
            for smi in smiles_strings:
                if( ' ' in smi):
                    print('we have a space in smi:', smi)
                f1.write( smi.strip() + '\n')
            f1.close()
        return smiles_strings
    def sampler(self,  epoch):
        #---sample batch of compounds at end of epoch
        self.learn.model.eval()
        with torch.no_grad():
            batch_sample=1024
            seqs_gen = ['']*batch_sample
            go_int = self.learn.data.train_ds.x.vocab.stoi['GO']
            xb = torch.from_numpy( np.array( [go_int]*batch_sample ) ).to(device='cuda').unsqueeze(1)
            yb = torch.from_numpy( np.array( [go_int]*batch_sample ) ).to(device='cuda').unsqueeze(1)
            actions = torch.zeros((batch_sample, self.max_seq_length), dtype=torch.long).to(device='cuda')
            for i in range(0, self.max_seq_length):
                output = self.learn.model(xb)[0].squeeze()
                output_probs = F.softmax(output, dim=1)
                m = torch.distributions.Multinomial(probs=output_probs)
                action = m.sample()
                idx = action.nonzero()[:,1]
                actions[:,i] = idx
                xb = idx.unsqueeze(1)
        smiles = self.action_to_smiles(actions, epoch)
        self.log_sampler_results( smiles , batch_sample,  epoch)
        self.learn.model.train()
    def export_learner(self, epoch):
        self.learn.export('learner_'+str(epoch) + '.pkl')
    def on_epoch_end(self, **kwargs):
        #===unpack kwargs
        epoch = kwargs['epoch']
        print('beginning sample:', self.max_seq_length, epoch)
        if( self.debug == True):
            self.confirm_vocab(epoch)
            self.export_learner(epoch)
        self.sampler(epoch)
        if(self.debug == True):
            print('mode of model:', self.learn.model.training)
            print('we have completed sampler')

class ShuffleSMILES(LearnerCallback):
    def __init__(self, learn:Learner, path, bs, vocab, tok, train_data, valid_data, debug):
        super().__init__(learn)
        self.path, self.bs, self.vocab, self.tok = path, bs, vocab, tok
        self.train_data, self.valid_data = train_data, valid_data
        self.mol_tok = MolTokenizer(lang='en')
        self.debug = debug
        print('init shuffle smiles:', self.path, self.bs, self.path)
        print('init shuffle smiles vocab:')
        print( self.vocab.itos)
    def confirm_vocab(self, epoch):
        if( self.learn.data.train_ds.x.vocab != self.vocab):
            print('non equal vocabs in shuffle smiles on epoch:', epoch )
        else:
            print('we have passed vocab check in shuffle smiles on epoch:', epoch )
        print('print vocab for epoch:', epoch, self.vocab.stoi)
    def check_vocab(self, smiles):
        tokens = self.tok.process_text(smiles, self.mol_tok)
        for tt in tokens:
            if( tt not in self.vocab.stoi.keys() ):
                return 0
        return 1
    def shuffle_pd(self, df, epoch):
        smiles = list( df.smiles )
        random_smiles = []
        for smi in smiles:
            try:
                mol = Chem.MolFromSmiles(smi)
                new_atom_order = list(range(mol.GetNumAtoms()))
                random.shuffle(new_atom_order)
                random_mol    = Chem.RenumberAtoms(mol, newOrder=new_atom_order)
                random_smi    = Chem.MolToSmiles(random_mol, canonical=False, isomericSmiles=False)
                in_vocab = self.check_vocab(random_smi)
                if( in_vocab == 1):
                    random_smiles.append(random_smi)
                else:
                    random_smiles.append(smi)
                    if( self.debug == True):
                        print('we have generated a compound with a new token:', smi, epoch)
            except:
                if( self.debug == True):
                    print('failed for:', smi)
                random_smiles.append(smi)
        df['random_smiles'] = random_smiles
        return df
    def on_epoch_begin(self, **kwargs):
        #===get state
        epoch = kwargs['epoch']
        print('beginning shuffle on :', epoch)
        #===shuffle and prep data
        self.train_data = self.shuffle_pd(self.train_data, epoch)
        self.valid_data = self.shuffle_pd(self.valid_data, epoch)
        self.train_data['r_len'] = self.train_data.random_smiles.apply(lambda x: len(x) )
        self.valid_data['r_len'] = self.valid_data.random_smiles.apply(lambda x: len(x) )
        print('max length on epoch:', self.train_data.r_len.max(), epoch)
        #===dump our new data frames to files
        if(self.debug == True):
            print('saving data for later analysis')
            self.train_data.to_csv(self.path + 'train_'+str(epoch) + '.csv.gzip',index=None,compression='gzip')
            self.valid_data.to_csv(self.path + 'valid_'+str(epoch) + '.csv.gzip',index=None,compression='gzip')
        #====prepare databunch
        self.newData = TextLMDataBunch.from_df(self.path, self.train_data, self.valid_data, text_cols='random_smiles', bs=self.bs, tokenizer=self.tok, vocab=self.vocab, min_freq=1, include_bos=False, include_eos=False)
        self.learn.data.train_dl.dl, self.learn.data.valid_dl.dl = self.newData.train_dl.dl, self.newData.valid_dl.dl
        if( self.debug == True):
            self.confirm_vocab(epoch)
            self.newData.save( 'newData_' + str(epoch)+'.pkl')
            torch.save( self.learn.data.train_ds.x.vocab.stoi, self.path + '/' + 'vocab_' + str(epoch) + '.pkl')
        print('we have completed on epoch begin')

It is currently giving me bizarre training behavior, however. The validation loss is continually decreasing, but the number of valid molecules sampled after each epoch of training climbs to roughly 90% by about epoch 13 and then suddenly starts getting worse each epoch. For example, here are my training stats (epoch, cross-entropy training loss, cross-entropy validation loss, accuracy, time):

we have completed on epoch begin
0         0.735757    0.734926    0.734893  58:55     
--
we have completed on epoch begin
1         0.692634    0.692826    0.748313  59:24     
--
we have completed on epoch begin
2         0.677761    0.676239    0.753382  1:00:02   
--
we have completed on epoch begin
3         0.673893    0.674327    0.753982  1:00:17   
--
we have completed on epoch begin
4         0.682018    0.680836    0.751632  1:00:18   
--
we have completed on epoch begin
5         0.690983    0.690529    0.748319  1:00:11   
--
we have completed on epoch begin
6         0.693963    0.692387    0.747694  1:01:02   
--
we have completed on epoch begin
7         0.687516    0.685323    0.749805  1:00:52   
--
we have completed on epoch begin
8         0.677629    0.676371    0.752961  1:00:31   
--
we have completed on epoch begin
9         0.670770    0.668978    0.755372  1:01:01   
--
we have completed on epoch begin
10        0.663913    0.662514    0.757452  1:01:11   
--
we have completed on epoch begin
11        0.657783    0.657987    0.759115  1:01:29   
--
we have completed on epoch begin
12        0.654865    0.652931    0.760745  1:01:13   
--
we have completed on epoch begin
13        0.649611    0.648693    0.761964  1:01:12   
--
we have completed on epoch begin
14        0.646384    0.644538    0.763589  1:01:36   
--
we have completed on epoch begin
15        0.642046    0.640353    0.764990  1:01:32   
--
we have completed on epoch begin
16        0.640199    0.638433    0.765466  1:01:08   
--
we have completed on epoch begin
17        0.637574    0.634706    0.766821  1:01:38   
--
we have completed on epoch begin
18        0.634607    0.633236    0.767360  1:01:24   
--
we have completed on epoch begin
19        0.632348    0.630206    0.768307  1:01:16   

and here are the number of valid smiles generated after each epoch:

number of valid smiles, batch sample size, max seq length, epoch
625,1024,150,0
766,1024,150,1
817,1024,150,2
686,1024,150,3
865,1024,150,4
862,1024,150,5
667,1024,150,6
825,1024,150,7
926,1024,150,8
877,1024,150,9
922,1024,150,10
926,1024,150,11
927,1024,150,12
918,1024,150,13
731,1024,150,14
407,1024,150,15
0,1024,150,16
1,1024,150,17
10,1024,150,18
339,1024,150,19

I am not sure what is going on. If anyone sees a bug in my code, please let me know.

I haven’t seen any molecular graph conv implementations in fastai and have also been interested in this. DeepChem is currently the best tool for GCNs that I know of, but it is being updated to TF2 and a lot of changes will come with that. If you find any implementations in fastai, please let me know!
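In the meantime, here is a rough sketch of the core message-passing idea in plain PyTorch with RDKit. This is not a fastai or DeepChem API; the featurisation and layer are deliberately minimal and only meant to illustrate the idea:

import torch
import torch.nn as nn
from rdkit import Chem

class SimpleGCNLayer(nn.Module):
    "One graph-convolution step: mean-aggregate neighbour features, then a linear transform."
    def __init__(self, n_in, n_out):
        super().__init__()
        self.lin = nn.Linear(n_in, n_out)
    def forward(self, x, adj):
        # x: [n_atoms, n_in] atom features, adj: [n_atoms, n_atoms] adjacency with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg))

def mol_to_tensors(smiles):
    "Very crude featurisation: one feature per atom (its atomic number)."
    mol = Chem.MolFromSmiles(smiles)
    adj = torch.tensor(Chem.GetAdjacencyMatrix(mol), dtype=torch.float)
    adj = adj + torch.eye(adj.shape[0])   # add self-loops
    x = torch.tensor([[a.GetAtomicNum()] for a in mol.GetAtoms()], dtype=torch.float)
    return x, adj

x, adj = mol_to_tensors('c1ccccc1O')      # phenol, 7 heavy atoms
layer = SimpleGCNLayer(1, 16)
print(layer(x, adj).shape)                # torch.Size([7, 16])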

1 Like

I managed to train a language model to create molecules from SMILES. It took me a whole day, because some things had to be customized. I believe @jeremy will mention them in Part 2, but I still haven’t had time to try.
To summarize my approach, there are a few things you can do:

  1. Create a custom Tokenizer to process the SMILES character by character. For instance, c1ccccc1 (i.e., benzene) should produce only two distinct tokens: c and 1. I took the idea from the LetterTokenizer in this notebook.

  2. Define a function to mark the start and end of each SMILES. For instance, !c1ccccc1N? indicates the chain starts with c (aromatic carbon) and ends with N (aliphatic nitrogen). You can pass this function to the Tokenizer as a pre rule. It also helps to define post rules to convert the predictions back into real SMILES (see the sketch after this list).

  3. Decide how to deal with multi-character tokens like nH and the halogens Br and Cl. Each of these should be treated as a single token.

  4. Datablock API is your friend :slight_smile:
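
To illustrate item 2, a minimal sketch of such rules in fastai v1 could look like this (the '!'/'?' markers and function names are only examples, not the exact code I used):

def mark_ends(t:str) -> str:
    "Pre rule (str -> str): wrap each raw SMILES with explicit start/end markers."
    return '!' + t + '?'

def remove_markers(toks:Collection[str]) -> Collection[str]:
    "Post rule (tokens -> tokens): drop the markers when converting back to real SMILES."
    return [t for t in toks if t not in ('!', '?')]

# hypothetical usage with a character-level tokenizer like the one shown further down:
# tok = Tokenizer(tok_func=MolTokenizer, pre_rules=[mark_ends], post_rules=[remove_markers])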

As a side note, don’t worry if the model isn’t returning many viable molecules, at least for now. You can always optimize it or use brute force to generate as many molecules as you wish.

Good luck!

Here’s my show_batch

data.show_batch()

idx text
0 = O ) C S c 2 n c ( N ) c ( N C ( = O ) c 3 c c c ( O C ) c c 3 ) c ( = O ) [ Nh ] 2 ) s 1 xxpad C O c 1 c c c ( C ( = O ) N c 2 c ( N ) n c ( S
1 O ) c 2 c c c c c 2 1 xxpad O = c 1 c ( C c 2 c c c ( F ) c c 2 ) c ( O ) c c n 1 C 1 C C C C C 1 xxpad C C C C c 1 c c c 2 c ( c 1 ) c ( O ) c ( C
2 ( = O ) C C 2 C ( = O ) N C C N 2 C c 2 c c c c c 2 C ) o 1 xxpad C O c 1 c c c ( N C 2 C C C N ( C ( = O ) C C c 3 c n [ Nh ] c 3 ) C 2 ) c c 1
3 n n c 3 c c c ( N C c 4 c c c o 4 ) n n 2 3 ) c c 1 xxpad O = C ( C C c 1 n n c 2 c c c ( N C c 3 c c c o 3 ) n n 1 2 ) N c 1 c c c ( F ) c c 1
4 1 c ( C ) c 2 c c c ( O C ( C ) C ( = O ) N 3 C C [ C @@ ] 4 ( O ) C C C C [ C @ H ] 4 C 3 ) c ( C ) c 2 o c 1 = O xxpad C O c 1 c c c 2 [ Nh ] c

And here are some generated molecules.

3 Likes

Hi Marcos,
I found your approach 2 (marking the start and end of the SMILES string) interesting. Currently I am using xxbos, but it does not really work: when generating SMILES strings, many of them contain xxbos somewhere in the middle.

In your show_batch I saw neither “!” nor “?”. Aren’t they added during tokenization? What is xxpad used for?

Could you please show me how you implemented the tokenizer pre-rule?

I would not use an end character! I just spent weeks debugging this error:

Hi @Pepper1709
I changed the code a bit but I’m still using start (G), end (E) and padding (A) tokens. In summary, I’m trying to implement the approach described here.
When I used xxbos and xxeos my results were not so good. My current approach is:

  1. Create a very small vocab, only with common elements in drug-like molecules (e.g., C, H, N, O and halogens)

  2. Tokenize each molecule letter-by-letter. Every atom is a token.

  3. Pad the tokens with A’s to match the length of the longest SMILES

  4. Add start and end tokens

The model can create structures like I showed you, but many are still not valid. It’s just like Jeremy showed us in Lesson 4. It’s a start and needs optimization.

Here’s the current version of my databunch:

idx text
0 ) c ( N c 3 n c c c c 3 C ( = O ) O )
1 1 c ( N C c 3 c ( C ) c c c c 3 Cl ) n
2 A A A A A A A A A A A A A A A A A A A A
3 ( C O ) C ( C C C 1 2 ) C 3 E A A A A A
4 G o s C C O C ( = O ) c 1 n n ( C ( = O
5 A A A A A A A A A A A A A A A A A A A A
6 1 E A A A A A A A A A A A A A A A A A
7 ( Cl ) c c 2 ) N 1 C C N c 1 c c n
8 1 c c c ( N C ( = O ) N C 2 N = C (
9 1 E A A A A A A A A A A A A A A A A A A

And the tokenizer:

#https://gist.github.com/EdwardJRoss/86b31848a7951411de56f10f55e9de4e
class MolTokenizer(BaseTokenizer):
    "Character level tokenizer function."
    def __init__(self, lang:str='no_lang'):
        super().__init__(lang=lang)

        atoms = ['Br', 'C', 'Cl', 'F', 'H', 'I', 'N', 'O', 'P', 'S']

        special = ['(', ')', '[', ']', '=', '#', '%', '0',
                   '1', '2', '3', '4', '5', '6', '7', '8', '9',
                   '+', '-', 'c', 'n', 'o', 's']
        padding = ['G', 'A', 'E']

        self.table = sorted(atoms, key=len, reverse=True) + special + padding

        # two-character tokens (Br, Cl) must be matched before single characters
        self.double_chars = list(filter(lambda x: len(x) == 2, self.table))
        self.single_chars = list(filter(lambda x: len(x) == 1, self.table))

    def tokenizer(self, t:str) -> List[str]:
        out = []
        i = 0
        while i < len(t):
            char1 = t[i]
            char2 = t[i:i+2]

            if char2 in self.double_chars:
                out.append(char2)
                i += 2
                continue

            if char1 in self.single_chars:
                out.append(char1)
                i += 1
                continue
            i += 1
        return ['G'] + out + ['E'] + ['A' for _ in range(75 - len(out))] # 75 = length of longest SMILES. Hard-coded because I was in a hurry.

    def add_special_cases(self, toks:Collection[str]):
        pass
1 Like

hey @cdparks
thank you for the tip!
Can you share the code of the custom sampler with us? I see you used multi-character start and end tokens (‘GO’, ‘END’); I want to check whether the same problem happens with single-character start and end tokens.

Sure @marcossantana, the code was in the link I posted, but here it is as a callback without using an END token.

> class SampleSMILES(LearnerCallback):
>     def __init__(self, learn:Learner, path, vocab, debug, num_sample):
>         super().__init__(learn)
>         print('we have created smiles sampler callback')
>         self.path, self.vocab, self.debug = path, vocab,debug
>         self.encode_dict = MolTokenizer(lang='en').encode_dict
>         self.max_seq_length = 150
>         self.batch_size = 1024
>         self.go_int = self.vocab.stoi['GO']
>         self.num_sample = num_sample
>     def confirm_vocab(self, epoch):
>         if( self.learn.data.train_ds.x.vocab != self.vocab):
>             print('non equal vocabs in sample smiles on epoch:', epoch )
>         else:
>             print('we have passed vocab check in sample smiles on epoch:', epoch )
>         print('print vocab for epoch end:', epoch, self.vocab.stoi)
>     def log_sampler_results(self, smiles, batch_sample, epoch):
>         #----log number of valid compounds made on this epoch
>         valid = 0
>         for smi in smiles:
>             mol = Chem.MolFromSmiles(smi)
>             if( smi != '' and mol is not None and mol.GetNumAtoms() > 0 ):
>                 valid+=1
>         return valid
>     def decode_smi(self, smiles ):
>         #---replace encoded tokens with chemicals
>         temp_smiles = smiles
>         for symbol, token in self.encode_dict.items():
>             temp_smiles = temp_smiles.replace(token,symbol)
>         return temp_smiles
>     def action_to_smiles(self, array, epoch):
>         #---convert action tensor to smiles
>         smiles_strings = []
>         for row in array:
>             predicted_chars = []
>             for j in row:
>                 next_char = self.vocab.itos[j.item()]
>                 if next_char == 'GO':
>                     break
>                 predicted_chars.append(next_char)
>             smi = ''.join(predicted_chars)
>             smi = self.decode_smi(smi)
>             smiles_strings.append(smi)
>         return smiles_strings
>     def sampler(self,  epoch, current_batch_size):
>         #---sample batch of compounds at end of epoch
>         seqs_gen = ['']*self.batch_size
>         go_int = self.go_int   # set from the vocab in __init__
>         xb = np.array( [go_int]*self.batch_size )
>         xb = torch.from_numpy(xb).to(device='cuda').unsqueeze(1)
>         actions = torch.zeros((self.batch_size, self.max_seq_length), dtype=torch.long).to(device='cuda')
>         self.learn.model.eval()
>         self.learn.model.reset()
>         with torch.no_grad():
>             for i in range(0, self.max_seq_length):
>                 output = self.learn.model(xb)[0].squeeze()
>                 output_probs = F.softmax(output, dim=-1)
>                 output_probs[:, self.learn.data.train_ds.x.vocab.stoi[UNK]] = 0
>                 action = torch.multinomial(output_probs,num_samples=1)
>                 xb = action
>                 actions[:,i] = action.squeeze()
>                 if( torch.sum(action) == 0):
>                     break
>         smiles = self.action_to_smiles(actions, epoch)
>         return smiles
>     def reset_model(self):
>         self.learn.model.reset()
>         self.learn.model.train()
>     def run(self, num, epoch):
>         """
>         Samples the model for the given number of SMILES.
>         :params num: Number of SMILES to sample.
>         """
>         num_batches = math.ceil(num / self.batch_size)
>         molecules_left = num
>         smiles = []
>         for _ in range(num_batches):
>             current_batch_size = min(molecules_left, self.batch_size)
>             smiles += self.sampler(epoch, current_batch_size)
>             molecules_left -= current_batch_size
>         valid = self.log_sampler_results( smiles , num,  epoch)
>         print( valid, num )
>         #print('check gradients:', self.learn.model[0].encoder.weight.grad)
>         self.reset_model()
>         return valid
>     def on_epoch_end(self, **kwargs):
>         #===unpack kwargs
>         epoch = kwargs['epoch']
>         print('beginning sample:', self.max_seq_length, epoch)
>         self.run(self.num_sample, epoch)
>         self.confirm_vocab(epoch)
>         print('mode of model:', self.learn.model.training)
>         print('check gradients:', self.learn.model[0].encoder.weight.grad)
>         print('we have completed sampler')

my tokenizer looked like this

BOS,EOS,FLD,UNK,PAD = 'xxbos','xxeos','xxfld','xxunk','xxpad'
TK_MAJ,TK_UP,TK_REP,TK_WREP = 'xxmaj','xxup','xxrep','xxwrep'

defaults.text_spec_tok = [PAD]

class MolTokenizer(BaseTokenizer):
    def __init__(self, lang):
        self.encode_dict = {"Br": 'Y', "Cl": 'X', "Si": 'A', 'Se': 'Z', '@@': 'R', 'se': 'E'}
        pass
    def tokenizer(self, smiles):
        temp_smiles = smiles
        for symbol, token in self.encode_dict.items():
            temp_smiles = temp_smiles.replace(symbol, token)
        tokens = list(temp_smiles)
        tokens = ['GO'] + tokens 
        return tokens    
    
    def add_special_cases(self, toks):
        pass

Another thing to look out for, which I have just come across, is the padding_idx. When I created a custom vocabulary, my padding_idx was set to zero. But if you look at all of the language model functions and the awd_lstm_lm_config dict, fastai sets it to 1 by default when constructing the models! It is unfortunate that the padding_idx is hard-coded like this rather than read from the vocab that is present in the databunch when you create the learner. When you construct your learner (shown here for the classification stage; the same applies to the language model), I believe you should always do this:

config = awd_lstm_clas_config.copy()
config['pad_token'] = data.train_ds.x.vocab.stoi['xxpad']
learner = text_classifier_learner(data, AWD_LSTM, drop_mult=drops, wd=wd, pretrained=False, config=config)
learner.loss_func.ignore_index = learner.data.train_ds.x.vocab.stoi['xxpad']
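
For the language-model stage itself, the analogous setup would presumably be (variable names illustrative):

lm_config = awd_lstm_lm_config.copy()
lm_config['pad_token'] = data_lm.train_ds.x.vocab.stoi['xxpad']
lm_learner = language_model_learner(data_lm, AWD_LSTM, pretrained=False, config=lm_config)
lm_learner.loss_func.ignore_index = data_lm.train_ds.x.vocab.stoi['xxpad']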

Have others come across or noticed this padding_idx issue? I am not fully sure what the repercussions are for the initial language model, but when transfer learning to create a regressor, if the padding_idx is not set correctly, the model will not account for padding correctly when creating the mask.

1 Like

Thank you :slight_smile:
I’m not using a custom sampler right now but I will check yours. It could be the reason why my % of valid SMILES is so low.

Btw, padding didn’t seem to affect my training much. With or without padding my results were the same.

Hi @marcossantana,

Yes, I am now finding that as well. It appears that fastai doesn’t use padding when creating the batches for the initial language model training phase. For some reason, the padding embedding still changes during training, though. I just posted a question about that:

Where handling padding well really matters is during the classification/regression phase, as the model needs to mask the padding token when featurizing the text. This happens in the masked_concat_pool function. I was getting really bad metrics when trying to use my pre-trained LSTM to regress pIC50s. I am currently trying to figure out if there was some issue with the padding token.
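
To illustrate why the pad index matters there, here is a rough sketch of the masking idea (this is not fastai’s actual masked_concat_pool, just the principle):

import torch

def masked_mean_pool(outputs, input_ids, pad_idx):
    "Mean-pool LSTM outputs over real tokens only."
    # outputs: [bs, seq_len, n_hid], input_ids: [bs, seq_len]
    # If pad_idx points at the wrong vocab entry, padding positions get averaged in
    # as if they were real tokens, which corrupts the pooled features.
    mask = (input_ids != pad_idx).float().unsqueeze(-1)   # 1 for real tokens, 0 for padding
    return (outputs * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.)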

1 Like

Hmmm maybe pad_token = 1 means that when ‘xxpad’ is numericalized it gets the number 1.
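
One quick way to check, assuming a fastai v1 databunch named data:

print(data.train_ds.x.vocab.itos[1])          # which token actually sits at index 1
print(data.train_ds.x.vocab.stoi['xxpad'])    # which index 'xxpad' was given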

My answer after training a model using fastai: YES.

I trained a model on a huge collection of molecules (~1 million) and then fine-tuned it with a very small dataset (~400). It was able to generate 99% novel molecules and scaffolds.
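
For reference, a rough way to compute that kind of novelty with RDKit might look like this (function names are illustrative; it assumes the training SMILES are all valid):

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def novelty(generated_smiles, training_smiles):
    "Fraction of valid generated molecules whose canonical SMILES is not in the training set."
    train_canon = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    gen_mols = [Chem.MolFromSmiles(s) for s in generated_smiles]
    gen_canon = [Chem.MolToSmiles(m) for m in gen_mols if m is not None]
    novel = [s for s in gen_canon if s not in train_canon]
    return len(novel) / max(len(gen_canon), 1)

def scaffold(smiles):
    "Bemis-Murcko scaffold as SMILES, for the analogous scaffold-novelty check."
    return MurckoScaffold.MurckoScaffoldSmiles(smiles)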

1 Like

Hi @marcossantana, what did your final tokenizer look like? How do you handle padding? Do you pre-pad your data before training your language model?

I’m using this tokenizer now to work with protein data, but I’m not sure if it’s the best:

defaults.text_pre_rules = []
defaults.text_post_rules = []

class LetterTokenizer(BaseTokenizer):
    "Character level tokenizer function."
    def __init__(self, lang): pass
    def tokenizer(self, t:str) -> List[str]:
        out = [BOS]
        i = 0
        while i < len(t):
            out.append(t[i])
            i += 1
        return out
            
    def add_special_cases(self, toks:Collection[str]): pass

char_tokenize_processor = TokenizeProcessor(tokenizer=Tokenizer(tok_func=LetterTokenizer), include_bos=False)

data_lm = (TextList
              .from_csv(path='../',csv_name='100k.csv', 
                           processor=[OpenFileProcessor(), char_tokenize_processor, NumericalizeProcessor()])
              .split_by_rand_pct(0.01)
              .label_for_lm()
              .databunch(bs=256,bptt=128))
idx text
0 I I S T V xxbos M E R L K K E E E E K L K E V E A E E E E E E E E E E E E E E E E I P L Q R N V R R T G E G E S S G T A E E E K L E K M V S
1 F D E K L I I Y G R D V Y L K D G R L I F E S I A D A D H I R S S I V N D G T K Q P I L E E L K G F T S S K S A F M T A T K E L S E A A V F
2 S D A E A K L K N G V H C L E V xxbos M L D G R R S K E N W G V K H N P T T D A I F I L A E T G T C M I H I A C H F D G E K L K L Q L T K V
3 E R T G R P L G A D N F I A W L E N V L G R M L H K Q K S A I S F G V A G A A C R G R xxbos M G P G S R V H T V Q T L V V G G G V V G L S A
4 P K V K R P E A G V T G R E M L L W D F K D M N Q E G L E N I W A A L D D V V M G G V S L S N I K L A E H G A T F S G E T S S R N S G G F C

Right now I don’t have any padding in my input.

I used pretty much standard stuff. I added xxbos and xxeos tokens but no special padding.
My tokenizer looks like this. Pretty simple compared to my previous version…

My problem was the sampling: fastai keeps adding tokens until the sequence reaches max_size.

Ah
Beware when sampling! If you just use the predict method you might get horrible results. That’s because it won’t stop adding tokens until the sentence reaches a predefined size. Try to modify it to include a stop token.
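
A minimal sketch of such a stop-aware sampling loop, assuming a fastai v1 language-model learner named learn and the single-character start/end tokens (‘G’/‘E’) used earlier in the thread (this is not the built-in predict method):

import torch
import torch.nn.functional as F

def sample_with_stop(learn, start_tok='G', end_tok='E', max_len=150):
    "Multinomial sampling that stops as soon as the end token is produced."
    vocab = learn.data.train_ds.x.vocab
    learn.model.reset(); learn.model.eval()
    xb = torch.tensor([[vocab.stoi[start_tok]]]).to(learn.data.device)
    toks = []
    with torch.no_grad():
        for _ in range(max_len):
            logits = learn.model(xb)[0][:, -1]                     # logits for the next token
            idx = torch.multinomial(F.softmax(logits, dim=-1), 1)
            tok = vocab.itos[idx.item()]
            if tok == end_tok: break                               # stop instead of padding out to max_len
            toks.append(tok)
            xb = idx
    learn.model.train()
    return ''.join(toks)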

Hi @cdparks
Did you manage to improve your results?