Create Language Model for Chemical Structures

Hi @Pepper1709
I changed the code a bit, but I’m still using start (G), end (E) and padding (A) tokens. In summary, I’m trying to implement the approach described here.
When I used xxbos and xxeos my results were not as good. My current approach is:

  1. Create a very small vocab, only with common elements in drug-like molecules (e.g., C, H, N, O and halogens)

  2. Tokenize each molecule character by character; every atom is a token (two-character atoms like Cl and Br stay as single tokens)

  3. Pad the tokens with A’s to match the size of the largest SMILES

  4. Add start and end tokens

The model can create structures like the ones I showed you, but many are still not valid. It’s just like what Jeremy showed us in Lesson 4. It’s a start and needs optimization.

Here’s the current version of my databunch:

idx text
0 ) c ( N c 3 n c c c c 3 C ( = O ) O )
1 1 c ( N C c 3 c ( C ) c c c c 3 Cl ) n
2 A A A A A A A A A A A A A A A A A A A A
3 ( C O ) C ( C C C 1 2 ) C 3 E A A A A A
4 G o s C C O C ( = O ) c 1 n n ( C ( = O
5 A A A A A A A A A A A A A A A A A A A A
6 1 E A A A A A A A A A A A A A A A A A
7 ( Cl ) c c 2 ) N 1 C C N c 1 c c n
8 1 c c c ( N C ( = O ) N C 2 N = C (
9 1 E A A A A A A A A A A A A A A A A A A

And the tokenizer:

#https://gist.github.com/EdwardJRoss/86b31848a7951411de56f10f55e9de4e
from fastai.text import *   # BaseTokenizer, List, Collection come from here

class MolTokenizer(BaseTokenizer):
    "Character level tokenizer function."
    def __init__(self, lang:str='no_lang'):
        super().__init__(lang=lang)

        atoms = ['Br', 'C', 'Cl', 'F', 'H', 'I', 'N', 'O', 'P', 'S']

        special = ['(', ')', '[', ']', '=', '#', '%', '0',
                   '1', '2', '3', '4', '5', '6', '7', '8', '9',
                   '+', '-', 'c', 'n', 'o', 's']
        padding = ['G', 'A', 'E']  # start, padding and end tokens

        # multi-character atoms go first so 'Cl' and 'Br' are matched before 'C'
        self.table = sorted(atoms, key=len, reverse=True) + special + padding

        self.double_chars = list(filter(lambda x: len(x) == 2, self.table))
        self.single_chars = list(filter(lambda x: len(x) == 1, self.table))

    def tokenizer(self, t:str) -> List[str]:
        out = []
        i = 0
        while i < len(t):
            char1 = t[i]
            char2 = t[i:i+2]

            # try two-character tokens (Br, Cl) before single characters
            if char2 in self.double_chars:
                out.append(char2)
                i += 2
                continue

            if char1 in self.single_chars:
                out.append(char1)
                i += 1
                continue
            i += 1
        # 75 = length of the longest SMILES in my set. Hard-coded because I was in a hurry.
        return ['G'] + out + ['E'] + ['A' for _ in range(75 - len(out))]

    def add_special_cases(self, toks:Collection[str]):
        pass
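
To sanity-check it, I just call the tokenizer directly on a SMILES and plug it into fastai’s Tokenizer/TokenizeProcessor when building the databunch (a rough sketch; mol_processor is just a placeholder name, and I pass empty pre/post rules so the default text rules don’t touch the SMILES):

tok = MolTokenizer()
print(tok.tokenizer('CC(=O)Nc1ccc(O)cc1'))   # paracetamol
# ['G', 'C', 'C', '(', '=', 'O', ')', 'N', 'c', '1', 'c', 'c', 'c', '(', 'O', ')', 'c', 'c', '1', 'E', 'A', 'A', ...]  (padded with A's out to the hard-coded length)

mol_processor = TokenizeProcessor(tokenizer=Tokenizer(tok_func=MolTokenizer, pre_rules=[], post_rules=[]),
                                  include_bos=False)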

Hey @cdparks,
thank you for the tip!
Can you share the code of the custom sampler with us? I see you used two-character tokens for start and end. I want to check whether the same thing happens with single-character start and end tokens.

Sure @marcossantana, the code was in the link I posted, but here it is as a callback without using an END token.

> class SampleSMILES(LearnerCallback):
>     def __init__(self, learn:Learner, path, vocab, debug, num_sample):
>         super().__init__(learn)
>         print('we have created smiles sampler callback')
>         self.path, self.vocab, self.debug = path, vocab,debug
>         self.encode_dict = MolTokenizer(lang='en').encode_dict
>         self.max_seq_length = 150
>         self.batch_size = 1024
>         self.go_int = self.vocab.stoi['GO']
>         self.num_sample = num_sample
>     def confirm_vocab(self, epoch):
>         if( self.learn.data.train_ds.x.vocab != self.vocab):
>             print('non equal vocabs in sample smiles on epoch:', epoch )
>         else:
>             print('we have passed vocab check in sample smiles on epoch:', epoch )
>         print('print vocab for epoch end:', epoch, self.vocab.stoi)
>     def log_sampler_results(self, smiles, batch_sample, epoch):
>         #----log number of valid compounds made on this epoch
>         valid = 0
>         for smi in smiles:
>             mol = Chem.MolFromSmiles(smi)
>             if( smi != '' and mol is not None and mol.GetNumAtoms() > 0 ):
>                 valid += 1
>         return valid
>     def decode_smi(self, smiles ):
>         #---replace encoded tokens with chemicals
>         temp_smiles = smiles
>         for symbol, token in self.encode_dict.items():
>             temp_smiles = temp_smiles.replace(token,symbol)
>         return temp_smiles
>     def action_to_smiles(self, array, epoch):
>         #---convert action tensor to smiles
>         smiles_strings = []
>         for row in array:
>             predicted_chars = []
>             for j in row:
>                 next_char = self.vocab.itos[j.item()]
>                 if next_char == 'GO':
>                     break
>                 predicted_chars.append(next_char)
>             smi = ''.join(predicted_chars)
>             smi = self.decode_smi(smi)
>             smiles_strings.append(smi)
>         return smiles_strings
>     def sampler(self,  epoch, current_batch_size):
>         #---sample batch of compounds at end of epoch
>         seqs_gen = ['']*self.batch_size
>         go_int = self.learn.data.train_ds.x.vocab.stoi['GO']
>         xb = np.array( [go_int]*self.batch_size )
>         xb = torch.from_numpy(xb).to(device='cuda').unsqueeze(1)
>         actions = torch.zeros((self.batch_size, self.max_seq_length), dtype=torch.long).to(device='cuda')
>         self.learn.model.eval()
>         self.learn.model.reset()
>         with torch.no_grad():
>             for i in range(0, self.max_seq_length):
>                 output = self.learn.model(xb)[0].squeeze()
>                 output_probs = F.softmax(output, dim=-1)
>                 output_probs[:, self.learn.data.train_ds.x.vocab.stoi[UNK]] = 0
>                 action = torch.multinomial(output_probs,num_samples=1)
>                 xb = action
>                 actions[:,i] = action.squeeze()
>                 if( torch.sum(action) == 0):
>                     break
>         smiles = self.action_to_smiles(actions, epoch)
>         return smiles
>     def reset_model(self):
>         self.learn.model.reset()
>         self.learn.model.train()
>     def run(self, num, epoch):
>         """
>         Samples the model for the given number of SMILES.
>         :params num: Number of SMILES to sample.
>         """
>         num_batches = math.ceil(num / self.batch_size)
>         molecules_left = num
>         smiles = []
>         for _ in range(num_batches):
>             current_batch_size = min(molecules_left, self.batch_size)
>             smiles += self.sampler(epoch, current_batch_size)
>             molecules_left -= current_batch_size
>         valid = self.log_sampler_results( smiles , num,  epoch)
>         print( valid, num )
>         #print('check gradients:', self.learn.model[0].encoder.weight.grad)
>         self.reset_model()
>         return valid
>     def on_epoch_end(self, **kwargs):
>         #===unpack kwargs
>         epoch = kwargs['epoch']
>         print('beginning sample:', self.max_seq_length, epoch)
>         self.run(self.num_sample, epoch)
>         self.confirm_vocab(epoch)
>         print('mode of model:', self.learn.model.training)
>         print('check gradients:', self.learn.model[0].encoder.weight.grad)
>         print('we have completed sampler')
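
If you want to try it, attaching it would look roughly like this (untested sketch; the arguments just follow the constructor above, and data is your own databunch):

learn = language_model_learner(data, AWD_LSTM, pretrained=False)
learn.callbacks.append(SampleSMILES(learn, path='.', vocab=data.train_ds.x.vocab,
                                    debug=False, num_sample=1024))
learn.fit_one_cycle(1, 1e-3)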

My tokenizer looked like this:

BOS,EOS,FLD,UNK,PAD = 'xxbos','xxeos','xxfld','xxunk','xxpad'
TK_MAJ,TK_UP,TK_REP,TK_WREP = 'xxmaj','xxup','xxrep','xxwrep'

defaults.text_spec_tok = [PAD]

class MolTokenizer(BaseTokenizer):
    def __init__(self, lang):
        self.encode_dict = {"Br": 'Y', "Cl": 'X', "Si": 'A', 'Se': 'Z', '@@': 'R', 'se': 'E'}
    def tokenizer(self, smiles):
        temp_smiles = smiles
        for symbol, token in self.encode_dict.items():
            temp_smiles = temp_smiles.replace(symbol, token)
        tokens = list(temp_smiles)
        tokens = ['GO'] + tokens 
        return tokens    
    
    def add_special_cases(self, toks):
        pass
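
For reference, the two-character atoms get squashed to single placeholder characters before splitting into a list, and decode_smi in the callback above reverses that. For example:

tok = MolTokenizer(lang='en')
print(tok.tokenizer('Clc1ccccc1Br'))
# ['GO', 'X', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Y']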

Another thing to look out for, which I have just come across, is the padding_idx. When I created a custom vocabulary, my padding_idx was set to zero. But if you look at all of the language model functions, and at the awd_lstm_lm_config dict, fastai sets it to 1 by default when constructing the models! It is unfortunate that the padding_idx is hard-coded like this and not read from the vocab that is present in the databunch when you create the learner. When you construct your learner, I believe you should always do something like this (shown here for the classifier; the same idea applies to awd_lstm_lm_config and language_model_learner):

config = awd_lstm_clas_config.copy()
config['pad_token'] = data.train_ds.x.vocab.stoi['xxpad']
learner = text_classifier_learner(data, AWD_LSTM, drop_mult=drops, wd=wd, pretrained=False, config=config)
learner.loss_func.ignore_index = learner.data.train_ds.x.vocab.stoi['xxpad']

Have others come across or noticed this padding_idx issue? I am not fully sure what the repercussions are for the initial language model, but when transfer learning to create a regressor, if the padding_idx is not set correctly the model will not account for the padding when creating the mask.
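
A quick way to check whether you are affected (a sketch; data is your databunch and config is the dict from the snippet above):

pad_idx = data.train_ds.x.vocab.stoi['xxpad']
print(pad_idx)               # often 0 with a custom vocab, while fastai assumes 1
print(config['pad_token'])   # should match pad_idx after the fix above
# for a language model learner the embedding can be checked directly:
# learner.model[0].encoder.padding_idx should equal pad_idx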


Thank you 🙂
I’m not using a custom sampler right now, but I will check yours. It could be the reason why my percentage of valid SMILES is so low.

Btw, padding didn’t seem to affect much of my training. With or without padding my results were the same.

Hi @marcossantana,

Yes, I am now finding that as well. It appears that fastai doesn’t use padding when creating the batches for the initial language model training phase. For some reason, the padding embedding still changes during training, though. I just posted a question about that.

Where handling padding well really matters is during the classification/regression phase, as the model needs to mask the padding token when featurizing the text. This happens in the masked_concat_pool function. I was getting really bad metrics when trying to use my pre-trained LSTM to regress pIC50s. I am currently trying to figure out if there was some issue with the padding token.
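
To make it concrete why that matters, here is a generic sketch of masked mean pooling (not fastai’s actual masked_concat_pool, just the idea): positions equal to the padding index are excluded from the average, so a wrong pad_idx lets padding activations leak into the features.

import torch

def masked_mean_pool(outputs, input_ids, pad_idx):
    # outputs: (batch, seq_len, hidden) encoder outputs; input_ids: (batch, seq_len) token ids
    mask = (input_ids != pad_idx).unsqueeze(-1).float()  # 1 for real tokens, 0 for padding
    summed = (outputs * mask).sum(dim=1)                 # zero out padded positions before summing
    counts = mask.sum(dim=1).clamp(min=1)                # number of real tokens per sequence
    return summed / counts                               # average over real tokens only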


Hmmm, maybe pad_token = 1 means that when ‘xxpad’ is numericalized it gets the number 1.

My answer after training a model using fastai: YES.

I trained a model on a huge collection of molecules (~1 million) and then fine-tuned it with a very small dataset (~400 molecules). It was able to generate 99% novel molecules and scaffolds.


Hi @marcossantana, what did your final tokenizer look like? How do you handle padding? Do you pre-pad your data before training your language model?

I’m using this tokenizer now to work with protein data, but I’m not sure if it’s the best:

defaults.text_pre_rules = []
defaults.text_post_rules = []

class LetterTokenizer(BaseTokenizer):
    "Character level tokenizer function."
    def __init__(self, lang): pass
    def tokenizer(self, t:str) -> List[str]:
        # prepend the BOS token and split the sequence into single characters
        return [BOS] + list(t)

    def add_special_cases(self, toks:Collection[str]): pass

char_tokenize_processor = TokenizeProcessor(tokenizer=Tokenizer(tok_func=LetterTokenizer), include_bos=False)

data_lm = (TextList
              .from_csv(path='../',csv_name='100k.csv', 
                           processor=[OpenFileProcessor(), char_tokenize_processor, NumericalizeProcessor()])
              .split_by_rand_pct(0.01)
              .label_for_lm()
              .databunch(bs=256,bptt=128))
idx text
0 I I S T V xxbos M E R L K K E E E E K L K E V E A E E E E E E E E E E E E E E E E I P L Q R N V R R T G E G E S S G T A E E E K L E K M V S
1 F D E K L I I Y G R D V Y L K D G R L I F E S I A D A D H I R S S I V N D G T K Q P I L E E L K G F T S S K S A F M T A T K E L S E A A V F
2 S D A E A K L K N G V H C L E V xxbos M L D G R R S K E N W G V K H N P T T D A I F I L A E T G T C M I H I A C H F D G E K L K L Q L T K V
3 E R T G R P L G A D N F I A W L E N V L G R M L H K Q K S A I S F G V A G A A C R G R xxbos M G P G S R V H T V Q T L V V G G G V V G L S A
4 P K V K R P E A G V T G R E M L L W D F K D M N Q E G L E N I W A A L D D V V M G G V S L S N I K L A E H G A T F S G E T S S R N S G G F C

Right now I don’t have any padding in my input.

I used pretty much standard stuff. I added xxbos and xxeos tokens but no special padding.
My tokenizer looks like this. Pretty simple compared to my previous version…

My problem was the sampling. Fastai keeps adding tokens until it reaches max_size.

Ah
Beware when sampling! If you just use the predict method you might get horrible results. That’s because it won’t stop adding tokens until the sentence reaches a predefined size. Try to modify it to include a stop token.
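
Something like this is what I mean (a rough sketch assuming a fastai v1 language-model learner; ‘G’ and ‘E’ are my start/end tokens, adjust them to your own vocab; torch and F come in with from fastai.text import *):

def sample_one(learn, max_len=100):
    vocab = learn.data.train_ds.x.vocab
    end_idx = vocab.stoi['E']                       # stop token
    xb = torch.tensor([[vocab.stoi['G']]]).cuda()   # start token, shape (1, 1)
    learn.model.reset(); learn.model.eval()
    tokens = []
    with torch.no_grad():
        for _ in range(max_len):
            probs = F.softmax(learn.model(xb)[0][:, -1], dim=-1)  # distribution over the next token
            idx = torch.multinomial(probs, 1)
            if idx.item() == end_idx: break         # stop as soon as the end token is sampled
            tokens.append(vocab.itos[idx.item()])
            xb = idx                                # feed the sampled token back in
    return ''.join(tokens)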

Hi @cdparks
Did you manage to improve your results?