Create Language Model for Chemical Structures

How do we improve an existing drug using deep learning? How can we redesign a drug so that it interferes with the interaction between two proteins? How can we use deep learning to design peptides that disrupt protein/protein interactions? Can you please point me to some good deep learning papers in this space.

… completely agree; I was looking into molecular graphs a while ago.

Can anyone in the group help me with this?

I adopted the ULMFiT method and applied it to chemical property prediction tasks. I pre-trained a ‘language model’ on 1 million molecules (SMILES) from ChEMBL and then fine-tuned the pre-trained model on other chemical property prediction tasks.

Inductive Transfer Learning for Molecular Activity Prediction: Next-Gen QSAR Models with MolPMoFiT
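For anyone who wants to try something similar, a minimal sketch of that two-stage pipeline in fastai v1 might look like this. The file names, column names and hyperparameters are illustrative rather than the actual MolPMoFiT code, and MolTokenizer stands for a character-level SMILES tokenizer like the ones shown later in this thread:

from fastai.text import *

# Stage 1: pre-train a SMILES 'language model' on the large unlabeled set
tok = Tokenizer(tok_func=MolTokenizer, pre_rules=[], post_rules=[])
data_lm = TextLMDataBunch.from_csv('.', 'chembl_1M.csv', text_cols='smiles',
                                   tokenizer=tok, min_freq=1)
lm_learner = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.3)
lm_learner.fit_one_cycle(10, 3e-3)
lm_learner.save_encoder('smiles_encoder')

# Stage 2: fine-tune the pre-trained encoder on a labeled property-prediction set
data_clas = TextClasDataBunch.from_csv('.', 'property_task.csv', text_cols='smiles',
                                       label_cols='label', vocab=data_lm.train_ds.vocab,
                                       tokenizer=tok)
clas_learner = text_classifier_learner(data_clas, AWD_LSTM, pretrained=False, drop_mult=0.5)
clas_learner.load_encoder('smiles_encoder')
clas_learner.fit_one_cycle(5, 2e-3)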

4 Likes

@jeremy, I am working extensively on this as a part of my post doc. If you would like any information, I would be happy to share.

@aparente, do you have any example code implementing graph convs for chemical structures in fastai I could take a look at?

@Xinhao, I noticed that in your work you enumerated the SMILES before training and merged all the files into one training set. It is more customary to apply data augmentation in an online setting, so I am currently working on implementing SMILES enumeration and sampling as custom callbacks. Here is my current code:

# Imports assumed by these callbacks (MolTokenizer is the custom SMILES tokenizer shown later in this thread)
import random
import numpy as np
import torch
import torch.nn.functional as F
from rdkit import Chem
from fastai.text import *

class SampleSMILES(LearnerCallback):
    def __init__(self, learn:Learner, path, vocab, debug):
        super().__init__(learn)
        self.path, self.vocab, self.debug = path, vocab,debug
        self.encode_dict = MolTokenizer(lang='en').encode_dict
        self.max_seq_length = 150
    def confirm_vocab(self, epoch):
        if( self.learn.data.train_ds.x.vocab != self.vocab):
            print('non equal vocabs in sample smiles on epoch:', epoch )
        else:
            print('we have passed vocab check in sample smiles on epoch:', epoch )
        print('print vocab for epoch end:', epoch, self.vocab.stoi)
    def log_sampler_results(self, smiles, batch_sample, epoch):
        #----log number of valid compounds made on this epoch
        valid = 0
        for smi in smiles:
            mol = Chem.MolFromSmiles(smi)
            if( smi != '' and mol is not None and mol.GetNumAtoms() > 0 ):
                valid+=1
        f1 = open(self.path + 'valid_smiles.txt','a')
        if( epoch == 0):
            f1.write('number of valid smiles, batch sample size, max seq length, epoch' + '\n')
        f1.write( str(valid) + ',' + str(batch_sample) + ',' + str(self.max_seq_length) + ',' + str( epoch ) + '\n')
        f1.close()
        if( self.debug==True):
            print('number of valid compounds:', valid)
    def decode_smi(self, smiles ):
        #---replace encoded tokens with chemicals
        temp_smiles = smiles
        for symbol, token in self.encode_dict.items():
            temp_smiles = temp_smiles.replace(token,symbol)
        return temp_smiles
    def action_to_smiles(self, array, epoch):
        #---convert action tensor to smiles
        smiles_strings = []
        for row in array:
            predicted_chars = []
            for j in row:
                next_char = self.vocab.itos[j.item()]
                if next_char == 'END':
                    break
                predicted_chars.append(next_char)
            smi = ''.join(predicted_chars)
            smi = self.decode_smi(smi)
            smiles_strings.append(smi)
        if( self.debug == True):
            print('we are now writing to file')
            f1 = open(self.path + 'a2s_'+str(epoch) + '.txt','w')
            for smi in smiles_strings:
                if( ' ' in smi):
                    print('we have a space in smi:', smi)
                f1.write( smi.strip() + '\n')
            f1.close()
        return smiles_strings
    def sampler(self,  epoch):
        #---sample batch of compounds at end of epoch
        self.learn.model.eval()
        with torch.no_grad():
            batch_sample=1024
            seqs_gen = ['']*batch_sample
            go_int = self.learn.data.train_ds.x.vocab.stoi['GO']
            xb = torch.from_numpy( np.array( [go_int]*batch_sample ) ).to(device='cuda').unsqueeze(1)
            yb = torch.from_numpy( np.array( [go_int]*batch_sample ) ).to(device='cuda').unsqueeze(1)
            actions = torch.zeros((batch_sample, self.max_seq_length), dtype=torch.long).to(device='cuda')
            for i in range(0, self.max_seq_length):
                output = self.learn.model(xb)[0].squeeze()
                output_probs = F.softmax(output, dim=1)
                m = torch.distributions.Multinomial(probs=output_probs)
                action = m.sample()
                idx = action.nonzero()[:,1]
                actions[:,i] = idx
                xb = idx.unsqueeze(1)
        smiles = self.action_to_smiles(actions, epoch)
        self.log_sampler_results( smiles , batch_sample,  epoch)
        self.learn.model.train()
    def export_learner(self, epoch):
        self.learn.export('learner_'+str(epoch) + '.pkl')
    def on_epoch_end(self, **kwargs):
        #===unpack kwargs
        epoch = kwargs['epoch']
        print('beginning sample:', self.max_seq_length, epoch)
        if( self.debug == True):
            self.confirm_vocab(epoch)
            self.export_learner(epoch)
        self.sampler(epoch)
        if(self.debug == True):
            print('mode of model:', self.learn.model.training)
            print('we have completed sampler')

class ShuffleSMILES(LearnerCallback):
    def __init__(self, learn:Learner, path, bs, vocab, tok, train_data, valid_data, debug):
        super().__init__(learn)
        self.path, self.bs, self.vocab, self.tok = path, bs, vocab, tok
        self.train_data, self.valid_data = train_data, valid_data
        self.mol_tok = MolTokenizer(lang='en')
        self.debug = debug
        print('init shuffle smiles:', self.path, self.bs, self.path)
        print('init shuffle smiles vocab:')
        print( self.vocab.itos)
    def confirm_vocab(self, epoch):
        if( self.learn.data.train_ds.x.vocab != self.vocab):
            print('non equal vocabs in shuffle smiles on epoch:', epoch )
        else:
            print('we have passed vocab check in shuffle smiles on epoch:', epoch )
        print('print vocab for epoch:', epoch, self.vocab.stoi)
    def check_vocab(self, smiles):
        tokens = self.tok.process_text(smiles, self.mol_tok)
        for tt in tokens:
            if( tt not in self.vocab.stoi.keys() ):
                return 0
        return 1
    def shuffle_pd(self, df, epoch):
        smiles = list( df.smiles )
        random_smiles = []
        for smi in smiles:
            try:
                mol = Chem.MolFromSmiles(smi)
                new_atom_order = list(range(mol.GetNumAtoms()))
                random.shuffle(new_atom_order)
                random_mol    = Chem.RenumberAtoms(mol, newOrder=new_atom_order)
                random_smi    = Chem.MolToSmiles(random_mol, canonical=False, isomericSmiles=False)
                in_vocab = self.check_vocab(random_smi)
                if( in_vocab == 1):
                    random_smiles.append(random_smi)
                else:
                    random_smiles.append(smi)
                    if( self.debug == True):
                        print('we have generated a compound with a new token:', smi, epoch)
            except:
                if( self.debug == True):
                    print('failed for:', smi)
                random_smiles.append(smi)
        df['random_smiles'] = random_smiles
        return df
    def on_epoch_begin(self, **kwargs):
        #===get state
        epoch = kwargs['epoch']
        print('beginning shuffle on :', epoch)
        #===shuffle and prep data
        self.train_data = self.shuffle_pd(self.train_data, epoch)
        self.valid_data = self.shuffle_pd(self.valid_data, epoch)
        self.train_data['r_len'] = self.train_data.random_smiles.apply(lambda x: len(x) )
        self.valid_data['r_len'] = self.valid_data.random_smiles.apply(lambda x: len(x) )
        print('max length on epoch:', self.train_data.r_len.max(), epoch)
        #===dump our new data frames to files
        if(self.debug == True):
            print('saving data for later analysis')
            self.train_data.to_csv(self.path + 'train_'+str(epoch) + '.csv.gzip',index=None,compression='gzip')
            self.valid_data.to_csv(self.path + 'valid_'+str(epoch) + '.csv.gzip',index=None,compression='gzip')
        #====prepare databunch
        self.newData = TextLMDataBunch.from_df(self.path, self.train_data, self.valid_data, text_cols='random_smiles', bs=self.bs, tokenizer=self.tok, vocab=self.vocab, min_freq=1, include_bos=False, include_eos=False)
        self.learn.data.train_dl.dl, self.learn.data.valid_dl.dl = self.newData.train_dl.dl, self.newData.valid_dl.dl
        if( self.debug == True):
            self.confirm_vocab(epoch)
            self.newData.save( 'newData_' + str(epoch)+'.pkl')
            torch.save( self.learn.data.train_ds.x.vocab.stoi, self.path + '/' + 'vocab_' + str(epoch) + '.pkl')
        print('we have completed on epoch begin')

It is currently giving me bizarre training behavior, however. The validation loss is continually decreasing, but the number of valid molecules sampled after each epoch of training climbs to roughly 90% by about epoch 13 and then suddenly starts getting worse each epoch. For example, here are my training stats (epoch, cross-entropy training loss, cross-entropy validation loss, accuracy, time):

we have completed on epoch begin
0         0.735757    0.734926    0.734893  58:55     
--
we have completed on epoch begin
1         0.692634    0.692826    0.748313  59:24     
--
we have completed on epoch begin
2         0.677761    0.676239    0.753382  1:00:02   
--
we have completed on epoch begin
3         0.673893    0.674327    0.753982  1:00:17   
--
we have completed on epoch begin
4         0.682018    0.680836    0.751632  1:00:18   
--
we have completed on epoch begin
5         0.690983    0.690529    0.748319  1:00:11   
--
we have completed on epoch begin
6         0.693963    0.692387    0.747694  1:01:02   
--
we have completed on epoch begin
7         0.687516    0.685323    0.749805  1:00:52   
--
we have completed on epoch begin
8         0.677629    0.676371    0.752961  1:00:31   
--
we have completed on epoch begin
9         0.670770    0.668978    0.755372  1:01:01   
--
we have completed on epoch begin
10        0.663913    0.662514    0.757452  1:01:11   
--
we have completed on epoch begin
11        0.657783    0.657987    0.759115  1:01:29   
--
we have completed on epoch begin
12        0.654865    0.652931    0.760745  1:01:13   
--
we have completed on epoch begin
13        0.649611    0.648693    0.761964  1:01:12   
--
we have completed on epoch begin
14        0.646384    0.644538    0.763589  1:01:36   
--
we have completed on epoch begin
15        0.642046    0.640353    0.764990  1:01:32   
--
we have completed on epoch begin
16        0.640199    0.638433    0.765466  1:01:08   
--
we have completed on epoch begin
17        0.637574    0.634706    0.766821  1:01:38   
--
we have completed on epoch begin
18        0.634607    0.633236    0.767360  1:01:24   
--
we have completed on epoch begin
19        0.632348    0.630206    0.768307  1:01:16   

and here are the number of valid smiles generated after each epoch:

number of valid smiles, batch sample size, max seq length, epoch
625,1024,150,0
766,1024,150,1
817,1024,150,2
686,1024,150,3
865,1024,150,4
862,1024,150,5
667,1024,150,6
825,1024,150,7
926,1024,150,8
877,1024,150,9
922,1024,150,10
926,1024,150,11
927,1024,150,12
918,1024,150,13
731,1024,150,14
407,1024,150,15
0,1024,150,16
1,1024,150,17
10,1024,150,18
339,1024,150,19

I am not sure what is going on. If anyone sees a bug in my code, please let me know.

I haven’t seen any molecular graph conv implementations in fastai and have also been interested in this. DeepChem is currently the best tool for GCNs that I know of, but it is being updated to TF2 and a lot of changes will come with that. If you find any implementations in fastai, please let me know!
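In the meantime, here is a rough sketch of the core message-passing idea in plain PyTorch with RDKit. This is not a fastai or DeepChem API; the featurisation and layer are deliberately minimal and only meant to illustrate the idea:

import torch
import torch.nn as nn
from rdkit import Chem

class SimpleGCNLayer(nn.Module):
    "One graph-convolution step: mean-aggregate neighbour features, then a linear transform."
    def __init__(self, n_in, n_out):
        super().__init__()
        self.lin = nn.Linear(n_in, n_out)
    def forward(self, x, adj):
        # x: [n_atoms, n_in] atom features, adj: [n_atoms, n_atoms] adjacency with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg))

def mol_to_tensors(smiles):
    "Very crude featurisation: one feature per atom (its atomic number)."
    mol = Chem.MolFromSmiles(smiles)
    adj = torch.tensor(Chem.GetAdjacencyMatrix(mol), dtype=torch.float)
    adj = adj + torch.eye(adj.shape[0])   # add self-loops
    x = torch.tensor([[a.GetAtomicNum()] for a in mol.GetAtoms()], dtype=torch.float)
    return x, adj

x, adj = mol_to_tensors('c1ccccc1O')      # phenol, 7 heavy atoms
layer = SimpleGCNLayer(1, 16)
print(layer(x, adj).shape)                # torch.Size([7, 16])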

1 Like

I managed to train a language model to create molecules from SMILES. It took me a whole day, because some things had to be customized. I believe @jeremy will mention them in Part 2, but I still haven’t had time to try.
To summarize my approach, there are a few things you can do:

  1. Create a custom Tokenizer to process the SMILES character by character. For instance, c1ccccc1 (i.e., benzene) should produce only two distinct tokens: c and 1. I took the idea from the LetterTokenizer in this notebook.

  2. Define a function to mark the start and end of each SMILES. For instance, !c1ccccc1N? indicates the chain starts with c (aromatic carbon) and ends with N (aliphatic nitrogen). You can pass this function to the Tokenizer as a pre rule. It also helps to define post rules to convert the predictions back into real SMILES (see the sketch after this list).

  3. Decide how to deal with multi-character tokens like nH and the halogens Br and Cl. Each of these should be treated as a single token.

  4. Datablock API is your friend :slight_smile:
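
To illustrate item 2, a minimal sketch of such rules in fastai v1 could look like this (the '!'/'?' markers and function names are only examples, not the exact code I used):

def mark_ends(t:str) -> str:
    "Pre rule (str -> str): wrap each raw SMILES with explicit start/end markers."
    return '!' + t + '?'

def remove_markers(toks:Collection[str]) -> Collection[str]:
    "Post rule (tokens -> tokens): drop the markers when converting back to real SMILES."
    return [t for t in toks if t not in ('!', '?')]

# hypothetical usage with a character-level tokenizer like the one shown further down:
# tok = Tokenizer(tok_func=MolTokenizer, pre_rules=[mark_ends], post_rules=[remove_markers])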

As a side note, don’t worry if the model isn’t returning many viable molecules, at least for now. You can always optimize it or use brute force to generate as many molecules as you wish.

Good luck!

Here’s my show_batch

data.show_batch()

idx text
0 = O ) C S c 2 n c ( N ) c ( N C ( = O ) c 3 c c c ( O C ) c c 3 ) c ( = O ) [ Nh ] 2 ) s 1 xxpad C O c 1 c c c ( C ( = O ) N c 2 c ( N ) n c ( S
1 O ) c 2 c c c c c 2 1 xxpad O = c 1 c ( C c 2 c c c ( F ) c c 2 ) c ( O ) c c n 1 C 1 C C C C C 1 xxpad C C C C c 1 c c c 2 c ( c 1 ) c ( O ) c ( C
2 ( = O ) C C 2 C ( = O ) N C C N 2 C c 2 c c c c c 2 C ) o 1 xxpad C O c 1 c c c ( N C 2 C C C N ( C ( = O ) C C c 3 c n [ Nh ] c 3 ) C 2 ) c c 1
3 n n c 3 c c c ( N C c 4 c c c o 4 ) n n 2 3 ) c c 1 xxpad O = C ( C C c 1 n n c 2 c c c ( N C c 3 c c c o 3 ) n n 1 2 ) N c 1 c c c ( F ) c c 1
4 1 c ( C ) c 2 c c c ( O C ( C ) C ( = O ) N 3 C C [ C @@ ] 4 ( O ) C C C C [ C @ H ] 4 C 3 ) c ( C ) c 2 o c 1 = O xxpad C O c 1 c c c 2 [ Nh ] c

And here are some generated molecules.

3 Likes

Hi Marcos,
I found your approach 2 (marking the start and end of the SMILES string) interesting. Currently I am using xxbos, but it does not really work: when generating SMILES strings, many of them contain xxbos somewhere in the middle.

In your show_batch I saw neither “!” nor “?”. Aren’t they added during tokenization? What is xxpad used for?

Could you please show me how you implemented the tokenizer pre-rule?

I would not use an end character! I just spent weeks debugging this error:

Hi @Pepper1709
I changed the code a bit but I’m still using start (G), end (E) and padding (A) tokens. In summary, I’m trying to implement the approach described here.
When I used xxbos and xxeos my results were not so good. My current approach is:

  1. Create a very small vocab, only with common elements in drug-like molecules (e.g., C, H, N, O and halogens)

  2. Tokenize each molecule letter-by-letter. Every atom is a token.

  3. Pad the tokens with A’s to match the length of the longest SMILES

  4. Add start and end tokens

The model can create structures like I showed you, but many are still not valid. It’s just like Jeremy showed us in Lesson 4. It’s a start and needs optimization.

Here’s the current version of my databunch:

idx text
0 ) c ( N c 3 n c c c c 3 C ( = O ) O )
1 1 c ( N C c 3 c ( C ) c c c c 3 Cl ) n
2 A A A A A A A A A A A A A A A A A A A A
3 ( C O ) C ( C C C 1 2 ) C 3 E A A A A A
4 G o s C C O C ( = O ) c 1 n n ( C ( = O
5 A A A A A A A A A A A A A A A A A A A A
6 1 E A A A A A A A A A A A A A A A A A
7 ( Cl ) c c 2 ) N 1 C C N c 1 c c n
8 1 c c c ( N C ( = O ) N C 2 N = C (
9 1 E A A A A A A A A A A A A A A A A A A

And the tokenizer:

#https://gist.github.com/EdwardJRoss/86b31848a7951411de56f10f55e9de4e
class MolTokenizer(BaseTokenizer):
    "Character level tokenizer function."
    def __init__(self, lang:str='no_lang'):
        super().__init__(lang=lang)

        atoms = ['Br', 'C', 'Cl', 'F', 'H', 'I', 'N', 'O', 'P', 'S']

        special = ['(', ')', '[', ']', '=', '#', '%', '0',
                   '1', '2', '3', '4', '5', '6', '7', '8', '9',
                   '+', '-', 'c', 'n', 'o', 's']
        padding = ['G', 'A', 'E']

        self.table = sorted(atoms, key=len, reverse=True) + special + padding

        # two-character tokens (Br, Cl) must be matched before single characters
        self.double_chars = list(filter(lambda x: len(x) == 2, self.table))
        self.single_chars = list(filter(lambda x: len(x) == 1, self.table))

    def tokenizer(self, t:str) -> List[str]:
        out = []
        i = 0
        while i < len(t):
            char1 = t[i]
            char2 = t[i:i+2]

            if char2 in self.double_chars:
                out.append(char2)
                i += 2
                continue

            if char1 in self.single_chars:
                out.append(char1)
                i += 1
                continue
            i += 1
        return ['G'] + out + ['E'] + ['A' for _ in range(75 - len(out))] # 75 = length of longest SMILES. Hard-coded because I was in a hurry.

    def add_special_cases(self, toks:Collection[str]):
        pass
1 Like

hey @cdparks
thank you for the tip!
Can you share the code of the custom sampler with us? I see you used multi-character start and end tokens (‘GO’, ‘END’); I want to check whether the same problem happens with single-character start and end tokens.

Sure @marcossantana, the code was in the link I posted, but here it is as a callback without using an END token.

> class SampleSMILES(LearnerCallback):
>     def __init__(self, learn:Learner, path, vocab, debug, num_sample):
>         super().__init__(learn)
>         print('we have created smiles sampler callback')
>         self.path, self.vocab, self.debug = path, vocab,debug
>         self.encode_dict = MolTokenizer(lang='en').encode_dict
>         self.max_seq_length = 150
>         self.batch_size = 1024
>         self.go_int = self.vocab.stoi['GO']
>         self.num_sample = num_sample
>     def confirm_vocab(self, epoch):
>         if( self.learn.data.train_ds.x.vocab != self.vocab):
>             print('non equal vocabs in sample smiles on epoch:', epoch )
>         else:
>             print('we have passed vocab check in sample smiles on epoch:', epoch )
>         print('print vocab for epoch end:', epoch, self.vocab.stoi)
>     def log_sampler_results(self, smiles, batch_sample, epoch):
>         #----log number of valid compounds made on this epoch
>         valid = 0
>         for smi in smiles:
>             mol = Chem.MolFromSmiles(smi)
>             if( smi != '' and mol is not None and mol.GetNumAtoms() > 0 ):
>                 valid+=1
>         return valid
>     def decode_smi(self, smiles ):
>         #---replace encoded tokens with chemicals
>         temp_smiles = smiles
>         for symbol, token in self.encode_dict.items():
>             temp_smiles = temp_smiles.replace(token,symbol)
>         return temp_smiles
>     def action_to_smiles(self, array, epoch):
>         #---convert action tensor to smiles
>         smiles_strings = []
>         for row in array:
>             predicted_chars = []
>             for j in row:
>                 next_char = self.vocab.itos[j.item()]
>                 if next_char == 'GO':
>                     break
>                 predicted_chars.append(next_char)
>             smi = ''.join(predicted_chars)
>             smi = self.decode_smi(smi)
>             smiles_strings.append(smi)
>         return smiles_strings
>     def sampler(self,  epoch, current_batch_size):
>         #---sample batch of compounds at end of epoch
>         seqs_gen = ['']*self.batch_size
>         go_int = self.go_int   # set from the vocab in __init__
>         xb = np.array( [go_int]*self.batch_size )
>         xb = torch.from_numpy(xb).to(device='cuda').unsqueeze(1)
>         actions = torch.zeros((self.batch_size, self.max_seq_length), dtype=torch.long).to(device='cuda')
>         self.learn.model.eval()
>         self.learn.model.reset()
>         with torch.no_grad():
>             for i in range(0, self.max_seq_length):
>                 output = self.learn.model(xb)[0].squeeze()
>                 output_probs = F.softmax(output, dim=-1)
>                 output_probs[:, self.learn.data.train_ds.x.vocab.stoi[UNK]] = 0
>                 action = torch.multinomial(output_probs,num_samples=1)
>                 xb = action
>                 actions[:,i] = action.squeeze()
>                 if( torch.sum(action) == 0):
>                     break
>         smiles = self.action_to_smiles(actions, epoch)
>         return smiles
>     def reset_model(self):
>         self.learn.model.reset()
>         self.learn.model.train()
>     def run(self, num, epoch):
>         """
>         Samples the model for the given number of SMILES.
>         :params num: Number of SMILES to sample.
>         """
>         num_batches = math.ceil(num / self.batch_size)
>         molecules_left = num
>         smiles = []
>         for _ in range(num_batches):
>             current_batch_size = min(molecules_left, self.batch_size)
>             smiles += self.sampler(epoch, current_batch_size)
>             molecules_left -= current_batch_size
>         valid = self.log_sampler_results( smiles , num,  epoch)
>         print( valid, num )
>         #print('check gradients:', self.learn.model[0].encoder.weight.grad)
>         self.reset_model()
>         return valid
>     def on_epoch_end(self, **kwargs):
>         #===unpack kwargs
>         epoch = kwargs['epoch']
>         print('beginning sample:', self.max_seq_length, epoch)
>         self.run(self.num_sample, epoch)
>         self.confirm_vocab(epoch)
>         print('mode of model:', self.learn.model.training)
>         print('check gradients:', self.learn.model[0].encoder.weight.grad)
>         print('we have completed sampler')

my tokenizer looked like this

BOS,EOS,FLD,UNK,PAD = 'xxbos','xxeos','xxfld','xxunk','xxpad'
TK_MAJ,TK_UP,TK_REP,TK_WREP = 'xxmaj','xxup','xxrep','xxwrep'

defaults.text_spec_tok = [PAD]

class MolTokenizer(BaseTokenizer):
    def __init__(self, lang):
        self.encode_dict = {"Br": 'Y', "Cl": 'X', "Si": 'A', 'Se': 'Z', '@@': 'R', 'se': 'E'}
        pass
    def tokenizer(self, smiles):
        temp_smiles = smiles
        for symbol, token in self.encode_dict.items():
            temp_smiles = temp_smiles.replace(symbol, token)
        tokens = list(temp_smiles)
        tokens = ['GO'] + tokens 
        return tokens    
    
    def add_special_cases(self, toks):
        pass

Another thing to look out for, which I have just come across, is the padding_idx. When I created a custom vocabulary, my padding_idx was set to zero. But if you look at all of the language model functions and the awd_lstm_lm_config dict, fastai sets it to 1 by default when constructing the models! It is unfortunate that the padding_idx is hard-coded like this rather than read from the vocab that is present in the databunch when you create the learner. When you construct your learner (shown here for the classification stage; the same applies to the language model), I believe you should always do this:

config = awd_lstm_clas_config.copy()
config['pad_token'] = data.train_ds.x.vocab.stoi['xxpad']
learner = text_classifier_learner(data, AWD_LSTM, drop_mult=drops, wd=wd, pretrained=False, config=config)
learner.loss_func.ignore_index = learner.data.train_ds.x.vocab.stoi['xxpad']
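
For the language-model stage itself, the analogous setup would presumably be (variable names illustrative):

lm_config = awd_lstm_lm_config.copy()
lm_config['pad_token'] = data_lm.train_ds.x.vocab.stoi['xxpad']
lm_learner = language_model_learner(data_lm, AWD_LSTM, pretrained=False, config=lm_config)
lm_learner.loss_func.ignore_index = data_lm.train_ds.x.vocab.stoi['xxpad']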

Have others come across or noticed this padding_idx issue? I am not fully sure what the repercussions are for the initial language model, but when transfer learning to create a regressor, if the padding_idx is not set correctly, the model will not account for padding correctly when creating the mask.

1 Like

Thank you :slight_smile:
I’m not using a custom sampler right now but I will check yours. It could be the reason why my % of valid SMILES is so low.

Btw, padding didn’t seem to affect my training much. With or without padding my results were the same.

Hi @marcossantana,

Yes, I am now finding that as well. It appears that fastai doesn’t use padding when creating the batches for the initial language model training phase. For some reason, the padding embedding still changes during training, though. I just posted a question about that:

Where handling padding well really matters is during the classification/regression phase, as the model needs to mask the padding token when featurizing the text. This happens in the masked_concat_pool function. I was getting really bad metrics when trying to use my pre-trained LSTM to regress pIC50s. I am currently trying to figure out if there was some issue with the padding token.
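
To illustrate why the pad index matters there, here is a rough sketch of the masking idea (this is not fastai’s actual masked_concat_pool, just the principle):

import torch

def masked_mean_pool(outputs, input_ids, pad_idx):
    "Mean-pool LSTM outputs over real tokens only."
    # outputs: [bs, seq_len, n_hid], input_ids: [bs, seq_len]
    # If pad_idx points at the wrong vocab entry, padding positions get averaged in
    # as if they were real tokens, which corrupts the pooled features.
    mask = (input_ids != pad_idx).float().unsqueeze(-1)   # 1 for real tokens, 0 for padding
    return (outputs * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.)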

1 Like

Hmmm maybe pad_token = 1 means that when ‘xxpad’ is numericalized it gets the number 1.
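
One quick way to check, assuming a fastai v1 databunch named data:

print(data.train_ds.x.vocab.itos[1])          # which token actually sits at index 1
print(data.train_ds.x.vocab.stoi['xxpad'])    # which index 'xxpad' was given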

My answer after training a model using fastai: YES.

I trained a model on a huge collection of molecules (~1 million) and then fine-tuned it with a very small dataset (~400). It was able to generate 99% novel molecules and scaffolds.
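
For reference, a rough way to compute that kind of novelty with RDKit might look like this (function names are illustrative; it assumes the training SMILES are all valid):

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def novelty(generated_smiles, training_smiles):
    "Fraction of valid generated molecules whose canonical SMILES is not in the training set."
    train_canon = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    gen_mols = [Chem.MolFromSmiles(s) for s in generated_smiles]
    gen_canon = [Chem.MolToSmiles(m) for m in gen_mols if m is not None]
    novel = [s for s in gen_canon if s not in train_canon]
    return len(novel) / max(len(gen_canon), 1)

def scaffold(smiles):
    "Bemis-Murcko scaffold as SMILES, for the analogous scaffold-novelty check."
    return MurckoScaffold.MurckoScaffoldSmiles(smiles)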

1 Like

Hi @marcossantana, what did your final tokenizer look like? How do you handle padding? Do you pre-pad your data before training your language model?

I’m using this tokenizer now to work with protein data, but I’m not sure if it’s the best:

defaults.text_pre_rules = []
defaults.text_post_rules = []

class LetterTokenizer(BaseTokenizer):
    "Character level tokenizer function."
    def __init__(self, lang): pass
    def tokenizer(self, t:str) -> List[str]:
        out = [BOS]
        i = 0
        while i < len(t):
            out.append(t[i])
            i += 1
        return out
            
    def add_special_cases(self, toks:Collection[str]): pass

char_tokenize_processor = TokenizeProcessor(tokenizer=Tokenizer(tok_func=LetterTokenizer), include_bos=False)

data_lm = (TextList
              .from_csv(path='../',csv_name='100k.csv', 
                           processor=[OpenFileProcessor(), char_tokenize_processor, NumericalizeProcessor()])
              .split_by_rand_pct(0.01)
              .label_for_lm()
              .databunch(bs=256,bptt=128))
idx text
0 I I S T V xxbos M E R L K K E E E E K L K E V E A E E E E E E E E E E E E E E E E I P L Q R N V R R T G E G E S S G T A E E E K L E K M V S
1 F D E K L I I Y G R D V Y L K D G R L I F E S I A D A D H I R S S I V N D G T K Q P I L E E L K G F T S S K S A F M T A T K E L S E A A V F
2 S D A E A K L K N G V H C L E V xxbos M L D G R R S K E N W G V K H N P T T D A I F I L A E T G T C M I H I A C H F D G E K L K L Q L T K V
3 E R T G R P L G A D N F I A W L E N V L G R M L H K Q K S A I S F G V A G A A C R G R xxbos M G P G S R V H T V Q T L V V G G G V V G L S A
4 P K V K R P E A G V T G R E M L L W D F K D M N Q E G L E N I W A A L D D V V M G G V S L S N I K L A E H G A T F S G E T S S R N S G G F C

Right now I don’t have any padding in my input.

I used pretty much standard stuff. I added xxbos and xxeos tokens but no special padding.
My tokenizer looks like this. Pretty simple compared to my previous version…

My problem was the sampling: fastai keeps adding tokens until the sequence reaches max_size.

Ah
Beware when sampling! If you just use the predict method you might get horrible results. That’s because it won’t stop adding tokens until the sentence reaches a predefined size. Try to modify it to include a stop token.
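
A minimal sketch of such a stop-aware sampling loop, assuming a fastai v1 language-model learner named learn and the single-character start/end tokens (‘G’/‘E’) used earlier in the thread (this is not the built-in predict method):

import torch
import torch.nn.functional as F

def sample_with_stop(learn, start_tok='G', end_tok='E', max_len=150):
    "Multinomial sampling that stops as soon as the end token is produced."
    vocab = learn.data.train_ds.x.vocab
    learn.model.reset(); learn.model.eval()
    xb = torch.tensor([[vocab.stoi[start_tok]]]).to(learn.data.device)
    toks = []
    with torch.no_grad():
        for _ in range(max_len):
            logits = learn.model(xb)[0][:, -1]                     # logits for the next token
            idx = torch.multinomial(F.softmax(logits, dim=-1), 1)
            tok = vocab.itos[idx.item()]
            if tok == end_tok: break                               # stop instead of padding out to max_len
            toks.append(tok)
            xb = idx
    learn.model.train()
    return ''.join(toks)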

Hi @cdparks
Did you manage to improve your results?