Having trouble getting good results with molecular data

enthusiasmcurbed · February 25, 2019, 11:22am

I am attempting to use fastai’s tabular learner to work with molecular data. The data is a collection of molecular fingerprints, which are commonly used for ML and are essentially how we convert molecules into binary vectors. Each training sample has a feature vector like so: [0,1,0,0,…,1,0,1,0,0,0], where an on bit represents the presence of a certain substructure. The target is a pKa (the negative log of the acid dissociation constant), which is just a float, making this a simple regression task.
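To make the feature format concrete, here is a minimal sketch of how a set of on-bit positions becomes a fixed-length binary vector. The helper name `bits_to_vector`, the 8-bit length, and the bit positions are made up for illustration; real fingerprints are produced by a cheminformatics toolkit and are typically 1,000+ bits long.

```python
def bits_to_vector(on_bits, n_bits):
    """Turn a collection of on-bit indices into a 0/1 list of length n_bits."""
    vector = [0] * n_bits
    for idx in on_bits:
        if not 0 <= idx < n_bits:
            raise ValueError(f"bit index {idx} is outside 0..{n_bits - 1}")
        vector[idx] = 1  # this substructure is present in the molecule
    return vector


# Hypothetical molecule whose substructures map to bits 1, 4 and 6.
fingerprint = bits_to_vector({1, 4, 6}, n_bits=8)
print(fingerprint)  # [0, 1, 0, 0, 1, 0, 1, 0]
```

Each row of the training table is one such vector plus the pKa target.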

I have made all of my testing with fastai available and reproducible in this Google Colab notebook, so anyone on the forum can see the results and work through them.

In particular, I am having trouble getting any decent results, even though gradient boosted regression trees have done alright (CV r2_score ~0.76). Using fastai’s tabular learner, I cannot even get above R^2 = 0. Using @jeremy’s excellent PyTorch tutorial, I obtained similarly bad results. I recognise that this is a small dataset (~3000 training samples), but I was able to get up to an R^2 of 0.8 using simple feedforward neural networks in DeepChem. Therefore, I am confident I must be messing something up or misusing the tabular learner.
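For readers unsure what an R^2 at or below 0 means: a model that always predicts the mean of the targets scores exactly 0, and anything worse than that goes negative. A small pure-Python sketch with made-up numbers (the function name `r_squared` and the values are illustrative, not taken from the notebook):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]          # made-up targets, mean is 6.0

mean_baseline = [6.0, 6.0, 6.0, 6.0]   # always predict the mean
r2_mean = r_squared(y_true, mean_baseline)

reversed_preds = [9.0, 7.0, 5.0, 3.0]  # worse than the mean baseline
r2_bad = r_squared(y_true, reversed_preds)

print(r2_mean)  # 0.0
print(r2_bad)   # -3.0
```

So a network stuck at R^2 ≤ 0 is doing no better than a constant prediction, which usually points to a setup problem rather than just a small dataset.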

Any help would be greatly appreciated, because I would love to be able to switch over to fastai/PyTorch for this task (I love the syntax and the community). Furthermore, it would be great to get this working for other people in the chemistry community.

Thanks so much,
Matt

Bill · March 19, 2019, 4:09pm

Hi Matt,
Have you tried other ML models, such as kernel ridge regression?
You might find the following article helpful:

Best,
Bill

marcossantana · March 2, 2021, 3:28pm

Try passing the fingerprint bits as categorical variables. They will be converted to embeddings inside the learner, which will greatly reduce the dimensionality. By using the fingerprints as is, you are showing your model more than 1,000 features for only ~3,000 training samples. That’s why it is overfitting.
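A rough sketch of the data-preparation side of this suggestion, assuming the fingerprints live in a pandas DataFrame with one column per bit. The column names (`bit_0`, `bit_1`, …), the `pka` column, and the toy values are all hypothetical; the resulting `cat_names` list is what you would hand to fastai’s tabular data loading as the categorical columns, with no continuous columns.

```python
import pandas as pd

# Toy data: 4 molecules with 6 fingerprint bits each (made-up values).
fingerprints = [
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 0],
]
pka_values = [4.8, 9.2, 3.1, 7.4]  # made-up targets

# Hypothetical column names, one per fingerprint bit.
bit_names = [f"bit_{i}" for i in range(len(fingerprints[0]))]
df = pd.DataFrame(fingerprints, columns=bit_names)

# Mark every bit column as categorical with the two levels 0 and 1,
# so it is treated as a category rather than a number.
for name in bit_names:
    df[name] = pd.Categorical(df[name], categories=[0, 1])

df["pka"] = pka_values

cat_names = bit_names  # all fingerprint bits are categorical
cont_names = []        # no continuous features in this setup
dep_var = "pka"        # regression target
```

Whether this helps on a given dataset is worth checking against the gradient boosting baseline from the first post.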