Protein Language Models, Transformers, Fine-tune, ESM2_t33_650M_UR50D, Error analysis
Key Takeaways
I fine-tuned antibody-specific and general protein language models to predict the binding affinity (Kd) between single-chain variable fragments (scFv) and the target SARS-CoV-2 peptide.
The ESM2_t33_650M_UR50D model demonstrated superior performance compared to the antibody-specific language models antiberta2-cssp, antiberta2, and ablang-H. While ablang-H lagged significantly behind, the other three models produced relatively comparable results. Note, however, that ESM2_t33_650M_UR50D is the largest of the group, with about 650 million parameters, whereas antiberta2 and antiberta2-cssp each have around 202 million.
My approach exceeded the original study’s Spearman rho (0.64 vs ~0.50) on the hold-out set using just a single model, as opposed to the original study’s ensemble of 16 models.
OpenAI’s CLIP model is quite impressive at connecting text with images, efficiently learning visual concepts from natural language supervision. Can we do the same for protein sequences and structure? Instead of text and images, we’re integrating protein sequences with their structural information.
In a series of four accessible notebooks, we develop a multimodal training approach that integrates antibody sequence data with structural data using a contrastive learning framework inspired by OpenAI’s CLIP. We use the ESM2 model from Facebook’s Evolutionary Scale Modeling (ESM) suite as our base architecture and fine-tune it with a custom head for contrastive learning, projecting sequence and structural embeddings into a common latent space. Training minimizes a contrastive loss so that sequences and their corresponding structures end up closely aligned in this space. We call this model ESM2-Ab-CLIP.
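For intuition, here is a minimal sketch of a CLIP-style contrastive head pairing sequence and structure embeddings. The projection dimensions, the structure-encoder output size, and the temperature initialization are assumptions for illustration, not the exact training code from the notebooks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveProjectionHead(nn.Module):
    """Projects sequence and structure embeddings into a shared latent space."""
    def __init__(self, seq_dim=1280, struct_dim=1280, proj_dim=256):  # 1280 = ESM2-650M hidden size; struct_dim is assumed
        super().__init__()
        self.seq_proj = nn.Linear(seq_dim, proj_dim)
        self.struct_proj = nn.Linear(struct_dim, proj_dim)
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, seq_emb, struct_emb):
        # L2-normalize both modalities before computing pairwise similarities.
        z_seq = F.normalize(self.seq_proj(seq_emb), dim=-1)
        z_struct = F.normalize(self.struct_proj(struct_emb), dim=-1)
        return self.logit_scale.exp() * z_seq @ z_struct.t()

def clip_loss(logits):
    # Symmetric cross-entropy: each sequence should match its own structure, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for ESM2 and structure-encoder outputs.
seq_emb, struct_emb = torch.randn(8, 1280), torch.randn(8, 1280)
head = ContrastiveProjectionHead()
loss = clip_loss(head(seq_emb, struct_emb))
```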
We also evaluate the ESM2-Ab-CLIP model on the antibody binding affinity prediction task relative to the base ESM2 model. With ample data for fine-tuning, multimodal training provided no additional benefit. Its true strength emerged in the low-data regime, where ESM2-Ab-CLIP outperformed the base ESM2 model on the binding affinity task as measured by Spearman rho and Top 10% recall.
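For reference, the two reported metrics can be computed roughly as below. The exact thresholding and label orientation are assumptions: the sketch treats larger label values as stronger binding (e.g. pKd); flip the sort direction if working with raw Kd.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(y_true, y_pred, k=0.10):
    """Spearman rho plus recall of the true top-k% binders among the predicted top-k%."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rho, _ = spearmanr(y_true, y_pred)
    n_top = max(1, int(round(len(y_true) * k)))
    true_top = set(np.argsort(y_true)[-n_top:])   # assumes higher value = stronger binding
    pred_top = set(np.argsort(y_pred)[-n_top:])
    recall = len(true_top & pred_top) / n_top
    return {"spearman_rho": rho, "top10_recall": recall}
```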
I fine-tuned both antibody-specific and general protein language models to predict the binding affinity (Kd) between single-chain variable fragments (scFv) and the SARS-CoV-2 peptide. This involved customizing the models to better fit the specific characteristics of antibody-peptide interactions. The ESM2_t33_650M_UR50D model outperformed the antibody-specific language models, although it is also larger: it has approximately 650 million parameters, compared with around 202 million each for antiberta2-cssp and antiberta2, while ablang-H lagged significantly behind the rest. My approach surpassed the original study’s Spearman correlation coefficient (rho) of ~0.50, achieving a rho of 0.64 on the hold-out set with a single model. This is a notable improvement, considering the original study employed an ensemble of 16 models to reach its score.
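A minimal sketch of this fine-tuning setup is shown below, assuming a Hugging Face ESM2 checkpoint with a mean-pooled regression head; the head architecture, pooling choice, and loss target (raw vs. log-transformed Kd) are illustrative assumptions rather than the exact configuration used.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, EsmModel

class ESM2AffinityRegressor(nn.Module):
    def __init__(self, checkpoint="facebook/esm2_t33_650M_UR50D"):
        super().__init__()
        self.encoder = EsmModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size  # 1280 for the 650M checkpoint
        self.head = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool residue embeddings over non-padded positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.head(pooled).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = ESM2AffinityRegressor()
# Placeholder scFv fragment; in practice, feed the full scFv sequences from the dataset.
batch = tokenizer(["EVQLVESGGGLVQPGGSLRLSCAAS"], return_tensors="pt", padding=True)
preds = model(**batch)
loss = nn.MSELoss()(preds, torch.tensor([1.23]))  # illustrative affinity target
```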