ULMFiT for Proteomics

(Nils) #1

I am very excited to share our recent work on protein classification that appeared on bioRxiv today.

We pretrain a language model on Swiss-Prot, a large protein database, and fine-tune a classifier on three different protein classification tasks. It turns out that the model is quite competitive with state-of-the-art models that rely on precomputed PSSM features from expensive database similarity searches. There seem to be many possible applications for NLP methods in the domain of proteomics.
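To make the "protein sequences as sentences" idea concrete, here is a minimal sketch (not the authors' code; all names are illustrative) of the usual first step: tokenizing a protein sequence into per-residue tokens and mapping them to integer ids, as one would before training an ULMFiT-style language model.

```python
# Sketch: amino-acid-level tokenization for a protein language model.
# Vocabulary layout and special-token names are assumptions, not taken
# from the paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Special tokens as in typical NLP vocabularies (names are illustrative).
vocab = ["<pad>", "<bos>", "<unk>"] + list(AMINO_ACIDS)
stoi = {tok: i for i, tok in enumerate(vocab)}

def numericalize(seq: str) -> list:
    """Map a protein sequence to integer token ids, one per residue.

    Non-standard residue codes (e.g. X, B, Z) fall back to <unk>.
    """
    unk = stoi["<unk>"]
    return [stoi["<bos>"]] + [stoi.get(aa, unk) for aa in seq.upper()]

ids = numericalize("MKTAYIAKQR")  # one <bos> token plus 10 residue ids
```

From here the id sequences can be batched and fed to any standard LM training loop; the interesting part in the paper is that the downstream classifier fine-tuned from such a pretrained LM competes with PSSM-based models.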

Happy to hear your thoughts on this…


(Thundering Typhoons) #2

Is it possible to share the pretrained embeddings?


(Nils) #3

Thanks for your interest. We will try to release the code and some pretrained models as soon as possible.


(Nils) #4

Sorry for the long wait. An updated version of our preprint is now available on bioRxiv. We also set up a GitHub repository with source code and links to pretrained models. Happy fine-tuning!


(Nils) #5

We recently applied the same framework to a peptide (i.e., short protein fragment) regression task, namely MHC binding affinity prediction. Surprisingly, even a single model (a 1-layer LSTM) with a standard training procedure (1-cycle) reaches state-of-the-art performance; ensembling and LM pretraining only marginally improve the result.
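For readers unfamiliar with the 1-cycle policy mentioned above, here is a short self-contained sketch of the schedule: the learning rate anneals up from a low value to a peak over the first part of training and then back down well below the starting point. The parameter names loosely follow fastai's `fit_one_cycle` defaults, and the cosine interpolation is an assumption about one common variant (the original policy used linear ramps), not a reproduction of the authors' exact setup.

```python
import math

def one_cycle_lr(step, total_steps, lr_max,
                 pct_start=0.3, div=25.0, div_final=1e4):
    """Learning rate at `step` under a cosine 1-cycle schedule (sketch).

    Warms up from lr_max/div to lr_max over the first pct_start of
    training, then anneals down to lr_max/div_final.
    """
    warm = int(total_steps * pct_start)

    def cos_anneal(start, end, frac):
        # Cosine interpolation from `start` to `end` as frac goes 0 -> 1.
        return end + (start - end) * (1 + math.cos(math.pi * frac)) / 2

    if step < warm:
        return cos_anneal(lr_max / div, lr_max, step / warm)
    frac = (step - warm) / max(1, total_steps - warm)
    return cos_anneal(lr_max, lr_max / div_final, frac)

# Example: the rate starts low, peaks at lr_max, then decays far below it.
schedule = [one_cycle_lr(s, 100, 1e-2) for s in range(100)]
```

The appeal of the policy is that it is a single fixed recipe with few knobs, which matches the post's point that a standard training procedure was enough to reach strong results.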

If any of you are interested, have a look at our preprint or the corresponding GitHub repository, which builds on fast.ai.

Comments of any kind are of course very much appreciated…