ULMFiT for Proteomics

I am very excited to share our recent work on protein classification that appeared on bioRxiv today.

We pretrain a language model on Swiss-Prot, a large protein database, and finetune a classifier on three different protein classification tasks. It turns out that the model is quite competitive with state-of-the-art models that rely on precomputed PSSM features from expensive database similarity searches. There seem to be many possible applications for NLP methods in the domain of proteomics.
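The basic trick that makes NLP tooling applicable here is treating a protein sequence as a "sentence" of amino-acid tokens. A minimal sketch of that preprocessing step is below; single-residue tokenization and the helper names are illustrative assumptions, not necessarily the exact scheme used in the paper:

```python
def tokenize_protein(seq):
    """Split a protein sequence into single amino-acid tokens.

    Treating each residue as a 'word' lets a standard language-model
    pipeline consume protein sequences. (Single-residue tokenization is
    one common choice; n-gram tokenizations are another.)
    """
    amino_acids = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
    return [aa for aa in seq.upper() if aa in amino_acids]


def build_vocab(sequences):
    """Map each observed token to an integer id, reserving 0 for padding."""
    tokens = sorted({aa for seq in sequences for aa in tokenize_protein(seq)})
    return {aa: i + 1 for i, aa in enumerate(tokens)}


# Example: numericalize a sequence for an embedding layer
vocab = build_vocab(["MKTAYIAKQR", "GAVLI"])
ids = [vocab[aa] for aa in tokenize_protein("MKT")]
```

From here the numericalized sequences feed into the usual ULMFiT recipe: train a next-token language model, then swap the LM head for a classification head and finetune.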

Happy to hear your thoughts on this…


Is it possible to share the pretrained embeddings?

Thanks for your interest. We will try to release the code and some pretrained models as soon as possible.

Sorry for keeping you waiting so long. An updated version of our preprint is now available on bioRxiv. We also set up a GitHub repository with source code and links to pretrained models. Happy finetuning!

We recently applied the same framework to a peptide (i.e. protein fragment) regression task, namely MHC binding affinity prediction. Surprisingly, even a single model (a 1-layer LSTM) with a standard training procedure (1-cycle) reaches state-of-the-art performance, and ensembling or LM pretraining only marginally improves the result.
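For anyone unfamiliar with the 1-cycle policy mentioned above: the learning rate warms up to a maximum over an initial fraction of training and then anneals back down. Here is a minimal self-contained sketch of a cosine-annealed variant; the hyperparameter names and default values are illustrative assumptions, not the exact settings from the paper:

```python
import math


def one_cycle_lr(step, total_steps, max_lr, div=25.0, final_div=1e4, pct_start=0.25):
    """Learning rate at `step` under a cosine one-cycle schedule.

    Warms up from max_lr/div to max_lr over the first `pct_start`
    fraction of training, then anneals down to max_lr/final_div.
    (Defaults are illustrative, not the paper's settings.)
    """
    def cos_interp(start, end, frac):
        # cosine interpolation from `start` to `end` as frac goes 0 -> 1
        return end + (start - end) / 2 * (1 + math.cos(math.pi * frac))

    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        return cos_interp(max_lr / div, max_lr, step / warmup_steps)
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return cos_interp(max_lr, max_lr / final_div, frac)
```

In practice one would just call the equivalent built-in scheduler of the training framework rather than hand-rolling this, but the shape of the schedule is the whole trick.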

If you are interested, have a look at our preprint or the corresponding GitHub repository building on fast.ai.

Comments of any kind are of course very much appreciated…