I am very excited to share our recent work on protein classification that appeared on bioRxiv today.
We pretrain a language model on Swiss-Prot, a large protein database, and finetune a classifier on three different protein classification tasks. The model turns out to be quite competitive with state-of-the-art models that rely on precomputed PSSM features from expensive database similarity searches. There seem to be many possible applications for NLP methods in the domain of proteomics.
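To make the pretrain-then-finetune idea concrete, here is a minimal ULMFiT-style sketch in PyTorch. This is a toy illustration, not our actual architecture or hyperparameters: a small LSTM language model over amino-acid tokens is "pretrained" on next-residue prediction, and its encoder is then reused by a classification head.

```python
import torch
import torch.nn as nn

# Toy sketch of LM pretraining + classifier finetuning on protein sequences.
# Architecture, sizes, and data are illustrative placeholders.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 = padding

class ProteinLM(nn.Module):
    """LSTM language model: predicts the next amino acid at each position."""
    def __init__(self, vocab_size=21, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h)

class ProteinClassifier(nn.Module):
    """Reuses the pretrained LM encoder; only the classification head is new."""
    def __init__(self, lm, n_classes):
        super().__init__()
        self.embed, self.lstm = lm.embed, lm.lstm  # transferred weights
        self.head = nn.Linear(lm.lstm.hidden_size, n_classes)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h[:, -1])  # classify from the final hidden state

def encode(seq, length=16):
    ids = [VOCAB[aa] for aa in seq][:length]
    return torch.tensor(ids + [0] * (length - len(ids)))

# 1) "Pretrain": one LM step on next-residue prediction.
lm = ProteinLM()
batch = torch.stack([encode("MKTAYIAKQR"), encode("GSHMLEDPVA")])
logits = lm(batch[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 21), batch[:, 1:].reshape(-1))
loss.backward()

# 2) Finetune: the classifier inherits the LM encoder.
clf = ProteinClassifier(lm, n_classes=3)
out = clf(batch)
print(out.shape)  # → torch.Size([2, 3])
```

In the real setup the encoder would of course be pretrained to convergence on Swiss-Prot before the classifier head is attached.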
Sorry for keeping you waiting so long. An updated version of our preprint is now available on bioRxiv. We also set up a GitHub repository with source code and links to pretrained models. Happy finetuning!
We recently applied the same framework to a peptide (i.e., protein fragment) regression task, namely MHC binding affinity prediction. Surprisingly, even a single model (a 1-layer LSTM) with a standard training procedure (1-cycle) reaches state-of-the-art performance, and ensembling or LM pretraining only marginally improves the result.
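For anyone curious what "single 1-layer LSTM with 1-cycle training" looks like in practice, here is a minimal sketch using PyTorch's built-in `OneCycleLR` schedule. The model sizes, peptide encoding, and data are toy placeholders, not our actual configuration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class PeptideRegressor(nn.Module):
    """Toy 1-layer LSTM regressor for a scalar affinity target."""
    def __init__(self, vocab=21, emb=16, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h[:, -1]).squeeze(-1)

model = PeptideRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
steps = 20
# 1-cycle: the learning rate ramps up to max_lr, then anneals back down.
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-2, total_steps=steps)

x = torch.randint(1, 21, (8, 9))  # 8 toy 9-mer peptides as token ids
y = torch.rand(8)                 # toy affinity targets in [0, 1]

for _ in range(steps):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    sched.step()
```

The fast.ai library wraps exactly this schedule in `fit_one_cycle`, which is what we used in practice.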
If any of you are interested, have a look at our preprint or the corresponding GitHub repository, which builds on fast.ai.
Comments of any kind are of course very much appreciated…