I am very excited to share our recent work on protein classification that appeared on bioRxiv today.
We pretrain a language model on Swiss-Prot, a large protein database, and finetune a classifier on three different protein classification tasks. The model turns out to be quite competitive with state-of-the-art models that rely on precomputed PSSM features from expensive database similarity searches. There seem to be many possible applications for NLP methods in the domain of proteomics.
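To make the pretrain-then-finetune idea concrete, here is a minimal ULMFiT-style sketch in PyTorch. This is a toy illustration, not our actual architecture or hyperparameters: a small LSTM language model over amino-acid tokens is "pretrained" on next-residue prediction, and its encoder is then reused by a classification head.

```python
import torch
import torch.nn as nn

# Toy sketch of LM pretraining + classifier finetuning on protein sequences.
# Architecture, sizes, and data are illustrative placeholders.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 = padding

class ProteinLM(nn.Module):
    """LSTM language model: predicts the next amino acid at each position."""
    def __init__(self, vocab_size=21, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h)

class ProteinClassifier(nn.Module):
    """Reuses the pretrained LM encoder; only the classification head is new."""
    def __init__(self, lm, n_classes):
        super().__init__()
        self.embed, self.lstm = lm.embed, lm.lstm  # transferred weights
        self.head = nn.Linear(lm.lstm.hidden_size, n_classes)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h[:, -1])  # classify from the final hidden state

def encode(seq, length=16):
    ids = [VOCAB[aa] for aa in seq][:length]
    return torch.tensor(ids + [0] * (length - len(ids)))

# 1) "Pretrain": one LM step on next-residue prediction.
lm = ProteinLM()
batch = torch.stack([encode("MKTAYIAKQR"), encode("GSHMLEDPVA")])
logits = lm(batch[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 21), batch[:, 1:].reshape(-1))
loss.backward()

# 2) Finetune: the classifier inherits the LM encoder.
clf = ProteinClassifier(lm, n_classes=3)
out = clf(batch)
print(out.shape)  # → torch.Size([2, 3])
```

In the real setup the encoder would of course be pretrained to convergence on Swiss-Prot before the classifier head is attached.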
Sorry for keeping you waiting so long. An updated version of our preprint is now available on bioRxiv. We also set up a GitHub repository with source code and links to pretrained models. Happy finetuning!
We recently applied the same framework to a peptide (i.e., protein fragment) regression task, namely MHC binding affinity prediction. Surprisingly, even a single model (a 1-layer LSTM) with a standard training procedure (1-cycle) reaches state-of-the-art performance, and ensembling or LM pretraining only marginally improves the result.
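For anyone curious what "single 1-layer LSTM with 1-cycle training" looks like in practice, here is a minimal sketch using PyTorch's built-in `OneCycleLR` schedule. The model sizes, peptide encoding, and data are toy placeholders, not our actual configuration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class PeptideRegressor(nn.Module):
    """Toy 1-layer LSTM regressor for a scalar affinity target."""
    def __init__(self, vocab=21, emb=16, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h[:, -1]).squeeze(-1)

model = PeptideRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
steps = 20
# 1-cycle: the learning rate ramps up to max_lr, then anneals back down.
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-2, total_steps=steps)

x = torch.randint(1, 21, (8, 9))  # 8 toy 9-mer peptides as token ids
y = torch.rand(8)                 # toy affinity targets in [0, 1]

for _ in range(steps):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    sched.step()
```

The fast.ai library wraps exactly this schedule in `fit_one_cycle`, which is what we used in practice.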
If any of you are interested, have a look at our preprint or the corresponding GitHub repository, which builds on fast.ai.
Comments of any kind are of course very much appreciated…