I wanted to make a general thread for deep learning applications to biological/genomic data. Biology is just starting to hit its big data period, driven largely by falling sequencing costs. Applications of deep learning to biological analysis are still very new. When you look at recent publications, you see lots of one-hot encoded vectors and hand-engineered features. Techniques like pre-training and transfer learning are not widely used. Which is all to say I think there's a lot of opportunity to improve on existing models.
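For anyone new to the space, the one-hot encoding mentioned above is just a per-base indicator vector. A minimal sketch (the all-zero handling of ambiguous bases like N is one common convention, not the only one):

```python
# Each base becomes a 4-dimensional indicator vector over (A, C, G, T).
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator vectors."""
    index = {b: i for i, b in enumerate(BASES)}
    encoded = []
    for base in seq.upper():
        vec = [0, 0, 0, 0]
        if base in index:          # unknown bases (e.g. N) stay all-zero
            vec[index[base]] = 1
        encoded.append(vec)
    return encoded

print(one_hot("ACGN"))
# → [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```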
If you are interested in the space, or come across interesting papers, datasets or problems, post about them here.
I’ve been looking into applying ULMFiT to genomics data for classification, and it works pretty well. I’ve been able to beat a number of published results. The method is a good fit for genomic data because classification corpora tend to be small, but there are heaps and heaps of unlabeled sequence data available for language-model pre-training.
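Since DNA has no natural word boundaries, a language model needs some tokenization scheme first. One common approach is to treat overlapping k-mers as "words" — a sketch of the idea (the specific k and stride here are illustrative, not necessarily what my models use):

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words' for a language model."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("ACGTAC", k=3, stride=1))
# → ['ACG', 'CGT', 'GTA', 'TAC']
```

With stride < k the tokens overlap, which keeps the vocabulary small (4^k) while preserving local context; stride = k gives non-overlapping tokens instead.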
I’m excited about this thread. Thanks for starting it!
Your repo looks very interesting. Do you plan to share your notebooks soon?
I also looked into NGS data and wanted to put together a simple CNN classifier for two different species. However, I was not able to find sequence read data suitable for this approach: the sequencing setups were too different between datasets, which I thought would make the species too easy to tell apart and therefore not a good basis for the experiment. (I also put together a repo of this unfinished project.)
If somebody knows a good source for NGS data please let me know.
I wanted to transfer this approach somehow to the PrecisionFDA CDRH Biothreat Challenge. They provided reference genomes of the species to detect and NGS reads of the samples to analyze.
I guess, RNN/transformer architectures with CNN input stages could be a very interesting tool for NGS data.
Along those lines, if you haven’t read the DeepVariant paper you definitely should. It’s by Google/Verily. They use NGS alignment images as input to a standard CNN for SNP classification.
I don’t know how to write papers, I just know how to post things on github
But in terms of performance there are a few more things I want to prove out before I distribute it more widely. There are a few datasets I’m working on right now that I haven’t quite cracked. CRISPR guide scoring has turned out to be more difficult than I expected. I’m working on a dataset where the authors published much better results than I have achieved, and I’m trying to figure out what the missing ingredient is.
There’s another dataset where I can achieve a lower validation loss compared to the authors, but their accuracy/sensitivity/specificity are higher than mine for the test set. Not yet sure what’s going on there.
The other thing I want to code up is the ability to take in large genomic sequences or raw NGS data to make something that feels more practical and useful. The datasets I’ve used so far have been ones used by other publications, which is nice because it allows for a direct performance comparison between methods. But they feel a little sterile to me.
I’ve played around with the raw Oxford Nanopore data a little bit. If you haven’t already seen them, there are a couple of tools that work with the raw signal data, in addition to the basecaller (Albacore) that Oxford Nanopore provides (which I believe uses an RNN under the hood).
DeepBinner is a tool that de-multiplexes barcoded ONT runs using a CNN to classify the reads. It’s written in Keras, and has a published model with weights (and unusually good documentation).
Chiron is a neural net basecalling tool which achieves roughly the same accuracy as Albacore (I think Albacore changed to a Chiron-style architecture recently). It’s particularly interesting because it uses CTC layers to do sequence-to-sequence learning, i.e. without pre-segmenting the squiggle data into chunks. I think this is a very promising approach and something I want to read more about.
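For anyone unfamiliar with CTC: the network emits a per-timestep distribution over bases plus a "blank" symbol, and decoding collapses that path into a sequence. A minimal sketch of the greedy decoding step (this is the generic CTC collapse rule, not Chiron's actual decoder, which I believe uses beam search):

```python
def ctc_greedy_decode(path, blank=0):
    """Collapse a per-timestep argmax path: merge repeated symbols, drop blanks."""
    decoded = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            decoded.append(symbol)
        prev = symbol
    return decoded

# 0 = blank, 1..4 = A, C, G, T
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 0, 3]))
# → [1, 1, 2, 3], i.e. AACG
```

The blank symbol is what lets the model output the same base twice in a row (A-blank-A) without the repeats being merged away.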
Not a deep learning tool, but SquiggleKit is a handy package for querying & manipulating the signal-level data, which might be a useful reference if you’re building your own stuff.
I’m also very interested in working with the raw signal coming off the NGS devices. It seems likely that there’s all kinds of information inherent in the signal that gets lost when translating to a fastq. It also seems that wide 1D CNN or LSTM networks would have a better chance of picking up surrounding context signal in the raw “squiggle” form than in a basecalled form. That’s just a hunch, though.
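The core operation a 1D CNN would apply to that squiggle is just a learned kernel slid along the signal. A toy sketch with a hand-picked kernel (real networks learn many such kernels; this is only to show the mechanics):

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation: slide the kernel along the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A crude step detector applied to a toy "squiggle" with one current jump:
print(conv1d([0.0, 0.0, 1.0, 1.0, 1.0], [-1.0, 1.0]))
# → [0.0, 1.0, 0.0, 0.0]  (fires only where the level changes)
```

A wide receptive field — stacked convolutions or an LSTM on top — is what would let the model see the surrounding context I mentioned.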
That Deep Review paper/repo is really interesting! I love how they have it set up, and how evangelical they are about their methodology. I contacted them a while ago and they connected me with the specific people who contributed to the particular section of interest. A good bunch of folks with a cool approach.
I figured the second one out. The test set has some long (~15000 bp) sequences. The max_len parameter of my model was too low, so most of the sequence (along with anything batched with it) got cut off. Upping max_len and lowering batch size for inference fixed things. Results are now posted.
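For anyone hitting a similar bug, the failure mode is easy to reproduce: a hard max_len cutoff silently drops everything past the limit. A sketch (the numbers are illustrative, not my actual config):

```python
def batch_truncate(seqs, max_len):
    """Cut every sequence in a batch to max_len tokens before inference."""
    return [s[:max_len] for s in seqs]

long_seq = "A" * 15000          # a long test-set sequence, ~15000 bp
batch = batch_truncate([long_seq], max_len=1000)
print(len(batch[0]))
# → 1000  (14000 bp silently dropped when max_len is too small)
```

The fix was exactly what you'd expect from this: raise max_len so long sequences survive, and shrink the batch size so the larger tensors still fit in memory.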
I always found this topic very interesting, but my lack of knowledge in biology kept me away from it. Do you know any good introduction to genomics for non-biologists?