I wanted to make a general thread for deep learning applications to biological/genomic data. Biology is just starting to hit its big data period, driven largely by falling sequencing costs. Applications of deep learning to biological analysis are still very new. When you look at recent publications, you see lots of one-hot encoded vectors and hand-engineered features. Techniques like pre-training and transfer learning are not widely used. Which is all to say I think there's a lot of opportunity to improve on existing models.
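For anyone new to the space, the one-hot encoding mentioned above is just a per-base indicator vector. A minimal sketch (the all-zero handling of ambiguous bases like N is one common convention, not the only one):

```python
# Each base becomes a 4-dimensional indicator vector over (A, C, G, T).
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator vectors."""
    index = {b: i for i, b in enumerate(BASES)}
    encoded = []
    for base in seq.upper():
        vec = [0, 0, 0, 0]
        if base in index:          # unknown bases (e.g. N) stay all-zero
            vec[index[base]] = 1
        encoded.append(vec)
    return encoded

print(one_hot("ACGN"))
# → [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```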
If you are interested in the space, or come across interesting papers, datasets or problems, post about them here.
I’ve been looking into applying ULMFiT to genomics data for classification, and it works pretty well. I’ve been able to beat a number of published results. The method is a good fit for genomic data because classification corpora tend to be small, but there are heaps and heaps of unlabeled sequence data available for language-model pre-training.
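Since DNA has no natural word boundaries, a language model needs some tokenization scheme first. One common approach is to treat overlapping k-mers as "words" — a sketch of the idea (the specific k and stride here are illustrative, not necessarily what my models use):

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words' for a language model."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("ACGTAC", k=3, stride=1))
# → ['ACG', 'CGT', 'GTA', 'TAC']
```

With stride < k the tokens overlap, which keeps the vocabulary small (4^k) while preserving local context; stride = k gives non-overlapping tokens instead.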
I’m excited about this thread. Thanks for starting it!
Your repo looks very interesting. Do you plan to share your notebooks soon?
I also looked into NGS data and wanted to put together a simple CNN classifier for two different species. However, I was not able to find sequence read data suitable for this approach: the sequencing setups were too different between datasets, which I thought would make the species too easy to tell apart and therefore not a good basis for the experiment. (I also put together a repo of this unfinished project.)
If somebody knows a good source for NGS data please let me know.
I wanted to transfer this approach somehow to the PrecisionFDA CDRH Biothreat Challenge. They provided reference genomes of the species to detect and NGS reads of the samples to analyze.
I guess, RNN/transformer architectures with CNN input stages could be a very interesting tool for NGS data.
Along those lines, if you haven’t read the DeepVariant paper you definitely should. It’s by Google/Verily. They use NGS alignment images as input to a standard CNN for SNP classification.
I don’t know how to write papers, I just know how to post things on github
But in terms of performance there are a few more things I want to prove out before I distribute it more widely. There are a few datasets I’m working on right now that I haven’t quite cracked. CRISPR guide scoring has turned out to be more difficult than I expected. I’m working on a dataset where the authors published much better results than I have achieved, and I’m trying to figure out what the missing ingredient is.
There’s another dataset where I can achieve a lower validation loss compared to the authors, but their accuracy/sensitivity/specificity are higher than mine for the test set. Not yet sure what’s going on there.
The other thing I want to code up is the ability to take in large genomic sequences or raw NGS data to make something that feels more practical and useful. The datasets I’ve used so far have been ones used by other publications, which is nice because it allows for a direct performance comparison between methods. But they feel a little sterile to me.
I’ve played around with the raw Oxford Nanopore data a little bit. If you haven’t already seen them, there are a couple of tools that work with the raw signal data, in addition to the basecaller (Albacore) that Oxford Nanopore provides (which I believe uses an RNN under the hood).
DeepBinner is a tool that de-multiplexes barcoded ONT runs using a CNN to classify the reads. It’s written in Keras, and has a published model with weights (and unusually good documentation).
Chiron is a neural net basecalling tool which achieves roughly the same accuracy as Albacore (I think Albacore changed to a Chiron-style architecture recently). It’s particularly interesting because it uses CTC layers to do sequence-to-sequence learning, i.e. without pre-segmenting the squiggle data into chunks. I think this is a very promising approach and something I want to read more about.
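For anyone unfamiliar with CTC: the network emits a per-timestep distribution over bases plus a "blank" symbol, and decoding collapses that path into a sequence. A minimal sketch of the greedy decoding step (this is the generic CTC collapse rule, not Chiron's actual decoder, which I believe uses beam search):

```python
def ctc_greedy_decode(path, blank=0):
    """Collapse a per-timestep argmax path: merge repeated symbols, drop blanks."""
    decoded = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            decoded.append(symbol)
        prev = symbol
    return decoded

# 0 = blank, 1..4 = A, C, G, T
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 0, 3]))
# → [1, 1, 2, 3], i.e. AACG
```

The blank symbol is what lets the model output the same base twice in a row (A-blank-A) without the repeats being merged away.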
Not a deep learning tool, but SquiggleKit is a handy package for querying & manipulating the signal-level data, which might be a useful reference if you’re building your own stuff.
I’m also very interested in working with the raw signal coming off the NGS devices. It seems likely that there’s all kinds of information inherent in the signal that gets lost when translating to a fastq. It also seems that wide 1D CNN or LSTM networks would have a better chance of picking up surrounding context signal in the raw “squiggle” form than in a basecalled form. That’s just a hunch, though.
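The core operation a 1D CNN would apply to that squiggle is just a learned kernel slid along the signal. A toy sketch with a hand-picked kernel (real networks learn many such kernels; this is only to show the mechanics):

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation: slide the kernel along the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A crude step detector applied to a toy "squiggle" with one current jump:
print(conv1d([0.0, 0.0, 1.0, 1.0, 1.0], [-1.0, 1.0]))
# → [0.0, 1.0, 0.0, 0.0]  (fires only where the level changes)
```

A wide receptive field — stacked convolutions or an LSTM on top — is what would let the model see the surrounding context I mentioned.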
That Deep Review paper/repo is really interesting! I love how they have it set up, and how evangelical they are about their methodology. I contacted them a while ago and they connected me with the specific people who contributed to the particular section of interest. A good bunch of folks with a cool approach.
I figured the second one out. The test set has some long (~15000 bp) sequences. The max_len parameter of my model was too low, so most of the sequence (along with anything batched with it) got cut off. Upping max_len and lowering batch size for inference fixed things. Results are now posted.
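For anyone hitting a similar bug, the failure mode is easy to reproduce: a hard max_len cutoff silently drops everything past the limit. A sketch (the numbers are illustrative, not my actual config):

```python
def batch_truncate(seqs, max_len):
    """Cut every sequence in a batch to max_len tokens before inference."""
    return [s[:max_len] for s in seqs]

long_seq = "A" * 15000          # a long test-set sequence, ~15000 bp
batch = batch_truncate([long_seq], max_len=1000)
print(len(batch[0]))
# → 1000  (14000 bp silently dropped when max_len is too small)
```

The fix was exactly what you'd expect from this: raise max_len so long sequences survive, and shrink the batch size so the larger tensors still fit in memory.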
I always found this topic very interesting, but my lack of knowledge in biology kept me away from it. Do you know any good introduction to genomics for non-biologists?