Deep Learning for Genomics/Bioinformatics/Comp Bio

Gave this a quick once-over. The first thing that came to mind was Illumina’s PrimateAI for classifying the pathogenicity of missense mutations.

At first glance, the publication Ching et al., bioRxiv, 2017, which the Green Lab’s GitHub led me to, seems worth reading.

This ‘perspective’ paper, Zou et al., Nature Genetics, 2019, by authors out of Stanford, CZ Biohub, Scripps, etc., might also be a nice introduction to Deep Learning in Genomics. @axelstram

I’m happy to advise on writing scientific papers, but I have a lot to learn in relation to writing deep learning papers.

An outline for a scientific paper is generally:

Introduction

  • Start with a few lines that explain what motivated your work - what the field was missing
  • Briefly summarize the work that came before yours and made your work possible. Include anything that stimulated your ideas
  • In the final paragraph write what you aimed to do and what you found

Results & Discussion

  • Explain the steps you took to generate some results, state the results, and relate your results to the results that others had obtained
  • If others haven’t shown results like yours, state something like “this is the first instance we know of where results of this type have been shared”
  • Repeat for each logical step, including your final result.

Conclusion

  • The first line should summarize what you found in a broader sense than what you wrote in the introduction
  • State why what you did is important in a broad sense
  • Relate what you did to what others have done
  • Hypothesize about what could come next

Methods

  • Walk people through the more minute details of how you did stuff

Generally, scientists will write their methods section first, then the results & discussion section, followed by the introduction, and finish with the conclusion.

Here’s a description of how to write a scientific paper from Nature.

4 Likes

Hey Thom, I think your hunch is right: some data is judged not to be of good enough quality to pass the threshold because it is unclear where the read should actually ‘align’, or which part of that squiggle is correct. Often the scientific way of looking into these areas is to use a different sequencing method that targets the questionable region. This means that companies would have train/test data to use to build a model :wink:

I will answer my own question with what I have found:

Molecular Biology for Computer Scientists (book chapter)

Bioinformatics Algorithms: An Active Learning Approach (very hands-on book that also has a (very) long Coursera specialization associated with it)

3 Likes

Also check out the Biostars book & website

2 Likes

Hey all,
Quick question: should we move this conversation into the open or keep it here, locked up within part 2 until the end of the class?

Below are some other fast.ai boards with similar types of topics:

Deep Learning

Part 1 2019

Initially I thought we should move it out, but now that I’ve had a look around, I don’t think there is much related activity outside of part 2. So, I think we can wait until the end of these classes.

4 Likes

Hey all,
I’m trying to read the paper “Universal Language Model Fine-tuning for Text Classification” that Jeremy and Sebastian Ruder published last year (2018).

In trying to understand the difference between transductive and inductive transfer learning, I googled and found a Quora answer by Waleed Kadous (AI PhD) that was easy to understand. I’ve pasted it below:

Imagine you have training data, but only a subset of it has labels.

For example, say you are trying to classify whether an image has a flower in it or not. You have 100,000 images, but you only have 1,000 images that you know definitively contain a flower; and another 1,000 that you know don’t contain a flower. The other 98,000 you have no idea about – maybe they have flowers, maybe they don’t.

Inductive learning works by looking at the 2,000 labeled examples and building a classifier on this. Transductive learning (also known as semi-supervised learning) says “Wait: maybe the other 98,000 images don’t have labels, but they tell me something about the problem space. Maybe I can still use them to help improve my accuracy.”

Bonus: There’s one more really interesting type of learning, which is active learning. That is when you look through the 98,000 examples and can select a subset and request labels from an oracle. So the algorithm might say “OK, of those 98,000, can you label this set of 50 images for me? That will help me build a better classifier.”

So my understanding is:

  • Inductive transfer learning only uses labeled data

  • Transductive transfer learning uses both labeled and unlabeled data, so the model gets a sense of the scope of the data it will eventually be asked to classify (toy sketch below)
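
A toy sketch of the distinction using scikit-learn (my own example, not from the Quora answer; the self-training wrapper is just one stand-in for the “also use the unlabelled pool” idea):

```python
# Toy illustration: an inductive model fits only the labelled rows, while a
# semi-supervised/self-training wrapper also looks at the unlabelled rows
# (marked with -1). Data, model choice, and sizes are all made up.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)

y = np.full(1000, -1)        # -1 marks "unlabelled"
y[:20] = y_true[:20]         # pretend only 20 examples have labels

inductive = LogisticRegression().fit(X[:20], y[:20])
semi_supervised = SelfTrainingClassifier(LogisticRegression()).fit(X, y)

print(inductive.score(X, y_true), semi_supervised.score(X, y_true))
```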

We might be discussing unpublished research shared by Jeremy, which might be another good reason to keep this in the private forum.

1 Like

According to your Quora quote, transductive learning is another name for semi-supervised learning.

A quick search using the term “semi-supervised” instead of the word “transductive” yielded this article, https://arxiv.org/abs/1812.05313, which might shed light on your question.

1 Like

Check out this Commonwealth Club podcast on “Deep Medicine”, a new book by Eric Topol, M.D. (Scripps)

Link to Commonwealth Club iTunes page (Deep Medicine release date: 4/2)

1 Like

Eric Topol and I both gave talks at Scripps recently. I called mine “Even Deeper Medicine” :slight_smile:

14 Likes

I just signed up for this website. Thank you for the reference.

I am interested in sharing my own medical data for research. How should I go about getting it from my providers, and how should I share it?

How could I bulk download genome data from the web? I am trying to replicate @KarlH’s notebooks on my computer, and I downloaded a few dozen bacterial sequences from the NCBI database; it took an hour or so of tedious clicking. Is there a better way?

Downloads go a lot faster with wget or curl

https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/
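
If you’d rather script it in Python than click, here’s a rough sketch of bulk-downloading bacterial assemblies based on my reading of that FTP FAQ (the assembly_summary.txt layout, the column index, and the file-naming convention are assumptions worth double-checking):

```python
# Rough sketch: grab the RefSeq bacteria assembly summary, pull the ftp_path
# column, and download the genomic FASTA for the first few assemblies.
import os
import urllib.request

SUMMARY_URL = "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt"

def download_genomes(n=10, out_dir="genomes"):
    os.makedirs(out_dir, exist_ok=True)
    summary = urllib.request.urlopen(SUMMARY_URL).read().decode()
    count = 0
    for line in summary.splitlines():
        if line.startswith("#"):                 # skip header/comment lines
            continue
        fields = line.split("\t")
        if len(fields) < 20:
            continue
        ftp_path = fields[19]                    # 'ftp_path' column (20th field)
        if not ftp_path.startswith(("http", "ftp")):
            continue
        name = ftp_path.rsplit("/", 1)[-1]
        url = f"{ftp_path}/{name}_genomic.fna.gz"
        urllib.request.urlretrieve(url, os.path.join(out_dir, f"{name}.fna.gz"))
        count += 1
        if count >= n:
            break

download_genomes(n=5)
```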

1 Like

I downloaded .fna files for the entire collection of bacteria using a regular browser. It is a 15 GB download and did not take too long. I am training a language model on 500 genomes now; one epoch on my GPU takes 5 hours. I will keep you posted on the results.

1 Like

Training on 1,000 genomes now. Accuracy has improved to 0.19, breaking through the previous barrier of 0.18. That is a good sign, because so far I have gone through less than 10% of the training data. It is painfully slow, but training is going in the right direction: accuracy improves and the model is not overfitting yet.
If any of you know how to interpret the accuracy metric in the context of a language model, please explain it to me. I only understand accuracy in the context of classification, and a language model is not classification. Or is it?
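
My working guess is that the language model is effectively doing classification: at every position it predicts the next token out of the whole vocabulary, so accuracy is the fraction of positions where the argmax prediction matches the actual next token. A tiny sketch with made-up tensors (my own illustration, not fastai’s internals):

```python
# Made-up logits/targets just to show what "accuracy" would mean for a
# language model: next-token classification over the vocabulary.
import torch

batch, seq_len, vocab_size = 4, 50, 100
logits = torch.randn(batch, seq_len, vocab_size)           # model outputs
targets = torch.randint(0, vocab_size, (batch, seq_len))   # actual next tokens

preds = logits.argmax(dim=-1)
accuracy = (preds == targets).float().mean()
print(accuracy.item())   # ~1/vocab_size for random logits
```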

3 Likes

I don’t know how to get your data from your provider, but you could join “All of Us”, a U.S.-wide program to sequence 1 million genomes. UCSF is the local facilitator.

I already gave them some samples; it took about 30 minutes and a few online forms. My theory is that it’s better for the NIH to have my genome than a private company :slight_smile:

3 Likes

This is a good paper about how to do hypothesis-driven and controlled AI experiments, by Assistant Professor Michael Keiser at UCSF: “Adversarial Controls for Scientific Machine Learning”.

He talks about generating random data to put through an ML model to see whether you get results similar to what you get when you put in the features you think are important. I’m not sure how this will transfer to DL models, since we are not supposed to ‘curate’ the inputs.
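
As a toy version of that kind of control (my own example with fabricated data, not the paper’s experiments): train once on the real labels and once on scrambled labels, and worry if the two scores look similar.

```python
# Scrambled-label ("y-scrambling") control: if the model does about as well
# on permuted labels as on real ones, it isn't learning the signal you think.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
y = (X[:, 0] > 0).astype(int)            # the real signal lives in feature 0

real = cross_val_score(RandomForestClassifier(), X, y, cv=5).mean()
control = cross_val_score(RandomForestClassifier(), X, rng.permutation(y), cv=5).mean()

print(f"real labels: {real:.2f}  scrambled labels: {control:.2f}")
```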

1 Like

Interesting paper this morning: “Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts”

Honestly, I don’t really understand the actual problem they’re solving, i.e. why it’s important to improve prediction of chromatin accessibility, but that’s a bio question, not a DL question. The paper is quite clear in its technical implementation and model design: they use a fairly shallow ResNet architecture with 1d convolutions, and they do their own kind of transfer learning (I’ll have to re-read to understand what they’re transferring between; it seems like “curriculum learning” to me, but I’m not sure). It’s done in PyTorch.
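
To make the architecture concrete for myself, here’s a toy sketch of the kind of shallow 1d-conv ResNet they describe, in PyTorch (my own guess at the general shape; the channel counts, kernel sizes, and head are made up, not theirs):

```python
# Minimal 1d-conv residual network over one-hot DNA (4 channels: A, C, G, T).
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)          # residual (skip) connection

model = nn.Sequential(
    nn.Conv1d(4, 64, kernel_size=15, padding=7),
    nn.BatchNorm1d(64), nn.ReLU(),
    ResBlock1d(64), ResBlock1d(64),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, 1),                      # e.g. an accessibility score
)

x = torch.randn(8, 4, 1000)                # 8 random 1 kb "sequences"
print(model(x).shape)                      # torch.Size([8, 1])
```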

Code repo (including links to the data!) here

Good Twitter thread from the co-lead author with some more context here

Edit to add: I think it could be interesting to replicate this in fastai, using as many tricks as we know, but at 30 GB of data I assume it would take a while to train!

2 Likes