Data science at a molecular diagnostic company here. Glad to see this post.
One thing I would like to share is this collaborative review of deep learning, called Deep Review, which @MicPie mentioned. I am particularly interested in the DANN score for potentially linking multiple SNPs to a trait as linear weights; here is the link to the paper: https://academic.oup.com/bioinformatics/article/31/5/761/2748191
I am super curious if anyone in the fastai community has taken a look at this paper and thought about how fastai can fit into the picture.
On first glance, the publication Ching et al., bioRxiv, 2017 that the Greene Lab's GitHub led to seems worth reading.
This "perspective" paper, Zou et al., Nature Genetics, 2019, by authors out of Stanford, CZ Biohub, Scripps, etc., might also be a nice introduction to deep learning in genomics. @axelstram
I'm happy to advise on writing scientific papers, but I have a lot to learn in relation to writing deep learning papers.
An outline for a scientific paper is generally:
Introduction
Start with a few lines that explain what motivated your work - what the field was missing
Briefly summarize the work that came before yours and made your work possible. Include anything that stimulated your ideas
In the final paragraph write what you aimed to do and what you found
Results & Discussion
Explain the steps you took to generate some results, state the results, and relate your results to the results that others had obtained
If others haven't shown results like yours, state something like "to our knowledge, this is the first time this type of results has been shared"
Repeat for each logical step, including your final result.
Conclusion
The first line should summarize what you found in a broader sense than what you wrote in the introduction
State why what you did is important in a broad sense
Relate what you did to what others have done
Hypothesize about what could come next
Methods
Walk people through the more minute details of how you did stuff
Generally, scientists will write their methods section first, then the results/discussion section, followed by the introduction, and finish with the conclusion.
Hey Thom, I think your hunch is right: there is some data that is determined not to be of good enough quality to pass a threshold, because it is unclear how to determine where the read should actually "align", or which part of that squiggle is correct. Often the scientific way of looking into these areas would be to use a different sequencing method that targets the questionable region. This means that companies would have train/test data to use to build a model.
Initially I thought we should move it out, but now that I've had a look around, I don't think there is much related activity outside of part 2. So I think we can wait until the end of these classes.
While trying to understand the difference between transductive and inductive transfer learning, I googled and found a Quora answer by Waleed Kadous (AI PhD) that was easy to understand. It's pasted below:
Imagine you have training data, but only a subset of it has labels.
For example, say you are trying to classify whether an image has a flower in it or not. You have 100,000 images, but you only have 1,000 images that you know definitively contain a flower, and another 1,000 that you know don't contain a flower. The other 98,000 you have no idea about: maybe they have flowers, maybe they don't.
Inductive learning works by looking at the 2,000 labeled examples and building a classifier on this. Transductive learning (also known as semi-supervised learning) says: "Wait: maybe the other 98,000 images don't have labels, but they tell me something about the problem space. Maybe I can still use them to help improve my accuracy."
Bonus: There's one more really interesting type of learning, which is active learning. That is when you look through the 98,000 examples and can select a subset and request labels from an oracle. So the algorithm might say: "OK, of those 98,000, can you label this set of 50 images for me? That will help me build a better classifier."
So my understanding is:
Inductive transfer learning only uses labeled data
Transductive transfer learning uses labeled and unlabeled data to get an idea of the scope of the data that the model will attempt to classify
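The contrast above can be sketched with a toy 1-D nearest-neighbour example (all data made up for illustration): the inductive classifier consults only the labeled points, while the transductive/semi-supervised one first pseudo-labels the unlabeled pool and then reuses those points, so a cluster of unlabeled examples can shift its decision.

```python
# Toy 1-D sketch (hypothetical data) contrasting inductive and transductive
# use of unlabeled points with a 1-nearest-neighbour rule.

def nearest_label(x, points):
    """Label of the nearest (position, label) pair to x."""
    return min(points, key=lambda p: abs(p[0] - x))[1]

labeled = [(0.0, "no_flower"), (10.0, "flower")]  # the two known examples
unlabeled = [6.0, 7.0, 8.0, 9.0]                  # an unlabeled cluster

query = 4.0

# Inductive: only the labeled set is used; 0.0 is nearer than 10.0.
inductive = nearest_label(query, labeled)

# Transductive/semi-supervised step: pseudo-label the pool first, then
# classify the query using labeled + pseudo-labeled points. The unlabeled
# cluster around 6-9 gets pseudo-labeled "flower" and pulls the boundary.
pseudo = [(x, nearest_label(x, labeled)) for x in unlabeled]
transductive = nearest_label(query, labeled + pseudo)

print(inductive, transductive)  # no_flower flower
```

The predictions differ precisely because the unlabeled points told the second classifier something about the shape of the data, which is the point of the Quora answer.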
According to your Quora quote, transductive learning is another name for semi-supervised learning.
A quick search using the term "semi-supervised" instead of "transductive" yielded this article https://arxiv.org/abs/1812.05313 that might shed light on your question.
How could I bulk-download genome data from the web? I am trying to replicate @KarlH's notebooks on my computer, and downloading a few dozen bacterial sequences from the NCBI database took an hour or so of tedious clicking. Is there a better way?
I downloaded .fna files for the entire collection of bacteria using a regular browser. It is a 15 GB download, and it did not take too long. I am training a language model on 500 genomes now; one epoch on my GPU takes 5 hours. I will keep you posted on the results.
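For anyone who wants to script the download instead of clicking, here is a rough sketch. It assumes the usual NCBI RefSeq FTP layout, where `assembly_summary.txt` is a tab-separated table whose `ftp_path` column (index 19 in my copy; check the header of your download) points at a directory containing a `*_genomic.fna.gz` named after that directory. Treat the URL and column index as assumptions to verify against the file you actually get.

```python
# Hedged sketch: bulk-download bacterial genome FASTAs from NCBI RefSeq.
# The summary URL and the ftp_path column index are assumptions about the
# current NCBI FTP layout; verify them before running a large download.
import urllib.request

SUMMARY_URL = "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt"

def genomic_fna_url(ftp_path):
    """Build the URL of the gzipped genome FASTA inside one assembly directory."""
    accession = ftp_path.rsplit("/", 1)[-1]          # e.g. GCF_000005845.2_ASM584v2
    return ftp_path + "/" + accession + "_genomic.fna.gz"

def download_some(n=10):
    """Fetch the summary table and download the first n genome FASTAs."""
    with urllib.request.urlopen(SUMMARY_URL) as resp:
        lines = resp.read().decode().splitlines()
    done = 0
    for line in lines:
        if line.startswith("#"):                     # skip comment/header rows
            continue
        ftp_path = line.split("\t")[19]              # assumed 'ftp_path' column
        url = genomic_fna_url(ftp_path)
        urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
        done += 1
        if done >= n:
            break

# URL construction only (no network needed):
print(genomic_fna_url(
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000005845.2_ASM584v2"))
```

Call `download_some(50)` (or whatever count you need) to replace the hour of clicking with one loop.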
Training on 1000 genomes now. Accuracy has improved to 0.19, breaking through the previous barrier of 0.18. That is a good sign, because so far I have gone through less than 10% of the training data. It is painfully slow, but training is going in the right direction: accuracy keeps improving and the model is not overfitting yet.
If any of you know how to interpret the accuracy metric in the context of a language model, please explain it to me. I only understand accuracy in the context of classification, and a language model is not classification, or is it?
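It actually is classification: at every position the language model predicts the next token out of the whole vocabulary, so accuracy is the fraction of positions where the highest-scoring token matches the actual next token. A minimal illustration with made-up tokens:

```python
# A language model is a classifier at every position: it predicts the next
# token out of the vocabulary. Accuracy = fraction of positions where the
# model's top choice equals the actual next token. Toy data for illustration.

targets     = ["A", "C", "G", "T", "A"]   # actual next tokens
predictions = ["A", "C", "T", "T", "G"]   # model's argmax at each position

correct = sum(p == t for p, t in zip(predictions, targets))
accuracy = correct / len(targets)
print(accuracy)   # 3 of 5 positions match -> 0.6
```

One caveat for genomic models: what counts as "chance level" depends on the tokenizer. If tokens are k-mers rather than single bases, the vocabulary is large and an accuracy like 0.18-0.19 can be well above random guessing.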