Deep Learning for Genomics/Bioinformatics/Comp Bio

Data science at a molecular diagnostic company here. Glad to see this post.

One thing I would like to share is this collaborative review of deep learning, called Deep Review, which @MicPie mentioned. I am particularly interested in the DANN score for potentially linking multiple SNPs to a trait as linear weights. Here is the link to the paper:
https://academic.oup.com/bioinformatics/article/31/5/761/2748191
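For intuition only: DANN (linked above) scores variant pathogenicity with a deep feedforward network over annotation features. Below is a toy stand-in with one hidden layer and made-up layer sizes, just to show the shape of that computation; it is not the published architecture or its features.

```python
import math
import random

random.seed(0)

def dense(x, w, b, act):
    """One fully connected layer: act(Wx + b)."""
    return [act(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

relu = lambda z: max(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

n_features, n_hidden = 8, 4  # hypothetical sizes, not DANN's real ones
w1 = [[random.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[random.uniform(-1, 1) for _ in range(n_hidden)]]
b2 = [0.0]

# One variant's annotation features -> a pathogenicity-style score in (0, 1).
features = [random.uniform(0, 1) for _ in range(n_features)]
score = dense(dense(features, w1, b1, relu), w2, b2, sigmoid)[0]
```

In the DANN framing, a higher score means the variant looks more deleterious; the real model is trained on millions of labeled variants.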

I am super curious if anyone in the fastai community has taken a look at this paper and thought about how fastai can fit into the picture.

Shameless plug: my LinkedIn

Thanks for making this thread!!

This specialization gives a good introduction to get you up to speed: Genomic Data Science specialization, maybe alongside this book: Python for Biologists. I've already used both, and they're definitely worth checking out.

3 Likes

Gave this a quick once-over. The first thing that came to mind was Illumina's PrimateAI for classifying the pathogenicity of missense mutations.

At first glance, the publication Ching et al., bioRxiv, 2017, which the Greene Lab's GitHub led to, seems worth reading.

This 'perspective' paper Zou et al., Nature Genetics, 2019, by authors from Stanford, CZ Biohub, Scripps, etc., might also be a nice introduction to deep learning in genomics. @axelstram

I'm happy to advise on writing scientific papers, but I have a lot to learn in relation to writing deep learning papers.

An outline for a scientific paper is generally:

Introduction

  • Start with a few lines that explain what motivated your work - what the field was missing
  • Briefly summarize the work that came before yours and made your work possible. Include anything that stimulated your ideas
  • In the final paragraph write what you aimed to do and what you found

Results & Discussion

  • Explain the steps you took to generate the results, state them, and relate them to the results that others have obtained
  • If others haven't shown results like yours, state something like "to our knowledge, this is the first instance where this type of result has been shared"
  • Repeat for each logical step, including your final result.

Conclusion

  • The first line should summarize what you found in a broader sense than what you wrote in the introduction
  • State why what you did is important in a broad sense
  • Relate what you did to what others have done
  • Hypothesize about what could come next

Methods

  • Walk people through the more minute details of how you did stuff

Generally, scientists will write their methods section first, then the results/discussion section, followed by the introduction, and finish with the conclusion.

Here's a description of how to write a scientific paper from Nature.

4 Likes

Hey Thom, I think your hunch is right: some data is determined not to be of good enough quality to pass a threshold, because it is unclear where the read should actually 'align' or which part of that squiggle is correct. Often the scientific way of looking into these areas is to use a different sequencing method that targets the questionable region. This means that companies would have train/test data to use to build a model :wink:

I will answer my own question with what I have found:

Molecular Biology for Computer Scientists (book chapter)

Bioinformatics Algorithms: An Active Learning Approach (very hands-on book that also has a (very) long Coursera specialization associated with it)

3 Likes

Also check out Biostars book & website

2 Likes

Hey all,
Quick question: should we move this conversation into the open or keep it here, locked up within part 2 until the end of the class?

Below are some other fast.ai boards with similar types of topics:

Deep Learning

Part 1 2019

Initially I thought we should move it out, but now that I've had a look around, I don't think there is much related activity outside of part 2. So, I think we can wait until the end of these classes.

4 Likes

Hey all,
I'm trying to read the paper "Universal Language Model Fine-tuning for Text Classification" that Jeremy and Sebastian Ruder published last year (2018).

In trying to understand the difference between transductive and inductive transfer learning, I googled and found a Quora answer by Waleed Kadous (AI PhD) that was easy to understand. I've pasted it below:

Imagine you have training data, but only a subset of it has labels.

For example, say you are trying to classify whether an image has a flower in it or not. You have 100,000 images, but you only have 1,000 images that you know definitively contain a flower, and another 1,000 that you know don't contain a flower. The other 98,000 you have no idea about: maybe they have flowers, maybe they don't.

Inductive learning works by looking at the 2,000 labeled examples and building a classifier on this. Transductive learning (also known as semi-supervised learning) says, "Wait: maybe the other 98,000 images don't have labels, but they tell me something about the problem space. Maybe I can still use them to help improve my accuracy."

Bonus: There's one more really interesting type of learning, which is active learning. That is when you look through the 98,000 examples and can select a subset and request labels from an oracle. So the algorithm might say "OK, of those 98,000, can you label this set of 50 images for me? That will help me build a better classifier."

So my understanding is:

  • Inductive transfer learning only uses labeled data

  • Transductive transfer learning uses labeled and unlabeled data to get an idea of the scope of the data that the model will attempt to classify
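A toy sketch of that second bullet, under my own simplifying assumptions (pure Python, 1-D points, nearest-neighbour self-training; this is just one flavour of semi-supervised learning, not anything from the paper). Unlabeled points borrow the label of their closest labeled neighbour and then join the labeled pool, so they can relay labels onward:

```python
def self_train(labeled, unlabeled):
    """labeled: list of (x, label) pairs; unlabeled: list of x values.

    Repeatedly labels the pending point closest to any labeled point,
    so pseudo-labels propagate outward through the unlabeled data.
    """
    labeled = list(labeled)
    pending = list(unlabeled)
    while pending:
        # For each pending point, find its nearest labeled neighbour,
        # then commit the globally closest (point, neighbour) pair.
        x, (_, lbl) = min(
            ((u, min(labeled, key=lambda p: abs(p[0] - u))) for u in pending),
            key=lambda pair: abs(pair[1][0] - pair[0]),
        )
        labeled.append((x, lbl))
        pending.remove(x)
    return sorted(labeled)

# Two labeled images at 0.0 ("no_flower") and 10.0 ("flower"),
# three unlabeled ones in between.
result = self_train([(0.0, "no_flower"), (10.0, "flower")], [1.0, 4.0, 6.5])
```

The point of the demo: a purely inductive nearest-neighbour rule using only the two original labels would call 6.5 a "flower" (it is nearer to 10.0 than to 0.0), but the chain of pseudo-labeled points at 1.0 and 4.0 pulls it to "no_flower". The unlabeled data changed the decision, which is exactly the transductive idea.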

We might be discussing unpublished research shared by Jeremy and this might be another good reason to keep it in the private forum.

1 Like

According to your Quora quote, transductive learning is another name for semi-supervised learning.

A quick search using the term "semi-supervised" instead of the word "transductive" yielded this article https://arxiv.org/abs/1812.05313, which might shed light on your question

1 Like

Check out this Commonwealth Club podcast on "Deep Medicine", a new book by Eric Topol, M.D. (Scripps)

Link to Commonwealth Club iTunes page (Deep Medicine release date: 4/2)

1 Like

Eric Topol and I both gave talks at Scripps recently. I called mine "Even Deeper Medicine" :slight_smile:

14 Likes

I just signed up for this website. Thank you for the reference.

I am interested in sharing my own medical data for research. How should I go about getting it from my providers, and how should I share it?

How could I bulk-download genome data from the web? I am trying to replicate @KarlH's notebooks on my computer, and downloading a few dozen bacterial sequences from the NCBI database took an hour or so of tedious clicking. Is there a better way?

Downloads go a lot faster with wget or curl

https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/
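To make that scriptable: NCBI assembly directories conventionally contain a file named `<directory basename>_genomic.fna.gz`, and the per-assembly directory paths (`ftp_path`) are listed in NCBI's `assembly_summary.txt`. A small sketch, assuming that naming convention, to turn each `ftp_path` into a direct download URL you can feed to `wget`/`curl` instead of clicking:

```python
from urllib.parse import urlsplit
import posixpath

def genomic_fna_url(ftp_path):
    """Build the genomic FASTA (.fna.gz) URL for one NCBI assembly directory."""
    name = posixpath.basename(urlsplit(ftp_path).path)
    return f"{ftp_path}/{name}_genomic.fna.gz"

# Example: E. coli K-12 reference assembly directory.
url = genomic_fna_url(
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2"
)
```

Write one URL per line to a file, then `wget -i urls.txt` fetches them all in one go.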

1 Like

I downloaded the .fna files for the entire collection of bacteria using a regular browser. It is a 15 GB download and did not take too long. I am training a language model on 500 genomes now; one epoch on my GPU takes 5 hours. I will keep you posted on the results
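For anyone following along: before a genome can feed a language model, the raw DNA string has to be split into "words". A common choice is overlapping k-mers (I believe the genomic-LM notebooks discussed here do something similar; the `k` and stride below are illustrative, not taken from them):

```python
def kmer_tokens(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer tokens."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokens("ACGTACGT", k=3)
# tokens -> ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']
```

With k=3 and stride 1 the vocabulary has at most 4^3 = 64 tokens (plus special tokens); larger k or a bigger stride trades sequence length against vocabulary size.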

1 Like

Training on 1,000 genomes now. Accuracy improved to 0.19, breaking through the previous barrier of 0.18. That is a good sign, because so far I have gone through less than 10% of the training data. It is painfully slow, but training is going in the right direction: accuracy improves and the model is not overfitting yet.
If any of you know how to interpret the accuracy metric in the context of a language model, please explain it to me. I only understand accuracy in the context of classification, and a language model is not classification, or is it?
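My understanding (happy to be corrected): a language model *is* a classifier. At every position it predicts the next token out of the whole vocabulary, so accuracy is the fraction of positions where the highest-scoring token equals the true next token; 0.19 means the model guesses the next token exactly right about 19% of the time. A toy sketch with a made-up 4-token vocabulary, where per-token scores stand in for the model's output:

```python
def lm_accuracy(score_rows, targets):
    """score_rows: one {token: score} dict per position (the model's output
    distribution over the vocabulary); targets: the true next tokens.
    Accuracy = fraction of positions where the argmax token is correct."""
    hits = sum(max(scores, key=scores.get) == t
               for scores, t in zip(score_rows, targets))
    return hits / len(targets)

scores = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},   # model predicts A
          {"A": 0.2, "C": 0.5, "G": 0.2, "T": 0.1},   # model predicts C
          {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}]   # model predicts T
acc = lm_accuracy(scores, ["A", "C", "G"])  # 2 of 3 positions correct
```

Note the baseline: with a k-mer vocabulary of size V, random guessing gives roughly 1/V accuracy, so whether 0.19 is good depends on how big your vocabulary is.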

3 Likes