DL and Genomics


(Akhil Rajput) #1

Hi,

Could DL be employed to analyze genomics data, which is much wider (1 million to 30 million columns) than it is long (10K to 100K rows)? Any suggestions, resources, or examples would be highly appreciated.

Thanks


(Jeremy Howard) #2

Here’s a recent review paper:

https://arxiv.org/abs/1802.00810


(Akhil Rajput) #3

Awesome, I will go through the suggested review. Thanks


(Cedric Chee) #4

Hi. I have worked on inferring genomic information from MRI images and using it to predict cancer growth in patients in an Asian context. I am also currently self-learning bioinformatics and applying it to genomics data in my side projects.

I learned about the deep learning in biology and medicine landscape through the paper “Opportunities and obstacles for deep learning in biology and medicine”. I think it is quite useful.

I also followed the work of:

RNN/NLP in genomics:

  • Finding functional regions (e.g., regulatory regions)
  • DeepGo (protein function classification from sequence)

If you are looking for a cutting-edge model zoo for regulatory genomics, including Google DeepVariant, check out:

This is not meant to be exhaustive; that would be almost impossible, as there’s so much out there now. Here’s one giant list of deep learning implementations in biology. :roll_eyes:


(Akhil Rajput) #6

Thanks for the fantastic information.

What kind of DL / genomics projects are you doing or have you done? Any insight will be really helpful.

Thanks a lot


(Ziga Avsec) #7

@caspase8 Absolutely. What use cases do you have in mind?

In genomics, there are generally two complementary approaches to understanding the relationship between the genotype (DNA sequence) and the phenotype (say, disease status). Ideally we would like to have a function phenotype = f(genotype, environment) which could predict the outcome (say, health) from your genetic background and your lifestyle (environment). Having such a function would be immensely important, as you could start asking questions like: which mutations increase the risk of having the disease? Knowing the problematic mutations helps pinpoint the mechanism of the disease, as you know its exact origin.

1. Genome-wide association studies (GWAS) (top-down)

The classical way is to sequence many people (say 500k - http://www.ukbiobank.ac.uk/) and then try to find statistical associations between individual mutations and the phenotype. What you practically end up with is a binary mutation matrix X of shape (1e5, 1e7) (the first axis indexes people, the second axis indexes mutations) and a binary vector y of shape (1e5,) denoting the phenotype of interest (say 1=disease, 0=healthy).

Then you run 1e7 statistical tests, test(y, X[:, i]), asking for each variant whether it is associated with the phenotype (see the sketch after this list). There are several problems with this approach:

  1. It ignores all prior knowledge about the genome, its organization, and its function.
  2. Since one has to correct for multiple testing, the statistical power of detecting a significant association is low.
  3. “Correlation does not imply causation.” Even if we observe a statistical association, this doesn’t mean that the link is causal (e.g. the mutation might just be correlated in the population with the true causal variant). Genetic variants (another word for mutations) are co-inherited (a phenomenon called linkage disequilibrium: the columns of X are not all independent of each other). There have been some interesting attempts recently to tackle the correlation vs. causation problem (Implicit Causal Models for Genome-wide Association Studies). However, much more work will be needed in this direction to show its utility in practice.
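
To make the setup concrete, here is a minimal sketch of such an association scan on simulated data. The shapes are shrunk drastically so it runs in seconds, and the Fisher test plus Bonferroni correction are illustrative choices on my part (real GWAS pipelines use dedicated tools such as PLINK, not a Python loop):

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n_people, n_variants = 1_000, 2_000            # toy stand-ins for 1e5 and 1e7
X = rng.integers(0, 2, size=(n_people, n_variants))  # binary mutation matrix
y = rng.integers(0, 2, size=n_people)                # 1=disease, 0=healthy

p_values = np.empty(n_variants)
for i in range(n_variants):                    # one test per variant: test(y, X[:, i])
    carrier = X[:, i] == 1
    table = [
        [int(np.sum(carrier & (y == 1))), int(np.sum(carrier & (y == 0)))],
        [int(np.sum(~carrier & (y == 1))), int(np.sum(~carrier & (y == 0)))],
    ]
    _, p_values[i] = fisher_exact(table)

# Multiple-testing correction (problem 2 above): with 1e7 tests the
# Bonferroni threshold becomes extremely stringent, hence the low power.
alpha = 0.05 / n_variants
print(f"{np.sum(p_values < alpha)} variants pass the corrected threshold {alpha:.1e}")
```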

2. Functional genomics (bottom-up)

The goal of functional genomics is to understand the impact of mutations on molecular phenotypes. Molecular phenotypes are, for example, the amounts of certain proteins in the cell. The advantage of this approach is that one can measure thousands or millions of molecular phenotypes in the cell for each person (say the abundance of 20,000 proteins). Each molecular phenotype is typically related only to a known subpart of the genome (say 1000bp of DNA sequence). Hence, one can build a model (typically a CNN) to predict those molecular phenotypes from the DNA sequence itself. Mutations are then scored as the difference in predictions between the sequence with and without the mutation of interest (see the DeepSEA paper); a minimal sketch of this scoring follows below. We can then use these predicted differences as features and associate them with the phenotype of interest.
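
Here is a hedged sketch of that DeepSEA-style scoring with a toy, untrained 1D CNN in PyTorch. The architecture, sequence length, and the variant position are all illustrative stand-ins; in practice you would train the model on measured molecular phenotypes or use a published one:

```python
import torch
import torch.nn as nn

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (4, len) float tensor."""
    x = torch.zeros(4, len(seq))
    for i, b in enumerate(seq):
        x[BASES.index(b), i] = 1.0
    return x

class TinySeqCNN(nn.Module):
    """Toy stand-in for a DeepSEA-like model: sequence -> molecular phenotype."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=11, padding=5),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
            nn.Flatten(),
            nn.Linear(32, 1),   # e.g. predicted protein abundance
        )

    def forward(self, x):
        return self.net(x)

model = TinySeqCNN().eval()
ref = "ACGT" * 250                  # toy 1kb reference sequence
alt = ref[:500] + "T" + ref[501:]   # same sequence with one substituted base

with torch.no_grad():
    score_ref = model(one_hot(ref).unsqueeze(0))
    score_alt = model(one_hot(alt).unsqueeze(0))

# DeepSEA-style effect score: prediction difference between alt and ref.
print(f"predicted variant effect: {(score_alt - score_ref).item():+.4f}")
```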

The advantage is twofold:

  1. More data: for a single sample (say, a person) we fit a model on 20k sequences of length 1kb instead of a single sequence of length 3*1e9.
  2. Stronger link between the sequence and molecular phenotypes.

The problems of this approach are:

  1. Not everything we would like to know can be measured
  2. Measurements are often noisy and require careful pre-processing
  3. Biology can be very complex (e.g. involves a lot of moving parts)

Combined approach

I think the way forward is to combine these two approaches, using functional genomics as a prior for the association studies (a toy sketch follows below). This was actually my original motivation for building Kipoi.
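
As a toy illustration of this combination (everything here is simulated; `effect_scores` stands in for predictions from a sequence model such as the CNN sketched above, or a trained model from Kipoi):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_people, n_variants = 1_000, 2_000
X = rng.integers(0, 2, size=(n_people, n_variants))   # genotypes, as in the GWAS sketch
y = rng.integers(0, 2, size=n_people)                 # phenotype

# Predicted molecular impact per variant (simulated stand-in).
effect_scores = rng.normal(size=n_variants)

# Collapse the huge genotype matrix into one functional "burden" score per
# person by weighting each carried variant with its predicted effect.
burden = X @ effect_scores

# A single, better-powered test instead of 1e7 Bonferroni-corrected ones:
stat, p = ttest_ind(burden[y == 1], burden[y == 0])
print(f"t = {stat:.2f}, p = {p:.3f}")
```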

Other links: