DL and Genomics

Hi,

Could DL be employed to analyze genomics data, which is much, much wider (1 million to 30 million columns) than it is long (10K to 100K rows)? Any suggestions, resources, or examples would be highly appreciated.

Thanks

3 Likes

Here’s a recent review paper:

https://arxiv.org/abs/1802.00810

4 Likes

Awesome, I will go through the suggested review. Thanks

Hi. I have worked on inferring genomic information from MRI images and using it to predict cancer growth in patients in an Asian context. I am also currently self-learning bioinformatics and applying it to genomics data in my side projects.

I learned about the deep learning landscape in biology and medicine through the paper “Opportunities and obstacles for deep learning in biology and medicine”. I think it is quite useful.

I also followed the work of:

RNN/NLP in genomics:

  • Finding functional regions (e.g., regulatory regions)
  • DeepGo (protein function classification from sequence)

If you are looking for a cutting-edge model zoo for regulatory genomics, including Google DeepVariant, check out:

This is not meant to be exhaustive. That would be almost impossible, as there’s so much out there now. Here’s one giant list of deep learning implementations in biology. :roll_eyes:

17 Likes

Thanks for the fantastic information.

What kind of DL/genomics projects are you doing or have you done? Any insight would be really helpful.

Thanks a lot

@caspase8 Absolutely. What use-cases do you have in mind?

In genomics, there are generally two complementary approaches to understanding the relationship between genotype (DNA sequence) and phenotype (say, disease status). Ideally we would like to have a function, phenotype = f(genotype, environment), which could predict the outcome (say, health) from your genetic background and your lifestyle (environment). Having such a function would be immensely valuable, as you could start asking questions like which mutations increase the risk of having the disease. Knowing the problematic mutations helps pinpoint the mechanism of the disease, since you know its exact origin.

1. Genome-wide association studies (GWAS) (top-down)

The classical way is to sequence many people (say 500k - http://www.ukbiobank.ac.uk/) and then try to find statistical associations between individual mutations and the phenotype. What you practically end up with is a binary matrix of mutations X of shape (1e5, 1e7) (the first axis indexes people and the second axis indexes mutations) and a binary vector y of shape (1e5,) denoting the phenotype of interest (say 1=disease, 0=healthy).
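To make the shapes concrete, here is a toy sketch (sizes scaled way down from the real 1e5 × 1e7; the variable names and random data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down toy GWAS matrix: rows = people, columns = variants (mutations).
# Real studies would be roughly (1e5, 1e7) instead of (1000, 5000).
n_people, n_variants = 1_000, 5_000
X = rng.integers(0, 2, size=(n_people, n_variants)).astype(np.int8)
y = rng.integers(0, 2, size=n_people).astype(np.int8)  # 1 = disease, 0 = healthy

print(X.shape, y.shape)  # (1000, 5000) (1000,)
```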

Then you do 1e7 statistical tests, test(y, X[:, i]), asking whether a particular variant is associated with the phenotype. There are several problems with this approach:

  1. One ignores all prior knowledge about the genome, its organization, and its function.
  2. Since one has to correct for multiple testing, the statistical power to detect a significant association is low.
  3. “Correlation does not imply causation.” Even if we observe a statistical association, this doesn’t mean the link is causal (e.g. the mutation might just be correlated in the population with the true causal variant). Genetic variants (another word for mutation) are co-inherited (a phenomenon called linkage disequilibrium; the columns of X are not all independent of each other). There have been some interesting recent attempts to tackle the correlation vs. causation problem (Implicit Causal Models for Genome-wide Association Studies). However, much more work will be needed in this direction to show its utility in practice.
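The per-variant testing loop above can be sketched with standard-library Python only (a two-proportion z-test stands in for whatever association test a real pipeline would use; the data and function names are illustrative):

```python
import math
import random

random.seed(0)
n_people, n_variants = 500, 200  # toy sizes; real GWAS: ~1e5 x ~1e7
X = [[random.randint(0, 1) for _ in range(n_variants)] for _ in range(n_people)]
y = [random.randint(0, 1) for _ in range(n_people)]

def prop_z_test(y, x):
    """Two-proportion z-test: is the variant frequency different
    between cases (y == 1) and controls (y == 0)?"""
    cases = [xi for xi, yi in zip(x, y) if yi == 1]
    ctrls = [xi for xi, yi in zip(x, y) if yi == 0]
    p1, p0 = sum(cases) / len(cases), sum(ctrls) / len(ctrls)
    p = (sum(cases) + sum(ctrls)) / (len(cases) + len(ctrls))
    se = math.sqrt(p * (1 - p) * (1 / len(cases) + 1 / len(ctrls)))
    z = (p1 - p0) / se if se > 0 else 0.0
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# One test per variant column: test(y, X[:, i])
pvals = [prop_z_test(y, [row[i] for row in X]) for i in range(n_variants)]

# Bonferroni correction: the significance threshold shrinks with the
# number of tests, which is why per-variant power is low (point 2 above).
alpha = 0.05 / n_variants
significant = [i for i, p in enumerate(pvals) if p < alpha]
```

With purely random toy data, `significant` will almost always be empty, which is exactly the multiple-testing penalty at work.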

2. Functional genomics (bottom-up)

The goal of functional genomics is to understand the impact of mutations on molecular phenotypes. Molecular phenotypes are, for example, the amounts of certain proteins in the cell. The advantage of this approach is that one can measure thousands or millions of molecular phenotypes in the cell for each person (say the abundance of 20000 proteins). Each molecular phenotype is typically related only to a known subpart of the genome (say 1000bp of DNA sequence). Hence, one can build a model (typically a CNN) to predict those molecular phenotypes from the DNA sequence itself. Mutations are then scored as the difference in predictions between the sequence with and without the mutation of interest (see the DeepSEA paper). We can then use these predicted differences as features and associate them with the phenotype of interest.
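The scoring idea can be sketched as follows (a DeepSEA-style prediction difference; `toy_model` is a stand-in for a trained CNN, not a real one, and all names here are illustrative):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA sequence into shape (len, 4) -- typical CNN input."""
    idx = {b: i for i, b in enumerate(BASES)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq):
        out[pos, idx[base]] = 1.0
    return out

def toy_model(x):
    """Stand-in for a trained CNN predicting a molecular phenotype
    (e.g. protein abundance) from sequence; here just a fixed linear map."""
    rng = np.random.default_rng(0)          # fixed weights for determinism
    w = rng.normal(size=x.shape)
    return float((x * w).sum())

def variant_effect(ref_seq, pos, alt_base, model=toy_model):
    """Score a mutation as: model(mutated sequence) - model(reference sequence)."""
    mut_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return model(one_hot(mut_seq)) - model(one_hot(ref_seq))

ref = "ACGTACGTAC"
score = variant_effect(ref, pos=3, alt_base="A")  # T -> A at position 3
```

Swapping `toy_model` for a real trained network gives the in-silico mutagenesis scores that are then used as features downstream.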

The advantage is twofold:

  1. More data: For a single sample (say person) we fit a model from 20k sequences of length 1kb instead of a single sequence of length 3*1e9.
  2. Stronger link between the sequence and molecular phenotypes.

The problems of this approach are:

  1. Not everything we would like can be measured
  2. Measurements are often noisy and require careful pre-processing
  3. Biology can be very complex (e.g. involves a lot of moving parts)

Combined approach

I think the way forward is to combine these two approaches, using functional genomics as a prior for the association studies. This was actually my original motivation for building Kipoi.

Other links:

7 Likes

So glad to find this topic. I have similar interests in applying DL to genomics-related problems. By any chance, are any of you attending the fast.ai v3 course starting soon? It would be neat to collaborate and work on a real-world problem.

3 Likes

Hello everybody,

I would also be interested in DL for genomics, and I will attend the v3 course!

I recently tried to set up an ULMFiT RNN for sequence classification with the data from the last PrecisionFDA CDRH Biothreat Challenge, but was not successful due to the sheer amount of data (training took forever for a single epoch).

I am not a bioinformatics expert, but I would still be interested in similar projects! :slight_smile:

Maybe we can join forces to share interesting literature, projects, repos, and upcoming PrecisionFDA competitions?

Best regards
Michael

PS: This thread could be also of interest: Autoencoder for gene sequences

2 Likes

Definitely would be interested in joining forces and being able to come up with a cool use case.

Definitely would be interested in collaborating. I wrote the original post and would be very keen on joining forces.

1 Like

Firstly, congratulations on publishing the fastai book with O’Reilly. That’s in fact how I found out about fastai, and I can’t wait to receive my advance copy.

I am a bioinformatician (Python/shell/HPC) and am very interested in deploying fastai to understand antibody binding. The ease of generating a model is extremely impressive, and it frees up much more time to be dedicated to optimising the model, which is what I do: explore biological parameter space. We certainly have the data to train an ANN, and I have a neat (in my opinion, of course!) way to vectorise amino acid sequences and then implement CNNs and RNNs.
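The poster’s own encoding isn’t shown; a common baseline for vectorising amino acid sequences is one-hot encoding over the 20 standard residues, zero-padded to a fixed length for batching (a sketch only, with an illustrative sequence, not their method):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def vectorise(seq, max_len=None):
    """One-hot encode a protein sequence into shape (max_len, 20),
    zero-padded so sequences of different lengths can be batched
    as input to a CNN or RNN."""
    max_len = max_len or len(seq)
    out = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq[:max_len]):
        out[pos, AA_INDEX[aa]] = 1.0
    return out

cdr = "GYTFTSYW"  # an illustrative antibody CDR-like fragment
x = vectorise(cdr, max_len=12)
print(x.shape)  # (12, 20)
```

Learned embeddings (analogous to word embeddings in ULMFiT) are a common alternative to one-hot encoding for protein sequences.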

My perception is that fastai is totally brilliant for established data types, like image recognition via transfer learning and the ULMFiT RNN for NLP. My question is probably naive, but anyway: deploying e.g. a GRU or LSTM on a load of vectorised protein data appears to be more challenging? Or is it simply that I have not learnt enough about fastai? The reason for asking is that in an exploratory application we don’t really know in advance what the right model is.

Maybe FYI: