DL and Genomics

caspase8 · May 14, 2018, 11:30pm

Hi,

Could DL be employed to analyze genomics data, which is far wider (1 million to 30 million columns) than it is long (10K to 100K rows)? Any suggestions, resources, or examples would be highly appreciated.

Thanks

jeremy · May 15, 2018, 5:59pm

Here’s a recent review paper:

https://arxiv.org/abs/1802.00810

caspase8 · May 15, 2018, 9:34pm

Awesome, I will go through the suggested review. Thanks

cedric · May 16, 2018, 2:07pm

Hi. I have worked on inferring genomic information from MRI images and using it to predict cancer growth in patients in an Asian context. I am also currently teaching myself bioinformatics and applying it to genomics data in my side projects.

I learned about the landscape of deep learning in biology and medicine through the paper “Opportunities and obstacles for deep learning in biology and medicine”. I think it is quite useful.

I also followed the work of:

Brendan Frey Lab (functional genomics, gene expression)
Kundaje Lab (gene regulation)

RNN/NLP in genomics:

Finding functional regions (e.g., regulatory regions)
- DeepSEA
- dna2vec
DeepGO (protein function classification from sequence)

If you are looking for the cutting edge model zoo for regulatory genomics including Google DeepVariant, check out:

Kipoi

This is not meant to be exhaustive. Almost impossible as there’s so much out there now. Here’s one giant list of deep learning implementations in biology.

caspase8 · May 17, 2018, 10:31pm

Thanks for the fantastic information.

What kind of DL genomics projects are you doing or have you done? Any insight would be really helpful.

Thanks a lot

Avsecz · June 13, 2018, 6:29am

@caspase8 absolutely. What use-cases do you have in mind?

In genomics, there are generally two complementary approaches to understanding the relationship between the genotype (DNA sequence) and the phenotype (say disease status). Ideally we would like to have a function: phenotype = f(genotype, environment) which could predict the outcome (say health) from your genetic background and your lifestyle (environment). Having such a function would be immensely important, as you could start asking questions like which mutations increase the risk of having the disease. Knowing the problematic mutations helps to pinpoint the mechanism of the disease, as you know its exact origin.

1. Genome-wide association studies (GWAS) (top-down)

The classical way is to sequence many people (say 500k - http://www.ukbiobank.ac.uk/) and then try to find statistical associations between individual mutations and the phenotype. In practice, you end up with a binary matrix of mutations X of shape (1e5, 1e7) (the first axis indexes people and the second axis indexes mutations) and a binary vector y of shape (1e5,) denoting the phenotype of interest (say 1=disease, 0=healthy).
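To give a feel for how wide this matrix is, here is a rough back-of-the-envelope sketch of its memory footprint. The sizes come from the shape quoted above; the variable names and the small demo matrix are only illustrative, and `np.packbits` is just one way to store one bit per genotype.

```python
import numpy as np

# Sizes taken from the shape quoted above: 1e5 people x 1e7 variants.
n_people = 100_000
n_variants = 10_000_000
n_entries = n_people * n_variants

# One byte per entry (e.g. a dense uint8 array).
dense_uint8_tb = n_entries / 1e12
# One bit per entry (binary genotypes packed 8 per byte).
bit_packed_gb = n_entries / 8 / 1e9

print(f"dense uint8: {dense_uint8_tb:.0f} TB, bit-packed: {bit_packed_gb:.0f} GB")

# Small illustration of bit-packing on a toy matrix (4 people x 16 variants).
rng = np.random.default_rng(0)
X_small = rng.integers(0, 2, size=(4, 16), dtype=np.uint8)
packed = np.packbits(X_small, axis=1)            # 16 bits -> 2 bytes per row
restored = np.unpackbits(packed, axis=1)[:, :16]  # round-trips exactly
```

Even bit-packed, the full matrix does not fit in the memory of a typical single machine, which is one reason the tests are usually run variant by variant or in chunks.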

Then you do 1e7 statistical tests: test(y, X[:, i]), asking whether a particular variant is associated with the phenotype. There are several problems with this approach:

- One ignores all the knowledge about the genome, its organization and its function.
- Since one has to correct for multiple testing, the statistical power to detect a significant association is low.
- “Correlation does not imply causation”. Even if we observe a statistical association, this doesn’t mean that the link is causal (e.g. the mutation might just be correlated in the population with the true causal variant). Genetic variants (another word for mutations) are co-inherited (a phenomenon called linkage disequilibrium; the columns of X are not all independent of each other). There have been some interesting attempts recently to tackle the correlation vs. causation problem (Implicit Causal Models for Genome-wide Association Studies). However, much more work is needed in this direction to show its utility in practice.
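The per-variant testing loop described in this section can be sketched roughly as follows. This is a minimal illustration on simulated data, not a real GWAS pipeline: the sample sizes, the index of the "causal" variant, the effect size, the choice of a Pearson chi-square test and the Bonferroni correction are all assumptions made for the example. Variants here are simulated independently, so linkage disequilibrium is not modelled.

```python
import math
import numpy as np

def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def association_test(y, x):
    """Pearson chi-square test on the 2x2 table of phenotype vs. variant."""
    a = float(np.sum((x == 1) & (y == 1)))  # carriers with disease
    b = float(np.sum((x == 1) & (y == 0)))  # carriers, healthy
    c = float(np.sum((x == 0) & (y == 1)))  # non-carriers with disease
    d = float(np.sum((x == 0) & (y == 0)))  # non-carriers, healthy
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence either way
    stat = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    return chi2_pvalue_1df(stat)

# Simulated toy data (all numbers are hypothetical and much smaller than 1e5 x 1e7).
rng = np.random.default_rng(0)
n_people, n_variants, causal = 2000, 500, 42
X = rng.integers(0, 2, size=(n_people, n_variants))
disease_prob = 0.2 + 0.5 * X[:, causal]  # only one variant affects the phenotype
y = (rng.random(n_people) < disease_prob).astype(int)

# One test per variant: test(y, X[:, i]).
pvals = np.array([association_test(y, X[:, i]) for i in range(n_variants)])

# Bonferroni correction: divide the significance level by the number of tests.
alpha = 0.05
threshold = alpha / n_variants
significant = np.flatnonzero(pvals < threshold)
print("significant variants:", significant)
```

With 1e7 tests instead of 500, the same correction pushes the threshold down to 5e-9, which illustrates why statistical power becomes a problem.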

2. Functional genomics (bottom-up)

The goal of functional genomics is to understand the impact of mutations on molecular phenotypes. Molecular phenotypes are, for example, the amounts of certain proteins in the cell. The advantage of this approach is that one can measure thousands or millions of molecular phenotypes in the cell for each person (say the abundance of 20000 proteins). Each molecular phenotype is typically related only to a known subpart of the genome (say 1000bp of DNA sequence). Hence, one can build a model (typically a CNN) to predict those molecular phenotypes from the DNA sequence itself. Mutations are then scored as the difference in predictions between the sequence with and without the mutation of interest (see the DeepSEA paper). We can then use this predicted difference as a feature and associate it with the phenotype of interest.
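The variant scoring idea (prediction with the mutation minus prediction without it) can be sketched as below. This is only an illustration: `toy_model` is a hypothetical stand-in for a trained CNN (a single hand-set convolution filter followed by global max pooling), and the sequence, motif and positions are made up. It is not the DeepSEA model or its API.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len, 4) one-hot array (columns A, C, G, T)."""
    idx = np.array([BASES.index(base) for base in seq])
    return np.eye(4)[idx]

def toy_model(x, motif):
    """Hypothetical stand-in for a trained CNN: one filter + global max pooling."""
    k = motif.shape[0]
    window_scores = [np.sum(x[i:i + k] * motif) for i in range(x.shape[0] - k + 1)]
    return max(window_scores)

def variant_effect(ref_seq, pos, alt_base, predict):
    """Score a single-base substitution as predict(alt) - predict(ref)."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return predict(one_hot(alt_seq)) - predict(one_hot(ref_seq))

# Hand-set filter that responds to the motif "TATA" (an assumption for the demo).
motif = one_hot("TATA")

def predict(x):
    return toy_model(x, motif)

ref = "GGCTATAGGC"  # made-up reference sequence containing the motif at positions 3-6
effect_in_motif = variant_effect(ref, 4, "G", predict)       # disrupts the motif
effect_outside_motif = variant_effect(ref, 0, "A", predict)  # leaves the motif intact
print(effect_in_motif, effect_outside_motif)
```

In a real setting `predict` would be a model trained on measured molecular phenotypes, and the resulting per-variant scores would then be used as features in the association analysis.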

The advantage is twofold:

- More data: for a single sample (say a person) we fit a model on 20k sequences of length 1kb instead of a single sequence of length 3*1e9.
- Stronger link between the sequence and the molecular phenotypes.

The problems of this approach are:

- Not everything we would like to measure can be measured.
- Measurements are often noisy and require careful pre-processing.
- Biology can be very complex (e.g. it involves a lot of moving parts).

Combined approach

I think the way forward is in combining these two approaches and using functional genomics as a prior for the association studies. This was actually my original motivation to build Kipoi.

Other links:

Here is another list of papers: https://github.com/gokceneraslan/awesome-deepbio

suvash · October 20, 2018, 3:28pm

So glad to find this topic. I have similar interests in applying DL to genomics-related problems. By any chance, are any of you attending the Fastai v3 course starting soon? It would be neat to collaborate and work on a real-world problem.

MicPie · October 21, 2018, 6:30am

Hello everybody,

I am also interested in DL for genomics, and I will attend the v3 course!

I recently tried to set up a ULMFiT RNN for sequence classification with the data from the last PrecisionFDA CDRH Biothreat Challenge, but was not successful due to the sheer amount of data (a single epoch of training took forever).

I am not a bioinformatics expert, but I would still be interested in similar projects!

Maybe we can join forces to share interesting literature, projects, repos, upcoming PrecisionFDA competitions?

Best regards
Michael

PS: This thread could also be of interest: Autoencoder for gene sequences

caspase8 · October 21, 2018, 4:41pm

Definitely would be interested in joining forces and being able to come up with a cool use case.

caspase8 · October 21, 2018, 4:42pm

Definitely would be interested in collaborating. I posted the original post and would be very interested in joining forces.

MicPie · July 2, 2020, 7:36am

Maybe FYI:

MicPie · July 16, 2020, 8:46am

Very interesting:

wjs20 · January 25, 2021, 4:17pm

Hi MichaelG

I have been trying to apply some of the approaches outlined in the papers linked above myself over the last couple of months, i.e. fine-tuning protein language models on antibody V domain amino acid sequences and building a classifier/regressor on the output embeddings.

It is indeed very difficult! I have had limited success so far. There are not that many public datasets mapping antibody sequences to binding affinity. In addition, a new function has to be learned for every antibody-antigen interaction, as they are all unique.

Did you give it a go in the end?

Shaymaa_younis · September 30, 2021, 6:21pm

Hi,
Can you help me?
I would like to ask you about bioinformatics with machine learning.

MicPie · October 11, 2022, 3:06pm

Hello everyone,

we recently started https://OpenBioML.org with the help of stability.ai.

If you are interested in doing cutting-edge, open bio ML research, feel free to have a look at our website and join our Discord server!