Hi,
Could DL be used to analyze genomics data, which is much wider (1 million to 30 million columns) than it is long (10K to 100K rows)? Any suggestions, resources, or examples would be highly appreciated.
Thanks
Here’s a recent review paper:
Awesome, I will go through the suggested review. Thanks
Hi. I have worked on inferring genomic information from MRI images and using it to predict cancer growth in patients in an Asian context. I am also currently teaching myself bioinformatics and applying it to genomics data in my side projects.
I learned about the deep learning landscape in biology and medicine through the paper “Opportunities and obstacles for deep learning in biology and medicine”. I think it is quite useful.
I also followed the work of:
RNN/NLP in genomics:
If you are looking for the cutting edge model zoo for regulatory genomics including Google DeepVariant, check out:
This is not meant to be exhaustive. Almost impossible as there’s so much out there now. Here’s one giant list of deep learning implementations in biology.
Thanks for the fantastic information.
What kind of DL genomics projects are you doing or have you done? Any insight would be really helpful.
Thanks a lot
@caspase8 absolutely. What use cases do you have in mind?
In genomics, there are generally two complementary approaches to understanding the relationship between genotype (DNA sequence) and phenotype (say, disease status). Ideally we would like to have a function: phenotype = f(genotype, environment)
which could predict the outcome (say, health) from your genetic background and your lifestyle (environment). Having such a function would be immensely important, as you could start asking questions like which mutations increase the risk of having the disease. Knowing the problematic mutations helps to pinpoint the mechanism of the disease, as you know its exact origin.
The classical way is to sequence many people (say 500k - http://www.ukbiobank.ac.uk/) and then try to find statistical associations between individual mutations and the phenotype. What you practically end up with is a binary matrix of mutations X of shape (1e5, 1e7) (the first axis indexes people and the second axis indexes mutations) and a binary vector y of shape (1e5,) denoting the phenotype of interest (say 1=disease, 0=healthy).
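As a rough sketch of this setup, here is a toy version in Python with much smaller shapes, running one association test per column of X. The choice of a Pearson chi-square test on the 2x2 table, the simulated data, and all names (`chi2_test_2x2`, `causal`, the risk values) are my own assumptions for illustration, not what any particular association-study tool does.

```python
import math
import numpy as np

def chi2_test_2x2(y, x):
    """Pearson chi-square test (1 degree of freedom) for two binary vectors.

    Returns (statistic, p_value); returns (0.0, 1.0) if any margin is empty.
    """
    y = np.asarray(y, dtype=bool)
    x = np.asarray(x, dtype=bool)
    n = int(y.size)
    # Cells of the 2x2 contingency table, as Python ints to avoid overflow.
    a = int(np.sum(x & y))    # carriers with the disease
    b = int(np.sum(x & ~y))   # carriers without the disease
    c = int(np.sum(~x & y))   # non-carriers with the disease
    d = int(np.sum(~x & ~y))  # non-carriers without the disease
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return float(stat), p_value

# Toy data: 2000 people, 50 variants (real studies are closer to 1e5 x 1e7).
rng = np.random.default_rng(0)
n_people, n_variants = 2000, 50
X = rng.random((n_people, n_variants)) < 0.3

# Assumption for the demo: variant 7 raises disease risk from 0.2 to 0.6.
causal = 7
risk = 0.2 + 0.4 * X[:, causal]
y = rng.random(n_people) < risk

# One test per variant: test(y, X[:, i]).
pvals = np.array([chi2_test_2x2(y, X[:, i])[1] for i in range(n_variants)])
top_variant = int(np.argmin(pvals))
print("most associated variant:", top_variant)
```

With real data the matrix would not fit in memory as a dense array, so the columns would have to be streamed or stored in a sparse or bit-packed format.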
Then you do 1e7 statistical tests, test(y, X[:, i]), each asking whether a particular variant is associated with the phenotype. There are several problems with this approach:
The goal of functional genomics is to understand the impact of mutations on molecular phenotypes. A molecular phenotype is, for example, the amount of a certain protein in the cell. The advantage of this approach is that one can measure thousands or millions of molecular phenotypes in the cell for each person (say the abundance of 20000 proteins). Each molecular phenotype is typically related only to a known subpart of the genome (say 1000bp of DNA sequence). Hence, one can build a model (typically a CNN) to predict those molecular phenotypes from the DNA sequence itself. Mutations are then scored as the difference between the predictions for the sequence with and without the mutation of interest (see the DeepSEA paper). We can then use this predicted difference as a feature and associate it with the phenotype of interest.
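The mutation scoring step described above can be sketched roughly as follows. The model here is a hypothetical stand-in (a single motif filter with max-pooling, not a trained CNN and not DeepSEA itself), and the names `one_hot`, `toy_model`, and `variant_effect` are my own. Only the scoring logic, prediction on the mutated sequence minus prediction on the reference sequence, is the point.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) array, one column per base."""
    idx = np.array([BASES.index(base) for base in seq])
    return np.eye(4)[idx]

def toy_model(x, motif):
    """Hypothetical stand-in for a trained CNN.

    Slides one motif filter over the one-hot sequence and returns the
    best match score (a single convolution filter plus global max-pooling).
    """
    k = motif.shape[0]
    scores = [np.sum(x[i:i + k] * motif) for i in range(x.shape[0] - k + 1)]
    return float(max(scores))

def variant_effect(seq, pos, alt, model):
    """Score a substitution as model(alternative) - model(reference)."""
    ref_pred = model(one_hot(seq))
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    alt_pred = model(one_hot(alt_seq))
    return alt_pred - ref_pred

# Assumed example: the "molecular phenotype" depends on a TATA motif.
motif = one_hot("TATA")
model = lambda x: toy_model(x, motif)
seq = "GGCTATAGGC"

inside = variant_effect(seq, 4, "G", model)   # mutation breaks the motif
outside = variant_effect(seq, 0, "A", model)  # mutation leaves the motif intact
print(inside, outside)
```

In practice `model` would be a network trained on measured molecular phenotypes, and the resulting differences would be collected per variant and used as features in the association step.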
The advantage is twofold:
The problems of this approach are:
I think the way forward is in combining these two approaches and using functional genomics as a prior for the association studies. This was actually my original motivation to build Kipoi.
Other links:
So glad to find this topic. I have similar interests in applying DL to genomics-related problems. By any chance, are any of you attending the Fastai v3 course starting soon? It would be neat to collaborate and work on a real-world problem.
Hello everybody,
I would also be interested in DL for genomics and I will attend the v3 course!
I recently tried to set up a ULMFiT RNN for sequence classification with the data from the last PrecisionFDA CDRH Biothreat Challenge, but was not successful due to the sheer amount of data (training took forever for a single epoch).
I am not a bioinformatics expert, but I would still be interested in similar projects!
Maybe we can join forces to share interesting literature, projects, repos, and upcoming PrecisionFDA competitions?
Best regards
Michael
PS: This thread could also be of interest: Autoencoder for gene sequences
Definitely would be interested in joining forces and being able to come up with a cool use case.
Definitely would be interested in collaborating. I posted the original post and would be very interested in joining forces.
Maybe FYI:
Very interesting:
Hi MichaelG
Over the last couple of months I have been trying to apply some of the approaches outlined in the papers linked above myself,
i.e. fine-tuning protein language models on antibody V domain amino acid sequences and building a classifier/regressor on the output embeddings.
It is indeed very difficult! I have had limited success so far. There are not that many public datasets mapping antibody sequences to binding affinity. In addition, a new function has to be learned for every antibody-antigen interaction, as they are all unique.
Did you give it a go in the end?
Hi
Can you help me?
I would like to ask you about machine learning in bioinformatics.
Hello everyone,
we recently started https://OpenBioML.org with the help of stability.ai.
If you are interested in doing cutting-edge, open bio ML research, feel free to have a look at our website and join our Discord server!