Deep Learning for Genomics/Bioinformatics/Comp Bio

I don’t know how to get your data from your provider, but you could join “All of Us”, a U.S.-wide program to sequence 1 million genomes. UCSF is the local facilitator.

I already gave them some samples; it took about 30 minutes and a few online forms. My theory is that it’s better the NIH has my genome than a private company :slight_smile:

3 Likes

This is a good paper about how to do hypothesis-driven and controlled AI experiments, by Assistant Professor Michael Keiser at UCSF: “Adversarial Controls for Scientific Machine Learning”.

He talks about feeding random data through an ML model to see whether you get results similar to those you get from the features you think are important. I’m not sure how this will transfer to DL models, since we are not supposed to ‘curate’ the inputs.
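One common form of this kind of adversarial control is label shuffling (“y-scrambling”): retrain on permuted labels and check that performance collapses to chance. Here’s a minimal toy sketch of the idea, using a nearest-centroid classifier on synthetic data (the classifier and data are just illustrative, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two classes that differ only in their feature means.
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(2, 1, (100, 5))])
y = np.array([0] * 100 + [1] * 100)

def nearest_centroid_accuracy(X, y):
    """Random train/test split, fit class centroids, report test accuracy."""
    idx = rng.permutation(len(y))
    tr, te = idx[:150], idx[150:]
    c0 = X[tr][y[tr] == 0].mean(axis=0)
    c1 = X[tr][y[tr] == 1].mean(axis=0)
    pred = (np.linalg.norm(X[te] - c1, axis=1)
            < np.linalg.norm(X[te] - c0, axis=1)).astype(int)
    return (pred == y[te]).mean()

real_acc = nearest_centroid_accuracy(X, y)

# Adversarial control: shuffle the labels, destroying any real signal.
y_shuffled = rng.permutation(y)
control_acc = nearest_centroid_accuracy(X, y_shuffled)

print(real_acc, control_acc)  # control accuracy should hover near chance
```

If the control accuracy stays high, the model is probably exploiting leakage or confounds rather than the features you care about.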

1 Like

Interesting paper this morning, “Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts”

Honestly I don’t really understand the actual problem they’re solving, i.e. why it’s important to improve prediction of chromatin accessibility, but that’s a bio question, not a DL question. The paper is quite clear in its technical implementation and model design, though. They use a fairly shallow ResNet architecture with 1-d convolutions, and they do their own kind of transfer learning (I’ll have to re-read to understand what they’re transferring between; it seems like “curriculum learning” to me, but I’m not sure). It’s done in PyTorch.

Code repo (including links to the data!) here

Good twitter thread from co-lead-author with some more context here

Edit to add: I think it could be interesting to replicate in fastai, using as many tricks as we know, but at 30 GB of data I assume it would take a while to train!

2 Likes

I was actually looking over this dataset earlier today. It’s a big dataset, but the 30 GB is a bit misleading. They actually process all the sequence data into one-hot encoded matrices and save those directly. When I converted a sample back into text data + labels, it was about two-thirds the size. Still, that works out to 20 GB of sequence data, which is a lot for consumer hardware.
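For anyone who hasn’t worked with this encoding before: a one-hot float32 matrix costs 16 bytes per base (4 channels × 4 bytes) versus 1 byte per character as text, which is why storing the encoded matrices inflates the size. A minimal round-trip sketch, assuming an A/C/G/T channel order (the repo’s actual ordering may differ, so check before reusing):

```python
import numpy as np

# Assumed channel order A/C/G/T -- verify against the repo's encoding.
BASES = "ACGT"

def one_hot_encode(seq):
    """Encode an ACGT string as an (L, 4) one-hot float32 matrix."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        out[i, BASES.index(base)] = 1.0
    return out

def one_hot_decode(mat):
    """Recover the sequence string from an (L, 4) one-hot matrix."""
    return "".join(BASES[i] for i in mat.argmax(axis=1))

seq = "GATTACA"
mat = one_hot_encode(seq)
assert mat.shape == (7, 4)
assert one_hot_decode(mat) == seq
```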

What’s interesting about the dataset is that it includes chromatin accessibility data for 123 cell lines, meaning the same sequence can have different levels of accessibility in different lines. The authors deal with this by using a two-stage training process that is not well described. If I’m reading the paper correctly, first they train a model as a multi-class classification problem, mapping an input sequence to a 123-long vector of predictions. Then they transfer the learned weights to a new model that uses both the genomic sequence and some cell-line-specific metadata to make a single prediction.
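The weight-transfer step, as I read it, amounts to copying the shared trunk into a second model with a different head. Here’s a bare sketch of that mechanics using plain dicts standing in for PyTorch state_dicts (all layer names, shapes, and the 8-dim metadata vector are hypothetical, just to show the copy pattern):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 "model": shared conv trunk + a 123-way head (one output per cell line).
stage1 = {
    "trunk.conv1": rng.normal(size=(32, 4, 7)),   # hypothetical 1-d conv weights
    "trunk.conv2": rng.normal(size=(64, 32, 5)),
    "head.fc":     rng.normal(size=(123, 64)),
}

# Stage 2 "model": same trunk, but a new head that also consumes a
# (hypothetical) 8-dim cell-line metadata vector and outputs one prediction.
stage2 = {
    "trunk.conv1": np.zeros((32, 4, 7)),
    "trunk.conv2": np.zeros((64, 32, 5)),
    "head.fc":     rng.normal(size=(1, 64 + 8)),
}

# Transfer: copy every matching trunk parameter; the new head keeps its init.
for name in stage2:
    if name.startswith("trunk.") and stage1[name].shape == stage2[name].shape:
        stage2[name] = stage1[name].copy()

assert np.array_equal(stage2["trunk.conv1"], stage1["trunk.conv1"])
```

In real PyTorch this is typically a filtered `load_state_dict(..., strict=False)`, but the idea is the same: the sequence trunk carries over, the task-specific head starts fresh.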

2 Likes

Thanks for sharing this video and clearly showing the lack of pre-trained models in medicine. There’s definitely work to be done!

On the TWiML&AI podcast, I heard something about being able to reverse engineer patients’ data from a model’s weights, especially for individuals who are outliers. Hypothetically, could it be possible to find a person’s metadata (height, age, location, etc.) and connect it with rare mutations in their genome if that data was used to make a model?

If it is possible, what can we do to prevent reverse engineering? I think TWiML mentioned something about adding noise to the data.

I guess if the weights are not shared the problem should be mitigated?
(However, this, of course, does not agree with open science.)

I found these publications in my library (which I still have to read in detail) that could be of interest to you:

Paper on transformer/language modeling approaches to learning protein structure

3 Likes

The code also seems to be on the way: https://twitter.com/soumithchintala/status/1124026943119286272

You could try SVAI (Silicon Valley Artificial Intelligence), a non-profit that organizes AI/ML lectures and patient-focused medical research cases. I attended their rare kidney disease hackathon (top-3 winner using fastai and a collaborative filtering approach) last year and will be attending their undiagnosed hackathon next month. The difference here is that for each hackathon there is a patient who has shared their information, and that patient is also present, so you can really deep dive into their history.

You could share your info and let us all hack it!

Here is the link SVAI

2 Likes

Ian Goodfellow talked about this in an interview with Lex Fridman, where he discussed using GANs to generate anonymized data that is then published for researchers to use. There’s a related line of work called domain-adversarial learning.

Papers:

  1. Domain-Adversarial Training of Neural Networks
  2. Medical Image Synthesis for Data Augmentation and Anonymization using Generative Adversarial Networks
  3. Learning Anonymized Representations with Adversarial Neural Networks

Anyone interested in teaming up for this genomics based hackathon in SF on June 7th-9th?

@sparalic Is it necessary to attend in person, or can one participate remotely?

Yes, remote participants are welcome.

Then I’d be interested in joining a team.

Thinking about applying. Are you attending or hosting?

I’m thinking of signing up! I’m in healthcare but don’t know much about genomics data.

Hi, I would like to join the team :smiley:
What do I need to do?
How do we start?
It would be great to look for some resources and datasets and train some models before the competition starts.

Here are a few resources for DL in genomics:

DNA Sequence

Deep learning infrastructure for bioinformatics

ULMFiT for Genomic Sequence Data

DeepCRISPR
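On the ULMFiT-for-genomics idea: since DNA has no natural “words”, one common preprocessing choice is to tokenize sequences into overlapping k-mers so a language model can treat them like text. A tiny sketch of that (this is a generic approach, not necessarily what that particular repo does):

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words' so that a
    language model can treat it like a sentence of tokens."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ACGTAC", k=3)
# -> ['ACG', 'CGT', 'GTA', 'TAC']
```

The choice of `k` and `stride` trades vocabulary size (4^k possible tokens) against sequence length, much like subword vocabulary size in NLP.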

Have you seen this paper by Edward Choi? He too used GANs to generate synthetic patient data.

5 Likes

Recursion Pharma is dropping a large dataset of fluorescence microscopy images aimed at separating experimental effects from batch effects.

3 Likes

This link is gone. Do you know where we can find the dataset?