Deep Learning for Genomics/Bioinformatics/Comp Bio

I don’t know how to get your data from your provider, but you could join “All of Us”, a U.S.-wide program to sequence 1 million genomes. UCSF is the local facilitator.

I already gave them some samples; it took about 30 minutes and a few online forms. My theory is that it’s better the NIH has my genome than a private company :slight_smile:

3 Likes

This is a good paper about how to do hypothesis-driven and controlled AI experiments, by Assistant Professor Michael Keiser at UCSF: “Adversarial Controls for Scientific Machine Learning”.

He talks about feeding random data through an ML model to see whether you get results similar to those you get from the features you think are important. I’m not sure how this will transfer to DL models, since we are not supposed to ‘curate’ the inputs.
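One common form of this kind of adversarial control is label shuffling (“y-scrambling”): retrain on permuted labels and check that performance collapses to chance. Here’s a minimal toy sketch of the idea, using a nearest-centroid classifier on synthetic data (the classifier and data are just illustrative, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two classes that differ only in their feature means.
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(2, 1, (100, 5))])
y = np.array([0] * 100 + [1] * 100)

def nearest_centroid_accuracy(X, y):
    """Random train/test split, fit class centroids, report test accuracy."""
    idx = rng.permutation(len(y))
    tr, te = idx[:150], idx[150:]
    c0 = X[tr][y[tr] == 0].mean(axis=0)
    c1 = X[tr][y[tr] == 1].mean(axis=0)
    pred = (np.linalg.norm(X[te] - c1, axis=1)
            < np.linalg.norm(X[te] - c0, axis=1)).astype(int)
    return (pred == y[te]).mean()

real_acc = nearest_centroid_accuracy(X, y)

# Adversarial control: shuffle the labels, destroying any real signal.
y_shuffled = rng.permutation(y)
control_acc = nearest_centroid_accuracy(X, y_shuffled)

print(real_acc, control_acc)  # control accuracy should hover near chance
```

If the control accuracy stays high, the model is probably exploiting leakage or confounds rather than the features you care about.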

1 Like

Interesting paper this morning, “Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts”

Honestly I don’t really understand the actual problem they’re solving, i.e. why it’s important to improve prediction of chromatin accessibility, but that’s a bio question, not a DL question. The paper is quite clear in its technical implementation and model design, though. They use a fairly shallow ResNet architecture with 1-d convolutions, and they do their own kind of transfer learning (I’ll have to re-read to understand what they’re transferring between; it seems like “curriculum learning” to me, but I’m not sure). It’s done in PyTorch.

Code repo (including links to the data!) here

Good twitter thread from co-lead-author with some more context here

Edit to add: I think it could be interesting to replicate in fastai, using as many tricks as we know, but at 30 GB of data I assume it would take a while to train!

2 Likes

I was actually looking over this dataset earlier today. It’s a big dataset, but the 30 GB is a bit misleading. They actually process all the sequence data into one-hot encoded matrices and save those directly. When I converted a sample back into text data + labels, it was about two-thirds the size. Still, that works out to 20 GB of sequence data, which is a lot for consumer hardware.
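For anyone who hasn’t worked with this encoding before: a one-hot float32 matrix costs 16 bytes per base (4 channels × 4 bytes) versus 1 byte per character as text, which is why storing the encoded matrices inflates the size. A minimal round-trip sketch, assuming an A/C/G/T channel order (the repo’s actual ordering may differ, so check before reusing):

```python
import numpy as np

# Assumed channel order A/C/G/T -- verify against the repo's encoding.
BASES = "ACGT"

def one_hot_encode(seq):
    """Encode an ACGT string as an (L, 4) one-hot float32 matrix."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        out[i, BASES.index(base)] = 1.0
    return out

def one_hot_decode(mat):
    """Recover the sequence string from an (L, 4) one-hot matrix."""
    return "".join(BASES[i] for i in mat.argmax(axis=1))

seq = "GATTACA"
mat = one_hot_encode(seq)
assert mat.shape == (7, 4)
assert one_hot_decode(mat) == seq
```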

What’s interesting about the dataset is that it includes chromatin accessibility data for 123 cell lines, meaning the same sequence can have different levels of accessibility in different lines. The authors deal with this by using a two-stage training process that is not well described. If I’m reading the paper correctly, first they train a model as a multi-class classification problem, mapping an input sequence to a 123-long vector of predictions. Then they transfer the learned weights to a new model that uses both the genomic sequence and some cell-line-specific metadata to make a single prediction.
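The weight-transfer step, as I read it, amounts to copying the shared trunk into a second model with a different head. Here’s a bare sketch of that mechanics using plain dicts standing in for PyTorch state_dicts (all layer names, shapes, and the 8-dim metadata vector are hypothetical, just to show the copy pattern):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 "model": shared conv trunk + a 123-way head (one output per cell line).
stage1 = {
    "trunk.conv1": rng.normal(size=(32, 4, 7)),   # hypothetical 1-d conv weights
    "trunk.conv2": rng.normal(size=(64, 32, 5)),
    "head.fc":     rng.normal(size=(123, 64)),
}

# Stage 2 "model": same trunk, but a new head that also consumes a
# (hypothetical) 8-dim cell-line metadata vector and outputs one prediction.
stage2 = {
    "trunk.conv1": np.zeros((32, 4, 7)),
    "trunk.conv2": np.zeros((64, 32, 5)),
    "head.fc":     rng.normal(size=(1, 64 + 8)),
}

# Transfer: copy every matching trunk parameter; the new head keeps its init.
for name in stage2:
    if name.startswith("trunk.") and stage1[name].shape == stage2[name].shape:
        stage2[name] = stage1[name].copy()

assert np.array_equal(stage2["trunk.conv1"], stage1["trunk.conv1"])
```

In real PyTorch this is typically a filtered `load_state_dict(..., strict=False)`, but the idea is the same: the sequence trunk carries over, the task-specific head starts fresh.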

2 Likes

Thanks for sharing this video and clearly showing the lack of pre-trained models in medicine. There’s definitely work to be done!

On the TWiML&AI podcast, I heard something about being able to reverse engineer patients’ data from a model’s weights, especially for individuals who are outliers. Hypothetically, could it be possible to find a person’s metadata (height, age, location, etc.) and connect it with rare mutations in their genome if that data was used to make a model?

If it is possible, what can we do to prevent reverse engineering? I think TWiML mentioned something about adding noise to the data.

I guess if the weights are not shared the problem should be mitigated?
(However, this, of course, does not agree with open science.)

I found these publications in my library (which I still have to read in detail) that could be of interest to you:

Paper on transformer/language modeling approaches to learning protein structure

3 Likes

The code also seems to be on the way: https://twitter.com/soumithchintala/status/1124026943119286272

You could try SVAI (Silicon Valley Artificial Intelligence), a non-profit that organizes AI/ML lectures and patient-focused medical research cases. I attended their rare kidney disease hackathon (top-3 winner using fastai and a collaborative filtering approach) last year and will be attending their undiagnosed hackathon next month. The difference here is that for each hackathon there is a patient who has shared their information, and that patient is also present, so you can really deep dive into their history.

You could share your info and let us all hack it!

Here is the link SVAI

2 Likes

Ian Goodfellow talked about this in an interview with Lex Fridman, where he discussed using GANs to generate anonymized data that is then published for researchers to use. There’s a related line of work called domain-adversarial learning.

Papers:

  1. Domain-Adversarial Training of Neural Networks
  2. Medical Image Synthesis for Data Augmentation and Anonymization using Generative Adversarial Networks
  3. Learning Anonymized Representations with Adversarial Neural Networks

Anyone interested in teaming up for this genomics based hackathon in SF on June 7th-9th?

@sparalic Is it necessary to attend in person, or can one participate remotely?

Yes, remote participants are welcome.

Then I’d be interested in joining a team.

Thinking about applying. Are you attending or hosting?

I’m thinking of signing up! I’m in healthcare but don’t know much about genomics data.

Hi, I would like to join the team :smiley:
What do I need to do?
How do we start?
It would be great to look for some resources and datasets and train some models before the competition starts.

Here are a few resources for DL in genomics:

DNA Sequence

Deep learning infrastructure for bioinformatics

ULMFiT for Genomic Sequence Data

DeepCRISPR
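On the ULMFiT-for-genomics idea: since DNA has no natural “words”, one common preprocessing choice is to tokenize sequences into overlapping k-mers so a language model can treat them like text. A tiny sketch of that (this is a generic approach, not necessarily what that particular repo does):

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words' so that a
    language model can treat it like a sentence of tokens."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ACGTAC", k=3)
# -> ['ACG', 'CGT', 'GTA', 'TAC']
```

The choice of `k` and `stride` trades vocabulary size (4^k possible tokens) against sequence length, much like subword vocabulary size in NLP.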

Have you seen this paper by Edward Choi? He too used GANs to generate synthetic patient data.

5 Likes

Recursion Pharma is dropping a large dataset of fluorescence microscopy images aimed at separating experimental effects from batch effects.

3 Likes

This link is gone. Do you know where we can find the dataset?