I don’t know how to get your data from your provider, but you could join “All of Us”, a U.S.-wide program to sequence 1 million genomes. UCSF is the local facilitator.
I already gave them some samples; it took about 30 minutes and a few online forms. My theory: better that the NIH has my genome than a private company.
He talks about generating random data, putting it through an ML model, and checking whether you get results similar to when you put in features you think are important. I’m not sure how this transfers to DL models, since we’re not supposed to ‘curate’ the inputs.
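The idea above can be sketched as a permutation check against a pure-noise feature: if a feature you believe matters is shuffled, the model’s error should degrade far more than when a random feature is shuffled. This is a minimal toy version (the model, data, and feature names are my own invented example, not anything from the talk):

```python
import numpy as np

# Toy sanity check: fit a model on one real feature plus one pure-noise
# feature, then compare how much shuffling each column hurts the error.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)                 # feature that truly drives y
noise = rng.normal(size=n)                  # feature with no relationship
y = 3.0 * signal + rng.normal(scale=0.1, size=n)

X = np.column_stack([signal, noise])
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form linear "model"

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

baseline = mse(X, y, w)
importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])    # shuffle one column
    importances.append(mse(Xp, y, w) - baseline)

# the real feature's importance should dwarf the random one's
print(importances)
```

The same comparison should work with any fitted model in place of the least-squares fit; the random column acts as a null baseline for the importance scores.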
Interesting paper this morning: “Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts”
Honestly I don’t really understand the actual problem they’re solving (i.e. why it’s important to improve prediction of chromatin accessibility), but that’s a bio question, not a DL question. The paper is quite clear in its technical implementation and model design: they use a fairly shallow ResNet architecture with 1D convolutions, and they do their own kind of transfer learning (I’ll have to re-read to understand what they’re transferring between; it seems like “curriculum learning” to me, but I’m not sure). It’s done in PyTorch.
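For a rough picture of what “a shallow ResNet with 1D convolutions” looks like for one-hot DNA input, here’s a minimal sketch. All layer names, channel sizes, and depths here are my own guesses, not the authors’ actual model:

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """A basic residual block over 1D (sequence) data."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)            # identity shortcut

class AccessibilityNet(nn.Module):
    """Hypothetical shallow 1D ResNet: one-hot DNA in, 123 outputs."""
    def __init__(self, n_outputs=123, channels=64, n_blocks=3):
        super().__init__()
        self.stem = nn.Conv1d(4, channels, kernel_size=7, padding=3)
        self.blocks = nn.Sequential(*[ResBlock1d(channels) for _ in range(n_blocks)])
        self.head = nn.Linear(channels, n_outputs)

    def forward(self, x):                    # x: (batch, 4, seq_len)
        x = self.blocks(self.stem(x))
        x = x.mean(dim=-1)                   # global average pool over sequence
        return self.head(x)

model = AccessibilityNet()
out = model(torch.randn(2, 4, 1000))         # 2 sequences of length 1000
print(out.shape)                             # torch.Size([2, 123])
```

The global average pool lets the same network handle variable-length sequences, which is a common trick with 1D conv genomics models.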
Code repo (including links to the data!) here
Good Twitter thread from the co-lead author with some more context here
Edit to add: I think it could be interesting to replicate this in fastai, using as many tricks as we know, but at 30 GB of data I assume it would take a while to train!
I was actually looking over this dataset earlier today. It’s a big dataset, but the 30 GB is a bit misleading. They process all the sequence data into one-hot encoded matrices and save those directly. When I converted a sample back into text data + labels, it was about two-thirds the size. Still, that works out to 20 GB of sequence data, which is a lot for consumer hardware.
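To see why the one-hot storage inflates the size: each base is 1 byte as text, but a length-4 vector when one-hot encoded (so 4× as large even at the smallest 1-byte dtype; exact on-disk sizes will depend on dtype and compression). A quick sketch:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (4, len) uint8 matrix."""
    idx = np.array([BASES.index(b) for b in seq])
    mat = np.zeros((4, len(seq)), dtype=np.uint8)
    mat[idx, np.arange(len(seq))] = 1
    return mat

seq = "ACGTACGTAA"
mat = one_hot(seq)
print(mat.nbytes, len(seq))   # 40 bytes one-hot vs 10 bytes of text
```

Storing float32 one-hots instead of uint8 would multiply that by another 4×, which may be part of the gap between the raw text and the 30 GB figure.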
What’s interesting about the dataset is that it includes chromatin accessibility data for 123 cell lines, meaning the same sequence can have different levels of accessibility in different lines. The authors deal with this by using a two-stage training process that is not well described. If I’m reading the paper correctly, they first train a model as a multi-output problem, mapping an input sequence to a 123-long vector of predictions (one per cell line). Then they transfer the learned weights to a new model that uses both the genomic sequence and some cell-line-specific metadata to make a single prediction.
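My reading of that two-stage process can be sketched as follows; the layer sizes and names here are entirely hypothetical, and the "body" stands in for whatever sequence trunk they actually use:

```python
import torch
import torch.nn as nn

N_LINES, SEQ_FEATS, META_FEATS = 123, 64, 8

# Shared trunk over sequence-derived features
body = nn.Sequential(nn.Linear(SEQ_FEATS, 32), nn.ReLU())

# Stage 1: multi-output head, one prediction per cell line
stage1 = nn.Sequential(body, nn.Linear(32, N_LINES))

class Stage2(nn.Module):
    """Stage 2: reuse the stage-1 trunk, add cell-line metadata, single output."""
    def __init__(self, body):
        super().__init__()
        self.body = body                      # transferred weights
        self.head = nn.Linear(32 + META_FEATS, 1)

    def forward(self, seq_feats, meta):
        h = self.body(seq_feats)
        return self.head(torch.cat([h, meta], dim=-1))

stage2 = Stage2(body)                         # trunk carried over from stage 1
x, m = torch.randn(5, SEQ_FEATS), torch.randn(5, META_FEATS)
print(stage1(x).shape, stage2(x, m).shape)    # (5, 123) and (5, 1)
```

The point of the transfer would be that the trunk already knows sequence features that predict accessibility in general before it has to condition on a specific cell line.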
Thanks for sharing this video and clearly showing the lack of pre-trained models in medicine. There’s definitely work to be done!
On the TWiML&AI podcast I heard something about being able to reverse engineer patients’ data from a model’s weights, especially for individuals who are outliers. Hypothetically, could it be possible to find a person’s metadata (height, age, location, etc.) and connect it with rare mutations in their genome if that data was used to build a model?
If that is possible, what can we do to prevent this kind of reverse engineering? I think TWiML mentioned something about adding noise to the data.
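The "add noise" idea is usually formalized as differential privacy. A minimal sketch of the classic Laplace mechanism (noise scaled to sensitivity/epsilon added before release) is below; note this is an illustration with made-up values, and real ML deployments more often clip and noise gradients during training (DP-SGD) rather than perturbing the raw records:

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_release(values, sensitivity=1.0, epsilon=0.5):
    """Release values with Laplace(sensitivity/epsilon) noise added."""
    scale = sensitivity / epsilon
    return values + rng.laplace(scale=scale, size=values.shape)

heights_cm = np.array([170.0, 182.5, 165.0])   # hypothetical patient metadata
noisy = laplace_release(heights_cm)
# individual values are masked, while aggregates stay roughly usable
```

The trade-off is controlled by epsilon: smaller epsilon means more noise, stronger privacy, and less useful data.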
You could try SVAI (Silicon Valley Artificial Intelligence), a non-profit that organizes AI/ML lectures and patient-focused medical research cases. I attended their rare kidney disease hackathon last year (top-3 winner using fastai and a collaborative filtering approach) and will be attending their undiagnosed hackathon next month. The difference here is that each hackathon centers on a patient who has shared their information, and that patient is also present, so you can really deep-dive into their history.
Ian Goodfellow discussed this in an interview with Lex Fridman: using GANs to generate anonymized data that is then published for researchers to use. There’s also a related branch of generative modeling called domain-adversarial learning.
Hi, I would like to join the team.
What do I need to do?
How do we start?
It would be great to look for some resources and datasets, and to train some models before the competition starts.