Deep Learning for Genomics/Bioinformatics/Comp Bio

Thanks! I think I’d misunderstood what the ‘bidirectional’ configuration in FastAI was actually doing. I’d seen people doing pseudo-bidirectional language modeling by training a forward LSTM, and a reverse LSTM, and using an ensemble of both models for their final prediction. Seeing as my computer didn’t explode from holding multiple instances of my model, I might have figured it out sooner, but mystery solved in any case.
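For anyone following along, the pseudo-bidirectional ensemble I mean is roughly the following — a minimal sketch where `forward_model` and `backward_model` are hypothetical classifiers that return class probabilities, one trained on sequences as-is and one trained on reversed sequences:

```python
def ensemble_predict(seq, forward_model, backward_model):
    # Average class probabilities from two classifiers: one built on a
    # forward language model, and one trained on reversed sequences.
    # Both models here are hypothetical stand-ins, not fastai objects.
    p_fwd = forward_model(seq)          # sequence read as-is
    p_bwd = backward_model(seq[::-1])   # same sequence, reversed
    return [(a + b) / 2 for a, b in zip(p_fwd, p_bwd)]
```

The point is just that each model only ever sees one direction; the "bidirectionality" comes from averaging the two at prediction time, unlike a truly bidirectional LSTM.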

I’ve been banging my head against another issue I wonder whether anyone’s encountered: leakage caused by closely related sequences in the training/validation sets.

Two organisms that are technically different species can have an average nucleotide identity (ANI) of over 90%, and their genomes will contain lots of short sequence fragments that are identical or near-identical. So even if you split your training and validation sets so that each contains completely different genomes, closely related genomes can still leave you with near-identical sequences in both sets — a leakage problem. For a while I thought this was what I was seeing with my weirdly high accuracies, before I figured out the bidirectional thing.

I’ve run a couple of quick tests with my corrected architecture, and it doesn’t make much of a difference for my particular dataset: 34% accuracy in the first epoch for a model where no sequence in the training set has more than 85% ANI with any sequence in the validation set, vs. 36% in the first epoch for a 95% ANI cutoff (genomes with higher than ~93–95% ANI can be considered the same species). This was splitting individual sequences rather than whole genomes; I haven’t tested splitting by genome-level ANI cutoffs yet. So perhaps my worries were unwarranted, but I wonder whether anyone else has encountered a leakage problem or had to work around it?
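For concreteness, the kind of split I mean looks roughly like this — a minimal sketch, not what I actually ran: it uses k-mer Jaccard similarity as a cheap stand-in for ANI (real ANI would come from a tool like fastANI), greedily clusters similar sequences, and then assigns whole clusters to train or validation so near-duplicates never straddle the split:

```python
import random

def kmer_set(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=8):
    # Jaccard similarity of k-mer sets: a cheap proxy for ANI.
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    return len(sa & sb) / max(len(sa | sb), 1)

def cluster_sequences(seqs, cutoff=0.85, k=8):
    # Greedy single-linkage clustering: a sequence joins the first
    # cluster whose representative exceeds the similarity cutoff.
    clusters = []
    for s in seqs:
        for c in clusters:
            if similarity(s, c[0], k) >= cutoff:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def leakage_safe_split(seqs, cutoff=0.85, valid_frac=0.2, seed=0):
    # Assign whole clusters (not individual sequences) to each split,
    # so near-identical sequences can't leak across train/validation.
    clusters = cluster_sequences(seqs, cutoff)
    random.Random(seed).shuffle(clusters)
    n_valid = max(1, int(len(clusters) * valid_frac))
    valid = [s for c in clusters[:n_valid] for s in c]
    train = [s for c in clusters[n_valid:] for s in c]
    return train, valid
```

Splitting by cluster rather than by sequence is the key move; the similarity measure and cutoff are whatever your leakage tolerance dictates.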

You could look at the American Gut project, where they tell you which microbes are in your gut, as an example and maybe a hint at what they could be doing. It’s mostly correlational, but layering on other types of “omics” (metabolomics and proteomics) could give a better scope.

Hi,

Your work is great. I am trying to use fastai version 2 on the Driven Data genetic attribution challenge https://www.drivendata.org/competitions/63/genetic-engineering-attribution/

When I use a k-mer size of 5 with a stride of 4 and the factory data loader method to load from a DataFrame, tokenization takes a long time and memory fills up. Could you share how you managed to tokenize the sequences, and whether you ran into any memory issues?
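For reference, the k-mer tokenization I mean looks roughly like this — a minimal sketch, not fastai’s tokenizer API; the lazy generator and the integer-id encoding (`kmer_to_id` is a hypothetical helper) are just one way to keep memory flat instead of materializing every token string:

```python
def kmer_tokenize(seq, k=5, stride=4):
    # Yield k-mers lazily instead of building the full token list,
    # which keeps memory flat even for very long sequences.
    for i in range(0, len(seq) - k + 1, stride):
        yield seq[i:i + k]

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_to_id(kmer):
    # Encode a k-mer as an integer in base 4 (vocab size 4**k),
    # sidestepping a large in-memory string vocabulary.
    i = 0
    for b in kmer:
        i = i * 4 + BASE[b]
    return i
```

With k=5 the vocabulary is only 4**5 = 1024 ids, so the memory pressure usually comes from holding all token strings at once rather than from the vocab itself.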

Interesting new work in that direction:



Just found this work published today!


Hey all - I wanted to share my recent work applying self-supervised learning and transfer learning to biological data, using fastai v1, with this group, which I think will find it interesting:

We trained a ‘universal language of life’ model we call LookingGlass and demonstrate its usefulness for diverse downstream transfer learning tasks. It’s particularly geared towards metagenomes and the short-read biological sequences typical of next-gen sequencing machines, which distinguishes it from e.g. the Facebook ESM model. We also provide a Python package (fastBio) that wraps fastai v1 to work with biological data. I hope it can be useful to others in this thread, and I’d be interested in whatever feedback you have!

We recently posted our preprint:

The pretrained models are available in release v1 of the LookingGlass repo:

And the fastBio repo, with docs/tutorial on using the Python package (and pretrained models), is here:
