ULMFiT for Genomics, Proteins, or Sequence Tagging Tasks

Has anyone tried ULMFiT for genomics, proteins, or any other sequence tagging task (not natural text)?

I tried to play with it but had to give up because of the lack of GPU power.

For my experimental setup I used the hg38/GRCh38 human reference genome, first with a simple per-base tokenization method, and I also trained a sentencepiece model for tokenization.
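For readers who want to see what a simple base tokenization could look like, here is a minimal sketch. The post does not show the actual tokenizer, so the function names `tokenize_bases` and `tokenize_kmers` are hypothetical, and the k-mer variant is an added illustration rather than something used in the experiment above.

```python
def tokenize_bases(seq):
    """Per-base tokenization: every nucleotide becomes its own token."""
    return list(seq.upper())


def tokenize_kmers(seq, k=3, stride=1):
    """Fixed-length k-mer tokenization (an assumption, not the poster's method).

    stride=1 gives overlapping k-mers, stride=k gives non-overlapping ones.
    A trailing fragment shorter than k is dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


example = "acgtnacg"  # reference genomes can contain lowercase (soft-masked) bases and N
base_tokens = tokenize_bases(example)
kmer_tokens = tokenize_kmers(example, k=3, stride=3)
```

With per-base tokens the vocabulary is tiny (A, C, G, T, N), while a learned subword model such as sentencepiece chooses variable-length pieces from the data.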

This is something I am very interested in and maybe I will give it another try when I have more GPU power available. :slight_smile:

@MicPie, did anything interesting come out of the sentencepiece model? Did the tokens map nicely onto codons (amino acids) in the coding regions? Were there any other tokens of interest (e.g. promoter regions)?
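To make the codon question concrete, here is a small sketch of how a coding sequence splits into codons and translates to amino acids under the standard genetic code. The helper names `split_codons` and `translate` are hypothetical, and the sketch assumes the reading frame starts at the first base, which a learned tokenizer has no reason to respect.

```python
from itertools import product

# Standard genetic code, built from the usual compact TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(seq):
    """Split a DNA sequence into non-overlapping triplets, dropping any remainder."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def translate(seq):
    """Translate codon by codon; '*' marks a stop codon, 'X' an unknown codon (e.g. with N)."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in split_codons(seq))


protein = translate("ATGGCCTAA")
```

A token that lines up with codons would have a length that is a multiple of three and would start on a codon boundary in the coding frame, so checking token length alone is not enough.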

It was quite some time ago and I don’t remember anything special, but I didn’t look in detail at whether the sentencepiece model captured special (longer) sequences.

The results looked quite like the example on the sentencepiece repo.

When fastai v2 is out I want to look into it again. If you are interested in this topic too, we can join forces and discuss our results.

I tried it today, but it fails right from the start!
I get an error message when I try to import * from utils:

NameError                                 Traceback (most recent call last)
     39     pass
---> 41 class GenomicVocab(Vocab):
     42     def __init__(self, itos):
     43         self.itos = itos

NameError: name 'Vocab' is not defined

Has anyone else run into this issue?
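For context, a NameError on a class statement means the base class name is not bound in that module at the moment the class is defined. The sketch below reproduces that with a stand-in: the `Vocab` class here is hypothetical and is not the fastai class. One possible cause in the real project is that the installed fastai version does not export `Vocab` where utils expects it, but that is an assumption, not something the traceback confirms.

```python
# Step 1: the base class name is not defined yet, so the class statement itself fails.
caught_error = None
try:
    class BrokenVocab(Vocab):
        pass
except NameError as err:
    caught_error = err


# Step 2: once a base class with that name is in scope, the subclass is created normally.
# This Vocab is a hypothetical stand-in, NOT the fastai implementation.
class Vocab:
    def __init__(self, itos):
        self.itos = itos                                     # index -> token
        self.stoi = {tok: i for i, tok in enumerate(itos)}   # token -> index


class GenomicVocab(Vocab):
    def __init__(self, itos):
        super().__init__(itos)


vocab = GenomicVocab(["A", "C", "G", "T"])
```

In the real code the fix would be to make the library's own `Vocab` available before `GenomicVocab` is defined, for example through the right import for the installed version, rather than redefining it as done here.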