ULMFiT for Genomics, Proteins, or Sequence Tagging Tasks

(Md Mofijul Islam) #1

Has anyone tried ULMFiT for genomics, proteins, or any other sequence tagging task (i.e. not natural text)?


(Michael) #2

I tried to play with it but had to give up because of the lack of GPU power.

For my experimental setup I used the hg38/GRCh38 human reference genome, once with a simple base-level tokenization method and once with a sentencepiece model that I trained for tokenization.
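
For illustration, a minimal sketch of what a simple base-level tokenization could look like. This is not the code used in the experiment described above; the names `VOCAB`, `tokenize_bases`, and `numericalize` and the `<unk>` token are hypothetical. It assumes the input is a plain DNA string in which lowercase letters mark soft-masked bases, so they are upper-cased before lookup.

```python
from typing import Dict, List

# Hypothetical vocabulary: the four bases, N for unknown bases,
# and an <unk> id for anything unexpected.
VOCAB: Dict[str, int] = {"<unk>": 0, "A": 1, "C": 2, "G": 3, "T": 4, "N": 5}


def tokenize_bases(seq: str) -> List[str]:
    """Split a DNA string into one token per base (case-insensitive)."""
    return list(seq.upper())


def numericalize(tokens: List[str]) -> List[int]:
    """Map each token to its integer id, falling back to <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]


tokens = tokenize_bases("acgtNNag")
ids = numericalize(tokens)
print(tokens)
print(ids)
```

With one token per base the vocabulary stays tiny but sequences become very long, which is one reason a subword model such as sentencepiece is an attractive alternative.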

This is something I am very interested in and maybe I will give it another try when I have more GPU power available. :slight_smile:



@MicPie, did anything interesting come out of the sentencepiece model? Did the tokens map nicely onto codons (i.e. amino acids) in the coding regions? Were there any other tokens of interest (e.g. promoter regions)?
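
One rough way to probe the codon question is to tokenize an in-frame coding sequence and measure how often the token boundaries fall on codon boundaries. The sketch below is only an illustration under assumptions: `codon_boundary_fraction` is a hypothetical helper, the pieces are assumed to come from a coding sequence that starts in frame, and the leading `▁` marker that sentencepiece adds to pieces is stripped before counting. If boundaries were unrelated to the reading frame, roughly one third of them would be expected to align by chance.

```python
from typing import List


def codon_boundary_fraction(pieces: List[str]) -> float:
    """Fraction of internal token boundaries that land on a multiple of 3.

    `pieces` is the tokenization of one in-frame coding sequence.
    """
    offsets = []
    pos = 0
    for piece in pieces:
        # Drop the sentencepiece whitespace marker (U+2581) before measuring length.
        pos += len(piece.replace("\u2581", ""))
        offsets.append(pos)

    internal = offsets[:-1]  # the end of the sequence is not a real boundary
    if not internal:
        return 0.0
    aligned = sum(1 for offset in internal if offset % 3 == 0)
    return aligned / len(internal)


# Hypothetical tokenization of the in-frame sequence ATGGCCAAGT.
example = ["\u2581ATG", "GCC", "AA", "GT"]
print(codon_boundary_fraction(example))
```

A fraction well above one third on real coding regions would hint that the learned tokens respect codon structure; this says nothing about promoters or other non-coding features.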


(Michael) #4

It was quite some time ago and I don’t remember anything special, but I didn’t look in detail at whether the sentencepiece model captured special (longer) sequences.

The results looked much like the example in the sentencepiece repo.

When fastai v2 is out I want to look into it again. If you are interested in this topic too, we can join forces and discuss our results.
