Tokenization + Numericalization for genetic short sequences

Hi there,
Sorry for the simple question, but I'm new to language models and text classification.
I'm working with DNA sequences: I have many files, each containing a bunch of short sequences of approximately 15 bases each, and I want to perform text classification. As I understand it, I only need to tokenize on spaces and then numericalize. Is that right? How should I proceed?

Basically, I don't want to apply any of the defaults, like a lang setting or replace_rep for repetitions. The text is already clean, but the tokens are not real words, just short sequences acting as 'words', which recur from time to time across different text files. Hence I think I need to build a language model from scratch based on simple rules (just splitting on spaces, I guess).
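Here's a minimal sketch of what I have in mind, in pure Python with made-up toy data (the special tokens and variable names are my own choices): split on whitespace only, build a vocab, then map tokens to integer ids.

```python
from collections import Counter

# Toy corpus: each "document" is a line of space-separated short DNA sequences.
# (Made-up data, just to illustrate the pipeline.)
docs = [
    "ATGCGTACGTTAGCA TTGACGTAGCATGCA ATGCGTACGTTAGCA",
    "GGCATTACGATCGTA TTGACGTAGCATGCA",
]

# 1) Tokenize: split on whitespace only -- no language-specific rules.
tokenized = [doc.split() for doc in docs]

# 2) Build a vocabulary, most frequent tokens first, with reserved ids
#    for padding and unknown tokens.
counts = Counter(tok for doc in tokenized for tok in doc)
itos = ["<pad>", "<unk>"] + [tok for tok, _ in counts.most_common()]
stoi = {tok: i for i, tok in enumerate(itos)}

# 3) Numericalize: map each token to its integer id (unknowns -> <unk>).
ids = [[stoi.get(tok, stoi["<unk>"]) for tok in doc] for doc in tokenized]

print(ids)
```

Is this roughly the right shape for the pipeline, or am I missing a step?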

Any clues? Anyone working on something similar?


You can use subword tokenization (e.g. byte-pair encoding, as implemented in SentencePiece). Instead of treating each 15-base sequence as one token, it learns frequent sub-sequences from the data and builds a vocabulary out of those.
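To make that concrete, here's a toy byte-pair-encoding trainer in pure Python on made-up DNA strings — a sketch of the idea only, not a production tokenizer (in practice you'd train with a library such as SentencePiece):

```python
from collections import Counter

def train_bpe(seqs, num_merges):
    """Learn `num_merges` pair merges, starting from single bases."""
    corpus = [list(s) for s in seqs]  # character-level start
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        new_corpus = []
        for toks in corpus:
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(toks[i] + toks[i + 1])
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

seqs = ["ATGATGATG", "ATGCATGC", "GATGAC"]
merges, tokenized = train_bpe(seqs, num_merges=3)
print(merges)     # learned merge rules, most frequent first
print(tokenized)  # the corpus re-tokenized with those merges
```

The learned merges become your vocabulary units, and you numericalize those instead of whole sequences — which helps when the 15-base "words" themselves rarely repeat but share common motifs.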