Tokenization + Numericalization for genetic short sequences

sgmiriuka · June 10, 2020, 10:40am

Hi there
Sorry for the simple question, but I’m new to language models and text classification.
I’m working with DNA sequences. I have many files, each containing a bunch of short sequences of approx. 15 bases, and I want to perform text classification. As I understand it, I only need to tokenize on spaces and then perform numericalization. Is that right? How should I proceed?

Basically, I don’t want to apply any of the defaults, such as a lang setting or the replace_rep rule for repetitions. The text is clean, but the tokens are not real words, just short sequences that act as ‘words’ and recur from time to time across different text files. Hence I think I need to build a language model from scratch based on simple rules (just splitting on spaces, I guess).

Any clues? Is anyone working on something similar?

Aninda · October 10, 2020, 3:54pm

Hi
You can use SubWord tokenization
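
For comparison, the baseline described in the question (split on spaces only, then map each token to an integer id) can be sketched in plain Python. This is only an illustration, not any library's API: the function names, the `<unk>` and `<pad>` special tokens, and the example sequences are all assumptions made up for this sketch. A subword tokenizer would replace the `tokenize` step and leave the rest unchanged.

```python
from collections import Counter

# Hypothetical special tokens; any reserved strings would do.
UNK, PAD = "<unk>", "<pad>"


def tokenize(text):
    """Split on whitespace only: no lowercasing, no repetition rules."""
    return text.split()


def build_vocab(token_lists, min_freq=1):
    """Build id<->token mappings, most frequent tokens first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency, then alphabetically, so ids are stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = [UNK, PAD] + [tok for tok, n in ordered if n >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to ids; unseen tokens fall back to the UNK id."""
    unk_id = stoi[UNK]
    return [stoi.get(tok, unk_id) for tok in tokens]


# Made-up example: two "files", each a line of 15-base sequences.
docs = [
    "ACGTACGTACGTACG TTGACCATGGCATGA ACGTACGTACGTACG",
    "TTGACCATGGCATGA GGGCCCAAATTTGGG",
]
token_lists = [tokenize(d) for d in docs]
itos, stoi = build_vocab(token_lists)
ids = [numericalize(toks, stoi) for toks in token_lists]
print(ids)
```

The resulting id lists are what a language model or classifier consumes; sequences not seen when the vocabulary was built all collapse to the `<unk>` id, which is one reason subword tokenization may work better on data like this.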