Memory fills up while tokenizing a large DNA dataset

Hello, I am trying to do NLP on a DNA-based dataset of 500 files, each about 3-6 MB in size. I am tokenizing the files with the following code:

ksize = 10
# Just a list ['A', 'T', 'C', 'G', 'X']
dna_alphabet_w_sep_char = supported_languages['dna'] + ['X']
# Build a vocab containing every possible k-mer of length 'ksize' over the alphabet above
model_voc = BioVocab.create_from_ksize(ksize=ksize, alphabet=dna_alphabet_w_sep_char)
# Tokenizer that breaks input sequences into list of k-mers with BOS and EOS set for each k-mer
tok = BioTokenizer(ksize=ksize, stride=1)
dls_lm = TextDataLoaders.from_folder(temp_path, is_lm=True, valid_pct=0.1, seq_len=ksize, text_vocab=model_voc, tok_tfm=tok)
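
For context, if I understand create_from_ksize correctly, the vocab size grows exponentially with ksize. A quick sanity check in plain Python (no library calls, just the counting):

alphabet_size = 5                       # A, T, C, G plus the 'X' separator
ksize = 10
vocab_entries = alphabet_size ** ksize  # every possible k-mer of length 10
print(vocab_entries)                    # 9,765,625 entries before any special tokens

So the vocab alone should already be close to ten million strings, assuming my reading of the comment above is right.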

I have tried tokenizing in both parallel and single-threaded mode. In both cases the tok folder is generated and the files are slowly loaded until memory is full and the program has to be stopped; no progress is ever made on actually tokenizing the files. I have also tried dropping the number of files from 500 down to 100, with the same memory fill-up problem. In total, the 500 files take up about 2.5 GB on disk, so I would have thought the maximum RAM usage would be around 2.5 GB, but that does not seem to be the case.
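
In case it is useful, here is the rough back-of-the-envelope I would use for how much a single file grows once it is split into overlapping 10-mers. The three-tokens-per-position figure assumes the tokenizer really does emit BOS and EOS for every k-mer, as described in the comment above, and the per-string size is just what CPython reports for a 10-character string:

import sys

ksize, stride = 10, 1
n_chars = 3 * 1024**2                              # one ~3 MB file of raw bases
n_kmers = (n_chars - ksize) // stride + 1          # one k-mer per position at stride=1
n_tokens = n_kmers * 3                             # k-mer + BOS + EOS per position
kmer_bytes = n_kmers * sys.getsizeof('A' * ksize)  # ~59 bytes per 10-char str on CPython
print(n_tokens, kmer_bytes / 1024**2)              # ~9.4M tokens and ~180 MB just for the k-mer strings

If that estimate is anywhere near right, each file takes tens of times its on-disk size once tokenized in memory, but I would appreciate a sanity check on whether that is actually what is going on here.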

Computer specs:
RAM: 32 GB
CPU: 8 cores, 16 threads
GPU: GTX 1080
OS: Windows 10

The versions of libraries I am using:
fastai: 2.5.3
fastcore: 1.3.26
pytorch: 1.10.0

Does anyone have suggestions or ideas on why this may be happening, and how to get around it? These 500 files are only a small portion of my overall corpus of 16,000 files. My goal is to eventually use all of them, but right now I am just trying to get a small working example.

Thank you for any help.