NLP - recommended data storage format to store large tokenized text

For a current deep learning project, I am working with a dataset that is relatively large by deep learning standards: the One Billion Word Benchmark. I need to load this dataset efficiently, in a form suitable for training deep learning models.

After a lot of testing, I’ve found the following about preprocessing the dataset.

  • Using the raw text files (.txt), I won’t be able to use progressive loading (loading each batch on the fly).
  • Tokenizing the dataset with the spaCy tokenizer takes a lot of memory. Therefore, I can’t tokenize the entire dataset at once, even if the raw dataset fits into memory. This means that I will need to process the dataset in chunks.
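
To make the chunked processing concrete, here is a minimal sketch of what I mean. The names read_chunks, tokenize and tokenize_in_chunks are made up for this post, and the whitespace split is only a stand-in for the spaCy tokenizer so that the snippet has no dependencies.

```python
import itertools
from typing import Iterable, Iterator, List


def read_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size lines."""
    iterator = iter(lines)
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def tokenize(line: str) -> List[str]:
    # Stand-in tokenizer (assumption): plain whitespace split.
    # A real tokenizer such as spaCy's would be swapped in here.
    return line.split()


def tokenize_in_chunks(
    lines: Iterable[str], chunk_size: int = 10_000
) -> Iterator[List[List[str]]]:
    """Tokenize one chunk at a time so only chunk_size lines are held in memory."""
    for chunk in read_chunks(lines, chunk_size):
        yield [tokenize(line) for line in chunk]
```

Since an open file object is itself an iterable of lines, it can be passed straight in as lines, and each tokenized chunk can be written to disk before the next one is read.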

Of course, after preprocessing the dataset, I have to store the data back on disk, which raises the question of which storage strategy is the most efficient. Here is what I have tried so far:

  • HDF5 (pandas HDFStore and h5py): The main problem with HDF5 is that it is designed mainly for numerical data and is significantly slower for text data.
  • Feather: Extremely fast and efficient, but no querying.
  • Parquet: Fast, but querying becomes inefficient when the file has many row groups.
  • Msgpack: Very fast, but again no querying.
  • CSV: Much faster than usual because the data is text only, but still slow.
  • Plain text: Surprisingly fast. Tokenized text is written with a for loop and read back with f.readlines() for querying. However, the data ends up spread across many files, with the metadata stored in a separate file.
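
For reference, here is a rough sketch of the plain text approach. It is a variation on what I actually tested: it appends to a single data file instead of many, and keeps the byte offset of every line in a separate JSON metadata file so that a range of lines can be read without f.readlines() on the whole file. The function names and file layout are made up for illustration, and it assumes tokens never contain whitespace.

```python
import json
import os


def append_tokenized(data_path, index_path, tokenized_lines):
    """Append tokenized lines to data_path and record each line's byte offset."""
    offsets = []
    if os.path.exists(index_path):
        with open(index_path, "r", encoding="utf-8") as f:
            offsets = json.load(f)["offsets"]
    with open(data_path, "ab") as f:
        f.seek(0, os.SEEK_END)  # make sure tell() reports the end of the file
        for tokens in tokenized_lines:
            offsets.append(f.tell())
            # One line per sample, tokens joined by single spaces.
            f.write((" ".join(tokens) + "\n").encode("utf-8"))
    # The metadata lives in a separate file, rewritten after each append.
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump({"offsets": offsets}, f)


def read_lines(data_path, index_path, start, stop):
    """Return tokenized lines in [start, stop) by seeking to the first one."""
    with open(index_path, "r", encoding="utf-8") as f:
        offsets = json.load(f)["offsets"]
    if start >= len(offsets):
        return []
    result = []
    with open(data_path, "rb") as f:
        f.seek(offsets[start])
        for _ in range(start, min(stop, len(offsets))):
            result.append(f.readline().decode("utf-8").split())
    return result
```

The offsets list still has to fit in memory and is rewritten on every append, so this is only a sketch of the idea, not something tuned for a billion-word corpus.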

After testing these formats, I can give a general guideline to what is required or preferred:

Required:

  • Reading needs to support querying for efficient RAM management
  • IO time can’t be unreasonable

Preferred:

  • Lower RAM consumption when writing/reading
  • Small file size(s)
  • Only use one file (i.e., be able to append to the file)
  • Support for metadata in file

I am trying to find a general data storage format with high flexibility and speed that can store both numerical and text data.


Assuming you are using fast.ai, the TextDataBunch will contain your preprocessed text. You can then use the supplied save method and the load_data function to write it to disk and read it back.

You may need to experiment with the different methods of creating the TextDataBunch, although from_folder sounds like the most relevant. A chunksize for the processors (Tokenizer and Numericalizer) can be specified as a parameter.
