For a current deep learning project, I am working with a dataset that is relatively large by deep learning standards: the One Billion Word Benchmark. I need to load this dataset efficiently in a way that is suitable for deep learning modeling.
After a lot of testing, I’ve found the following about preprocessing the dataset:
- Using the raw text files (.txt), I won’t be able to use progressive loading (loading each batch on the fly).
- Tokenizing the dataset with the spaCy tokenizer takes a lot of memory, so I can’t tokenize the entire dataset at once, even if the dataset itself fits into memory. This means I need to process the dataset in chunks.
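For clarity, the chunked processing I mean can be sketched as follows. This is a minimal illustration; it uses a plain whitespace split as a stand-in for the spaCy tokenizer, and the function name is my own:

```python
def tokenize_in_chunks(lines, tokenize, chunk_size=10_000):
    """Tokenize an iterable of lines in fixed-size chunks so the
    whole corpus never has to be tokenized (or held) at once."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            # Tokenize and hand off one chunk, then drop it from memory
            yield [tokenize(s) for s in chunk]
            chunk = []
    if chunk:  # leftover lines that didn't fill a full chunk
        yield [tokenize(s) for s in chunk]

# Usage: in practice, `tokenize` would be the spaCy tokenizer
chunks = list(tokenize_in_chunks(["a b", "c d e", "f"], str.split, chunk_size=2))
# chunks == [[["a", "b"], ["c", "d", "e"]], [["f"]]]
```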
Of course, after preprocessing, the data has to be stored back on disk, which raises the question of which storage strategy is most efficient. Here is what I have tried so far:
- HDF5 (pandas HDFStore and h5py): The main problem with HDF5 in general is that it is designed primarily for numerical data and is significantly slower with text data.
- Feather: Extremely fast and efficient, but no querying
- Parquet: Fast, but querying becomes inefficient with many row groups
- Msgpack: Very fast, but again no querying
- CSV: Faster than usual since the data is text-only, but still slow
- Plain text: Surprisingly fast, but writing the tokenized text uses a for loop, and reading uses f.readlines() for querying. The data also ends up spread across many files, with metadata stored in a separate file as well.
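To make the plain-text approach concrete, here is roughly what that write/read pattern looks like (file paths and function names are illustrative, not my actual code):

```python
import os
import tempfile

def write_tokens(path, token_lists):
    # Write one space-joined tokenized sentence per line, in a for loop
    with open(path, "w", encoding="utf-8") as f:
        for tokens in token_lists:
            f.write(" ".join(tokens) + "\n")

def read_token_range(path, start, count):
    # "Querying" is just slicing the result of f.readlines(),
    # which still loads the entire file into memory first
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    return [line.split() for line in lines[start:start + count]]

# Usage with a temporary file
path = os.path.join(tempfile.mkdtemp(), "tokens.txt")
write_tokens(path, [["the", "cat"], ["sat"], ["on", "the", "mat"]])
batch = read_token_range(path, 1, 2)
# batch == [["sat"], ["on", "the", "mat"]]
```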
After testing these formats, I can give a general guideline for what is required or preferred:
- Reading needs to support querying for efficient RAM management
- IO time can’t be unreasonable
- Low RAM consumption when writing/reading
- Small file sizes
- Only one file (i.e., it should be possible to append to the file)
- Support for metadata in file
I am trying to find a general-purpose data storage format with high flexibility and speed that can store both numerical and text data.
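To pin down what I mean by "querying for efficient RAM management": the reader should be able to fetch an arbitrary batch of records without loading everything, roughly like the stdlib sketch below (a byte-offset index over a plain text file; this is only an illustration of the requirement, and all names are my own, not a format recommendation):

```python
import os
import tempfile

def build_line_index(path):
    """Record the byte offset of each line so individual lines can
    later be fetched without reading the whole file."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def query_lines(path, offsets, start, count):
    """Read `count` consecutive lines beginning at `start` via seek;
    RAM use is proportional to the batch, not the file."""
    stop = min(start + count, len(offsets))
    with open(path, "rb") as f:
        f.seek(offsets[start])
        return [f.readline().decode("utf-8").rstrip("\n")
                for _ in range(stop - start)]

# Usage (file path is illustrative)
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\nthird line\n")
idx = build_line_index(path)
batch = query_lines(path, idx, 1, 2)
# batch == ["second line", "third line"]
```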