I’ve been investigating why it takes so much system memory to deal with language model data in fastai. I’ve found that after tokenization and numericalization, the pickled representation takes about 3x as much space on disk (and in memory) as the original data.
After much experimentation, I’ve found that there are three main causes of this:
- `dataset.inner_df` stores a complete copy of the text that is never used
- Each `Text` object stores a text representation of its numericalized ids
- The default dtype of numericalized items and lists is `np.int64`
Removing the first two things and changing the types to np.uint16 reduces file usage by ~65% (for my 1.0GB csv, the processed version goes from 2.9GB pickled down to 1.0GB). I wish I could get that smaller… but python/pickle seems to have quite a bit of overhead (and I’m more concerned about memory size than disk size anyway)
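To put rough numbers on the dtype part alone, here is a minimal sketch. It uses made-up token ids rather than real fastai data (the array length and the id range are assumptions for illustration) and compares raw buffer size and pickled size for `np.int64` versus `np.uint16`.

```python
import pickle
import numpy as np

# Hypothetical stand-in for one numericalized document:
# 100,000 token ids, all below 60,000 so they fit in 16 bits.
rng = np.random.default_rng(0)
ids64 = rng.integers(0, 60_000, size=100_000, dtype=np.int64)
ids16 = ids64.astype(np.uint16)  # lossless here because every id < 65,536

# Raw buffer sizes in bytes (8 bytes per id vs 2 bytes per id)
print(ids64.nbytes, ids16.nbytes)

# Pickled sizes in bytes (buffer plus a small fixed header per array)
print(len(pickle.dumps(ids64)), len(pickle.dumps(ids16)))

# The conversion did not change any value
assert np.array_equal(ids64, ids16)
```

The per-array pickle header is small, so for long documents the saving from the dtype change alone approaches 4x; the lower overall figure I measured includes other Python and pickle overhead.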
Dropping the stored `text` property is no problem, because we can instead just store a reference to the `Vocab` and use that to reconstruct the text in `__getattr__` when needed.
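A self-contained toy version of that idea, runnable without fastai, might look like the following. `TinyVocab` and `LazyText` are hypothetical stand-ins I made up for illustration; only the `textify` method name mirrors what the real `Vocab` offers.

```python
import numpy as np

class TinyVocab:
    "Hypothetical stand-in for fastai's Vocab: maps ids back to tokens."
    def __init__(self, itos): self.itos = itos
    def textify(self, ids): return ' '.join(self.itos[i] for i in ids)

class LazyText:
    "Toy item that stores only the ids plus a reference to a shared vocab."
    def __init__(self, ids, vocab):
        self.data, self.vocab = np.asarray(ids), vocab

    def __getattr__(self, attr):
        # Python calls this only when normal attribute lookup fails,
        # so `data` and `vocab` never pass through here once they are set.
        if attr == 'text': return self.vocab.textify(self.data)
        raise AttributeError(attr)

vocab = TinyVocab(['xxunk', 'hello', 'world'])
items = [LazyText([1, 2], vocab), LazyText([2, 2, 1], vocab)]
print(items[0].text)  # rebuilt on demand, never stored on the item
print(items[1].text)
```

Every item points at the same vocab object, so the per-item cost is one reference instead of a full string.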
The downside to `uint16` is that token ids are capped at 65535 (so a vocab of at most 65,536 entries), but since we know the `Vocab` size at `Text` creation time, it would be easy to detect when a `uint32` is needed.
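One way that detection could work is sketched below. `smallest_uint_dtype` is a hypothetical helper, not something fastai provides; it assumes ids run from 0 to `vocab_size - 1`.

```python
import numpy as np

def smallest_uint_dtype(vocab_size):
    "Hypothetical helper: smallest unsigned dtype that can hold ids 0..vocab_size-1."
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if vocab_size - 1 <= np.iinfo(dtype).max: return dtype
    raise ValueError(f"vocab_size {vocab_size} is too large")

print(smallest_uint_dtype(60_000))  # fits in 16 bits
print(smallest_uint_dtype(70_000))  # needs 32 bits

# Usage: pick the dtype once from the vocab size, then apply it to every item
ids = np.array([0, 5, 59_999], dtype=smallest_uint_dtype(60_000))
print(ids.dtype, ids.nbytes)
```

NumPy's `np.min_scalar_type` does something similar for a single value, so passing it the largest id would be an alternative to the explicit loop.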
Thoughts on removing `inner_df`, linking `Vocab` instead of `text`, and shrinking the numericalized data to the smallest necessary type? And how to best implement?
Here are the code snippets I used for testing:

    import numpy as np

    # Remove inner_df: drop every row in place, since this copy of the text is never used
    for dl in (data_lm.train_dl, data_lm.valid_dl):
        df = dl.dl.dataset.inner_df
        df.drop(df.index, inplace=True)

    # Convert each numericalized item to uint16 (only safe while every id fits in 16 bits)
    for dl in data_lm.dls:
        for i, arr in enumerate(dl.items):
            dl.items[i] = arr.astype(np.uint16)
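And the modified `Text` item, which keeps a reference to the vocab instead of the text:

    import numpy as np
    from fastai.core import ItemBase  # fastai v1

    class Text(ItemBase):
        "Basic item for `text` data in numericalized `ids`."
        def __init__(self, ids, vocab):
            self.data, self.vocab = np.array(ids, dtype=np.uint16), vocab

        def __getattr__(self, attr):
            # Only called when normal lookup fails: rebuild `text` from the vocab on demand
            if attr == 'text':
                if self.vocab is None: return str(self.data)
                return str(self.vocab.textify(self.data))
            # Without this, every unknown attribute silently returns None,
            # which breaks hasattr() and pickling
            raise AttributeError(attr)

        def __str__(self): return self.text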