Decreasing memory (and serialization) requirements of LM Text

I’ve been investigating why it takes so much system memory to deal with language model data in fastai. I’ve found that after tokenization and numericalization, the pickled representation takes about 3x as much space on disk (and in memory) as the original data.

After much experimentation, I’ve found that there are three main causes of this (a quick way to check is sketched right after the list):

  1. dataset.inner_df stores a complete copy of the text that is never used
  2. Each Text object stores a text representation of its numericalized data
  3. The default dtype of the numericalized items and lists is np.int64
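
For reference, the first and third causes are easy to confirm directly on the DataBunch. A quick sketch, using the same attribute paths as the snippets further down (they may differ slightly between fastai versions):

ds = data_lm.train_dl.dl.dataset
print(ds.inner_df.memory_usage(deep=True).sum())  # 1. the full raw-text copy, in bytes
print(ds.items[0].dtype)                          # 3. int64 by default...
print(sum(arr.nbytes for arr in ds.items))        #    ...i.e. 8 bytes per token id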

Removing the first two and changing the dtype to np.uint16 reduces the pickled size by ~65% (for my 1.0GB csv, the processed version goes from 2.9GB pickled down to 1.0GB). I wish I could get it smaller… but python/pickle seems to have quite a bit of overhead (and I’m more concerned about memory size than disk size anyway).
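
The sizes can be checked with something like this (a sketch, assuming the standard DataBunch.save; 'tmp_lm.pkl' is just a throwaway filename):

import os

# pickle the processed DataBunch and check the resulting file size
data_lm.save('tmp_lm.pkl')
print(f"{os.path.getsize(data_lm.path/'tmp_lm.pkl') / 1e9:.2f} GB")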

Removing the text property is no problem: we can instead store a reference to the Vocab and use it to reconstruct the text in __getattr__ when needed (see the modified Text class in the snippets below).

The downside to uint16 is that ids max out at 65,535… but since we know the Vocab size at Text creation time, it’d be easy to detect whether uint32 is needed, as sketched below.
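
Something along these lines could pick the dtype at Text creation time (a hypothetical helper, not existing fastai API; it only assumes Vocab keeps its id-to-token list in itos):

import numpy as np

def ids_dtype(vocab):
    # smallest unsigned integer type that can hold every id in this vocab
    return np.uint16 if len(vocab.itos) - 1 <= np.iinfo(np.uint16).max else np.uint32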

Thoughts on removing inner_df, linking the Vocab instead of the text, and shrinking the numericalized data to the smallest necessary type? And how best to implement it?


Here are the code snippets I used for testing:

# remove inner_df (the full text copy is never used after processing)
train_ds = data_lm.train_dl.dl.dataset
valid_ds = data_lm.valid_dl.dl.dataset
train_ds.inner_df.drop(train_ds.inner_df.index, inplace=True)
valid_ds.inner_df.drop(valid_ds.inner_df.index, inplace=True)

# convert the numericalized ids to uint16 (from the default int64)
import numpy as np

for dl in data_lm.dls:
    for i, arr in enumerate(dl.items):
        dl.items[i] = arr.astype(np.uint16)
class Text(ItemBase):
    "Basic item for `text` data in numericalized `ids`."
    def __init__(self, ids, vocab): self.data,self.vocab = np.array(ids, dtype=np.uint16),vocab
    def __getattr__(self, attr):
        # rebuild the text lazily from the vocab instead of storing it
        if attr == 'text':
            if self.vocab is None: return str(self.data)
            return str(self.vocab.textify(self.data))
        raise AttributeError(attr)
    def __str__(self):  return self.text
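
With that change the string form is only rebuilt when something actually asks for it, e.g. (assuming data_lm.vocab is the shared Vocab; the ids are arbitrary examples):

t = Text([2, 45, 1037], vocab=data_lm.vocab)
print(t.text)        # looked up through vocab.textify on demand
print(t.data.dtype)  # uint16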

For really big datasets, I don’t think you should use the preprocessors, for all those reasons. You can just define a TextItemList with the ids as items. Storing the texts, for instance, is only there for display purposes, and you don’t need it.

Not sure I follow. You still need to go from strings to numbers, no? And still need to define a vocabulary.

Yes, but for an insanely large dataset you should only store the ids in memory, and possibly only load them dynamically.
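
For concreteness, skipping the preprocessors could look roughly like this (a sketch assuming fastai v1’s Tokenizer, Vocab.create, Vocab.numericalize and TextLMDataBunch.from_ids; train_texts, valid_texts and path are placeholders):

import numpy as np
from fastai.text import Tokenizer, Vocab, TextLMDataBunch

# tokenize and numericalize yourself, keeping only the ids in memory
tok = Tokenizer()
train_tok = tok.process_all(train_texts)   # list of token lists
valid_tok = tok.process_all(valid_texts)

vocab = Vocab.create(train_tok, max_vocab=60000, min_freq=2)
to_ids = lambda toks: np.array(vocab.numericalize(toks), dtype=np.uint16)
train_ids = [to_ids(t) for t in train_tok]
valid_ids = [to_ids(t) for t in valid_tok]

data_lm = TextLMDataBunch.from_ids(path, vocab, train_ids, valid_ids)

Whether everything downstream is happy with uint16 ids (batches may still end up materialized as int64 tensors) would need testing; the savings are in what sits in RAM and in the pickle.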


Yeah, I think I’m on the border with ~64GB. That triples once processed, though, which isn’t helping.

Trying to squeeze it down so I don’t have to use such specialized cloud instances (many of the vast.ai machines I’ve been using only have ~128GB, and my local machine only has 64).