I’ve been investigating why it takes so much system memory to deal with language model data in fastai. I’ve found that after tokenization and numericalization, the pickled representation takes about 3x as much space on disk (and in memory) as the original data.
After much experimentation, I’ve found that there are three main causes of this:
- `dataset.inner_df` stores a complete copy of the text that is never used
- Each `Text` object stores a text representation of its numericalized data
- The default dtype of numericalized items and lists is `np.int64`
Removing the first two and converting the ids to `np.uint16` reduces the pickled size by ~65% (for my 1.0GB csv, the processed version goes from 2.9GB pickled down to 1.0GB). I wish I could get that smaller… but Python/pickle seems to have quite a bit of overhead (and I’m more concerned about memory size than disk size anyway).
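To make the dtype part of the savings concrete, here's a quick sketch comparing the raw array sizes (arrays alone, not counting pickle overhead):

```python
import numpy as np

# The same 1M token ids stored at the default int64 vs uint16:
ids = np.random.randint(0, 60000, size=1_000_000)

print(ids.astype(np.int64).nbytes)   # 8000000 bytes (8 per id)
print(ids.astype(np.uint16).nbytes)  # 2000000 bytes (2 per id)
```

So the ids themselves shrink 4x; the overall pickle shrinks less because of the other per-object overhead.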
Removing the `text` property is no problem because we can instead just store a reference to the `Vocab` and use that to reconstruct `text` in `__getattr__` when needed.
The downside to `uint16` is a maximum id of 65535… but since we know the `Vocab` size at `Text` creation time, it'd be easy to detect when a `uint32` is needed.
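That detection could look something like the sketch below (`smallest_uint_dtype` is a hypothetical helper name, not anything in fastai):

```python
import numpy as np

def smallest_uint_dtype(vocab_size):
    """Pick the smallest unsigned integer dtype that can hold ids 0..vocab_size-1."""
    if vocab_size <= np.iinfo(np.uint16).max + 1:
        return np.uint16
    if vocab_size <= np.iinfo(np.uint32).max + 1:
        return np.uint32
    return np.uint64

# A typical 60k-token LM vocab fits in uint16; anything larger falls back to uint32.
print(smallest_uint_dtype(60000))  # <class 'numpy.uint16'>
print(smallest_uint_dtype(70000))  # <class 'numpy.uint32'>
```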
Thoughts on removing inner_df, linking Vocab instead of text, and shrinking the numericalized data to the smallest necessary type? And how to best implement?
Here are the code snippets I used for testing:
```python
# remove inner_df (it holds a full copy of the raw text that is never used again)
data_lm.train_dl.dl.dataset.inner_df.drop(data_lm.train_dl.dl.dataset.inner_df.index[:], inplace=True)
data_lm.valid_dl.dl.dataset.inner_df.drop(data_lm.valid_dl.dl.dataset.inner_df.index[:], inplace=True)

# convert the numericalized ids from the default np.int64 down to np.uint16
for dl in data_lm.dls:
    for i, arr in enumerate(dl.items):
        dl.items[i] = arr.astype(np.uint16)
```
```python
class Text(ItemBase):
    "Basic item for `text` data in numericalized `ids`."
    def __init__(self, ids, vocab): self.data,self.vocab = np.array(ids, dtype=np.uint16),vocab
    def __getattr__(self, attr):
        # Reconstruct `text` lazily from the vocab instead of storing it
        if attr == 'text':
            if self.vocab is None: return str(self.data)
            return str(self.vocab.textify(self.data))
        # Raise for anything else so pickling and hasattr() behave normally
        raise AttributeError(attr)
    def __str__(self): return self.text
```
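For reference, here's how the lazy `text` reconstruction behaves with a toy vocab. `ToyVocab` is a minimal stand-in for fastai's `Vocab` (just enough for `textify()`), and this `Text` is a simplified stand-alone version without the `ItemBase` base class:

```python
import numpy as np

class ToyVocab:
    "Minimal stand-in for fastai's Vocab: itos maps ids back to tokens."
    def __init__(self, itos): self.itos = itos
    def textify(self, nums): return ' '.join(self.itos[i] for i in nums)

class Text:
    "Simplified stand-alone version of the class above (no ItemBase)."
    def __init__(self, ids, vocab):
        self.data, self.vocab = np.array(ids, dtype=np.uint16), vocab
    def __getattr__(self, attr):
        if attr == 'text':
            if self.vocab is None: return str(self.data)
            return str(self.vocab.textify(self.data))
        raise AttributeError(attr)
    def __str__(self): return self.text

vocab = ToyVocab(['xxunk', 'hello', 'world'])
t = Text([1, 2], vocab)
print(t.text)         # hello world
print(t.data.nbytes)  # 4 -- two uint16 ids, 2 bytes each
```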