I was trying to figure out why my pickled language model’s file size is so huge (it’s bigger than the original dataset even after it’s been numericalized).
I found out that it's storing the original text in the dataset's `inner_df`. Once I've finished all my processing and am ready to save to disk with `data_lm.save`, is it safe to set `dataset.inner_df = None`, or does fastai need that DataFrame again at some point?
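To sanity-check that the retained raw text really dominates the pickle, here's a toy sketch; `FakeBunch`, its `inner_df`, and `items` attributes are illustrative stand-ins, not fastai's actual classes:

```python
import pickle

class FakeBunch:
    """Stand-in for a DataBunch: raw text plus numericalized ids."""
    def __init__(self, texts, ids):
        self.inner_df = texts   # raw text kept around after numericalization
        self.items = ids        # token ids actually used for training

texts = ["some fairly long document " * 50 for _ in range(200)]
ids = [list(range(100)) for _ in range(200)]

bunch = FakeBunch(texts, ids)
full = len(pickle.dumps(bunch))

bunch.inner_df = None           # drop the raw text before saving
stripped = len(pickle.dumps(bunch))

print(full, stripped)           # the stripped pickle is far smaller
assert stripped < full
```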
Edit 1: the loop below seems to reduce the file size of `.save()`'s pickled result by ~33%, but I haven't tried training yet, so I don't know whether it'll still work:
```python
for dl in data_lm.dls:
    dl.inner_df.drop(dl.inner_df.index, inplace=True)
```
Anything else to try removing? I pickled itos and stoi separately and they account for far less than 1% of the total file size.
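One note on pickling the vocab separately: only `itos` actually needs saving, since `stoi` is just its inverse and can be rebuilt on load. A minimal sketch (the vocab contents here are made up):

```python
import pickle

# Illustrative id->token list; in fastai this lives on data_lm.vocab.itos.
itos = ["xxunk", "xxpad", "the", "cat", "sat"]

blob = pickle.dumps(itos)          # only itos needs to go to disk...
itos_loaded = pickle.loads(blob)

# ...because the token->id mapping is fully recoverable from it:
stoi = {tok: i for i, tok in enumerate(itos_loaded)}
assert stoi["cat"] == 3
```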
Edit 2: since my vocabulary is smaller than 65,536 entries, its ids fit in uint16, and converting my dataset items reduced storage by another ~48% (I had them as 32-bit previously; I think the fastai default is 64-bit, via text/data.py:304). Again, I haven't trained a model like this yet, so we'll see whether it actually works:
```python
for dl in data_lm.dls:
    for i, item in enumerate(dl.items):
        dl.items[i] = item.astype(np.uint16)
```
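The size saving from the cast is easy to verify with a standalone numpy check (the id array here is synthetic):

```python
import numpy as np

# Token ids drawn from a vocab smaller than 65,536, so uint16 is lossless.
ids64 = np.arange(10_000, dtype=np.int64) % 60_000
ids16 = ids64.astype(np.uint16)

assert np.array_equal(ids16.astype(np.int64), ids64)  # no ids were clipped
print(ids64.nbytes, ids16.nbytes)  # 80000 20000 -> 4x smaller than int64
```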