I was trying to figure out why my pickled language model’s file size is so huge (it’s bigger than the original dataset even after it’s been numericalized).
I found out that it's storing the original text in the dataset's `inner_df`. Once I've finished all my processing and am ready to save to disk with `data_lm.save`, is it safe to set `dataset.inner_df = None`, or does fastai need that DataFrame again at some point?
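To sanity-check that the retained raw text really dominates the pickle, here's a toy sketch; `FakeBunch`, its `inner_df`, and `items` attributes are illustrative stand-ins, not fastai's actual classes:

```python
import pickle

class FakeBunch:
    """Stand-in for a DataBunch: raw text plus numericalized ids."""
    def __init__(self, texts, ids):
        self.inner_df = texts   # raw text kept around after numericalization
        self.items = ids        # token ids actually used for training

texts = ["some fairly long document " * 50 for _ in range(200)]
ids = [list(range(100)) for _ in range(200)]

bunch = FakeBunch(texts, ids)
full = len(pickle.dumps(bunch))

bunch.inner_df = None           # drop the raw text before saving
stripped = len(pickle.dumps(bunch))

print(full, stripped)           # the stripped pickle is far smaller
assert stripped < full
```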
Edit 1: the loop below seems to reduce the file size of `.save()`'s pickled result by ~33%, but I haven't tried training yet, so I don't know whether it'll still work:
```python
for dl in data_lm.dls:
    dl.inner_df.drop(dl.inner_df.index, inplace=True)
```
Anything else to try removing? I pickled itos and stoi separately and they account for far less than 1% of the total file size.
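One note on pickling the vocab separately: only `itos` actually needs saving, since `stoi` is just its inverse and can be rebuilt on load. A minimal sketch (the vocab contents here are made up):

```python
import pickle

# Illustrative id->token list; in fastai this lives on data_lm.vocab.itos.
itos = ["xxunk", "xxpad", "the", "cat", "sat"]

blob = pickle.dumps(itos)          # only itos needs to go to disk...
itos_loaded = pickle.loads(blob)

# ...because the token->id mapping is fully recoverable from it:
stoi = {tok: i for i, tok in enumerate(itos_loaded)}
assert stoi["cat"] == 3
```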
Edit 2: since my vocabulary is smaller than 65,536 entries, its ids fit in uint16, and converting my dataset items reduced storage by another ~48% (I had them as 32-bit previously; I think the fastai default is 64-bit, via text/data.py:304). Again, I haven't trained a model like this yet, so we'll see whether it actually works:
```python
for dl in data_lm.dls:
    for i, item in enumerate(dl.items):
        dl.items[i] = item.astype(np.uint16)
```
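The size saving from the cast is easy to verify with a standalone numpy check (the id array here is synthetic):

```python
import numpy as np

# Token ids drawn from a vocab smaller than 65,536, so uint16 is lossless.
ids64 = np.arange(10_000, dtype=np.int64) % 60_000
ids16 = ids64.astype(np.uint16)

assert np.array_equal(ids16.astype(np.int64), ids64)  # no ids were clipped
print(ids64.nbytes, ids16.nbytes)  # 80000 20000 -> 4x smaller than int64
```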