Can I safely clear a DataSet's inner_df property?

I was trying to figure out why my pickled language model’s file size is so huge (it’s bigger than the original dataset even after it’s been numericalized).
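
To see which attributes were dominating the pickle, I used a quick helper along these lines (pickled_sizes is my own throwaway function, not a fastai API, and I'm assuming the dataset's attributes are reachable via vars()):

import pickle

def pickled_sizes(obj):
    """Rough pickled size of each attribute of obj, largest first."""
    sizes = []
    for name, value in vars(obj).items():
        try:
            sizes.append((name, len(pickle.dumps(value))))
        except Exception:
            sizes.append((name, -1))  # can't pickle this attribute on its own
    return sorted(sizes, key=lambda kv: kv[1], reverse=True)

# e.g. inspect the training TextList (data_lm.train_ds.x in fastai v1)
for name, size in pickled_sizes(data_lm.train_ds.x)[:5]:
    print(f"{name}: {size / 1e6:.1f} MB")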

I found out that it’s storing the original text in the inner_df property.

Once I’ve finished all my processing and am ready to save to disk with data_lm.save, is it safe to set dataset.inner_df = None, or will fastai need it again at some point?

Edit 1: this seems to reduce the file size of .save()'s pickled result by ~33%, but I haven’t tried training yet to know whether it’ll still work:

for dl in data_lm.dls:
    # empty the DataFrame in place so the pickle no longer carries the raw text
    dl.inner_df.drop(dl.inner_df.index, inplace=True)
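
To put a number on it, I just saved and checked the file size (the filename is a placeholder; as far as I can tell, fastai v1's save() writes relative to data_lm.path):

import os

data_lm.save('data_lm_trimmed.pkl')  # placeholder filename
size_mb = os.path.getsize(data_lm.path / 'data_lm_trimmed.pkl') / 1e6
print(f"pickled DataBunch: {size_mb:.1f} MB")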

Anything else to try removing? I pickled itos and stoi separately and they account for far less than 1% of the total file size.
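
For reference, measuring the vocab separately looked roughly like this (filenames are placeholders, and I'm assuming the standard fastai v1 Vocab with .itos/.stoi):

import pickle

with open('itos.pkl', 'wb') as f:
    pickle.dump(data_lm.vocab.itos, f)        # id -> token list
with open('stoi.pkl', 'wb') as f:
    pickle.dump(dict(data_lm.vocab.stoi), f)  # token -> id mapping, as a plain dict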

Edit 2: since my vocabulary has fewer than 65,536 entries, I should be able to convert my dataset items to uint16, and that reduced the storage by another ~48% (I had them as 32-bit previously; I think the fastai default is 64-bit, via text/data.py:304). Again, I haven’t trained a model like this yet, so we’ll see whether it actually works.

import numpy as np

for dl in data_lm.dls:
    for i, item in enumerate(dl.items):
        # downcast each numericalized example to uint16
        dl.items[i] = item.astype(np.uint16)
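
One sanity check I’d add before downcasting (my own addition, not from the fastai docs): confirm no token id exceeds the uint16 range, since anything above 65535 would silently wrap around:

import numpy as np

max_id = max(int(item.max()) for dl in data_lm.dls for item in dl.items)
assert max_id <= np.iinfo(np.uint16).max, f"token id {max_id} won't fit in uint16"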

If it’s not tabular data, the inner_df isn’t used after labeling, so it should be safe to drop it, yes.
