Saving and loading data and models

I’d like to use this thread for questions about saving and loading databunches and models.

  1. I’ve seen both data.save and data.export used for saving a newly created databunch built with the datablock API. What is the difference between the two, and when should each be used?
  2. While working with text data for a language model, I was able to run the following code to create, save, and load the text databunch:

Creation and save:

tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)
data_lm = (TextList.from_df(texts_df, path, col=['name', 'item_description'], processor=[tok_proc, num_proc])
          .random_split_by_pct(0.1)
          .label_for_lm()
          .databunch())

data_lm.save('lm-toknum')

Load:

tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)
data_lm = TextLMDataBunch.load(path, 'lm-toknum', processor=[tok_proc, num_proc])
data_lm.show_batch()

What does data_lm.export() do in this context?

  3. Similar to the above, I created a tabular databunch using the datablock API:

data_str = (TabularList.from_df(train_df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=[Categorify], test_df=test_df)
           .split_from_df(col='is_valid')
           .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
           .databunch(bs=128))

But when I tried to save it using data_str.save(), I got the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-76-6facd85bdfbe> in <module>
----> 1 data_str.save('data-str')

~/fastai/fastai/basic_data.py in __getattr__(self, k)
    118         return cls(*dls, path=path, device=device, tfms=tfms, collate_fn=collate_fn, no_check=no_check)
    119 
--> 120     def __getattr__(self,k:int)->Any: return getattr(self.train_dl, k)
    121 
    122     def dl(self, ds_type:DatasetType=DatasetType.Valid)->DeviceDataLoader:

~/fastai/fastai/basic_data.py in __getattr__(self, k)
     33 
     34     def __len__(self)->int: return len(self.dl)
---> 35     def __getattr__(self,k:str)->Any: return getattr(self.dl, k)
     36 
     37     @property

~/fastai/fastai/basic_data.py in DataLoader___getattr__(dl, k)
     18 torch.utils.data.DataLoader.__init__ = intercept_args
     19 
---> 20 def DataLoader___getattr__(dl, k:str)->Any: return getattr(dl.dataset, k)
     21 DataLoader.__getattr__ = DataLoader___getattr__
     22 

~/fastai/fastai/data_block.py in __getattr__(self, k)
    504         res = getattr(y, k, None)
    505         if res is not None: return res
--> 506         raise AttributeError(k)
    507 
    508     def __getitem__(self,idxs:Union[int,np.ndarray])->'LabelList':

AttributeError: save

How do I save this databunch?

  4. I was able to use the export method to save something (I’m not sure what was saved). But how do I load this saved databunch back into a variable later (similar to loading the text databunch shown earlier)?

Thanks.

Save and export will probably be merged in the future. Note that there is no save attribute for tabular data; it only exists for text, which is why you see that error.
Export saves the inner state of your databunch (not the data but everything else: transforms, normalization, processors, classes…) for inference. Save stores the data itself in the case of text, to avoid redoing the tokenization/numericalization.

Thank you for your reply. Let’s say I do all my preprocessing for tabular data, create a databunch, and use the export method to save the inner state. How would I load that (along with my data) back in another session to start or continue training?

For now there is nothing apart from creating it each time you run the notebook. We’ll work on serialization after the course.

This is off-topic, sorry about that, but I didn’t want to create a new thread just for this small question. The formula for determining the embedding sizes of categorical variables is:

def emb_sz_rule(n_cat:int)->int: return min(600, round(1.6 * n_cat**0.56))

Is there any reference for this? How was it chosen? I’m guessing that since word2vec itself requires a maximum of 600 and there are more than 100,000 words, we don’t really need more than that. But what about the second argument?
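To make the formula concrete, here is a minimal sketch that evaluates the rule quoted above for a few cardinalities. The sample values of n_cat are illustrative assumptions, not taken from any dataset in this thread, and the sketch says nothing about where the constants come from.

```python
def emb_sz_rule(n_cat: int) -> int:
    """Embedding width for a categorical variable with n_cat levels (formula quoted above)."""
    return min(600, round(1.6 * n_cat ** 0.56))


# Illustrative cardinalities (hypothetical, chosen only to show the growth and the cap).
for n_cat in (10, 100, 1_000, 100_000):
    print(n_cat, emb_sz_rule(n_cat))
# prints:
# 10 6
# 100 21
# 1000 77
# 100000 600
```

The second argument grows sublinearly with the number of levels, and the cap of 600 only takes effect at roughly 40,000 levels and above.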

Actually, with fastai.tabular there are both options (export, save). Do I even need to use save/load, or is it enough to use export/load_learner for inference? I don’t understand why the ‘saved’ model is 3x bigger than the ‘exported’ pickle… do you lose anything by just exporting?

It’s really weird that your saved model is 3x bigger than the exported pickle, since both are saved by torch behind the scenes, and save only saves the model weights, whereas export saves the architecture, the state of the processors…
In any case, yes, export followed by load_learner is the recommended way to do inference.

I think I’m starting to understand it: save/load is meant for continuing training, and export/load_learner is for inference, correct?

Exactly
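The two workflows can be sketched as follows. This is a rough sketch assuming fastai v1 and an already-built Learner called learn; the helper names checkpoint_and_resume and export_and_reload and the checkpoint name "stage-1" are hypothetical, and the fastai import is done lazily so the helpers can be defined without fastai installed.

```python
def checkpoint_and_resume(learn, name="stage-1"):
    """Training workflow: write the weights, then reload them to keep training."""
    learn.save(name)  # store the model weights under the Learner's model directory
    learn.load(name)  # restore them into a Learner built with the same data and architecture
    return learn


def export_and_reload(learn, path):
    """Inference workflow: export the Learner, then rebuild it without the training data."""
    from fastai.basic_train import load_learner  # lazy import; assumes fastai v1

    learn.export()             # writes the exported pickle (export.pkl by default)
    return load_learner(path)  # assumption: path is the folder that holds the exported file
```

In a later session, the first helper still needs the databunch to be recreated before the weights are loaded, while the second returns a Learner that can be used for prediction directly.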

Related to this thread: I’m seeing very long save times for
data_lm.save('data_lm.pkl')
in the lesson3-imdb notebook.
In fact, it never seems to finish.

Does anyone have tips on how to debug this?
I haven’t used pickle files in a while, so how can I check that the right Python libraries are installed and, more generally, that something is actually happening while I wait?

It shouldn’t take too long; in my case it takes about ten seconds.

Thanks. For the record, it worked much better after I restarted the instance the next day.

Did you ever get a response on this? I’ve been searching for the same info.

You could save with data_lm.save("lm_toknum.pkl") and load with load_data(path, "lm_toknum.pkl").

Hello,

I am working through the translation lesson (seq2seq) from the 2019 NLP course and I am getting the error below.

How can I fix this error?

And what about loading it afterwards? I mean data = load_data(path).

It means the directory /root/.fastai/data/ does not exist or is not writable. Fix this to be able to save.
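One way to check and fix this from Python, as a minimal sketch using only the standard library: ensure_writable_dir is a hypothetical helper (not part of fastai) that creates the folder if it is missing and verifies write permission. It is demonstrated on a throwaway temporary folder rather than the real data directory.

```python
import os
import tempfile
from pathlib import Path


def ensure_writable_dir(directory) -> Path:
    """Create `directory` if needed and confirm the current user can write to it."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"No write permission for {directory}")
    return directory


# Demonstrated on a throwaway folder; substitute the folder your databunch saves to.
demo_dir = ensure_writable_dir(Path(tempfile.mkdtemp()) / "data")
print(demo_dir.is_dir())
```

If the helper raises PermissionError, the folder exists but belongs to another user or is read-only, and saving to a different location is the simpler fix.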

Thanks for your reply, sgugger!
How can I fix that?

In Colab you would want to set data.path = Path(''). This saves it into the content directory, where you can access it easily and download it if you want to.

Thank you very much!