How to save Dataloaders?

Hello,

In fastai v1, it was possible to load and save a DataBunch.

I do not see the same methods for DataLoaders in fastai v2. However, they would be very useful, particularly for text DataLoaders, where the training and validation tuples (x, y) are pre-processed when the DataLoaders are created.

As an example, you can see the following code in the paragraph “Preparing the data” of the Transformers tutorial:

bs,sl = 8,1024
dls = tls.dataloaders(bs=bs, seq_len=sl)

If the training and validation datasets are huge, creating the DataLoaders can take a long time… It would be great to be able to save and load them.

What do you think?


You can simply do:

torch.save(dls, 'fname.pkl')

And then it’ll work. Just do a torch.load() to bring it back in (this is actually most of what export does anyways :slight_smile: )
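
For reference, here is a minimal sketch of that save/load round trip. It assumes a plain PyTorch `TensorDataset` as a stand-in for `dls`, so that it runs without fastai, and a temporary directory for the file; neither comes from the post above. The `weights_only=False` argument is also an assumption about your PyTorch version: recent releases need it to unpickle arbitrary Python objects, while very old releases do not accept it.

```python
import os
import tempfile

import torch
from torch.utils.data import TensorDataset

# Stand-in for `dls` (assumption): torch.save can store any picklable object.
x = torch.arange(12, dtype=torch.float32).reshape(6, 2)
y = torch.tensor([0, 1, 0, 1, 0, 1])
dataset = TensorDataset(x, y)

# Save the whole object to disk.
path = os.path.join(tempfile.mkdtemp(), "fname.pkl")
torch.save(dataset, path)

# Load it back. weights_only=False permits arbitrary pickled objects,
# so only use it on files you created yourself or otherwise trust.
restored = torch.load(path, weights_only=False)
```

Because this relies on pickle, the classes of everything stored inside the object have to be importable in the session that loads the file.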


Thank you very much Zachary!
PS: you should apply for the free Sylvain Gugger seat :wink: cc @jeremy


@muellerzr: your code works like a charm in the notebook 10_nlp.ipynb of fastbook, but not in the Transformers tutorial by @sgugger (see screenshot).

How to adapt your code to this case? Thanks.

Note: same problem with learn.export() but not with learn.save() which works.

Hmmmm… that would be up to Sylvain and how he has the transformers set up… I haven’t looked into that yet (neither pickling in much depth nor transformers), though it looks like that was an old issue with HF: https://github.com/huggingface/tokenizers/issues/87

Yes, not all tokenizers from Hugging Face are serializable. It should be fixed in the next release from what I’ve followed.
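
A quick way to find out in advance whether an object can be saved this way is to try pickling it in memory. The sketch below is only an illustration: `is_picklable` is a hypothetical helper name, not part of fastai or Hugging Face, and it only checks that serialization succeeds, not that loading will work in another session.

```python
import pickle


def is_picklable(obj):
    """Return True if `obj` can be serialized with pickle, else False."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # Typical failures: lambdas, local functions, objects wrapping native handles.
        return False


# A plain dict pickles fine; a lambda does not.
ok = is_picklable({"bs": 8, "sl": 1024})
not_ok = is_picklable(lambda x: x)
```

Running the check on the individual pieces (tokenizer, transforms, datasets) can help narrow down which one blocks the save.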


Thanks Sylvain and Zachary.

@muellerzr A question on DataLoaders. One thing that is nice, but also a pain, is that I can only save the DataLoaders together with the data. Until I have a model, I have no way to reuse the preprocessing steps from a recently saved TabularDataLoaders. (I think this is why I am asking :slight_smile:)

This might be a thing in the DataBlockAPI, but I am currently in a tabular project mode for work.

dl_test = dl_train.test_dl(X_test, with_label=False) # could be true doesn't matter

This is fine when you train and run inference in the same place and have enough RAM to hold both datasets. However, when using a tabular learner I don’t believe the training data is available. As I write this, maybe it is, but I don’t think so.

learn_inf = load_learner(os.path.join(model_path, yaml.get('process_name') + yaml.get('dl_model_suffix')),
                             cpu=True)
test_dl = learn_inf.dls.test_dl(df_test, with_label=False)

Even though the fastai model is a little bigger than a typical model such as an XGBoost model, that is completely okay for the functionality it gives me.

Here is how I currently create and save the DataLoaders:

dl_train = (TabularDataLoaders.from_df(df_transform, procs=procs,
                                       cat_names=cat_vars, cont_names=cont_vars,
                                       y_names=0, y_block=y_block,
                                       valid_idx=splits[1], bs=bs))
os.makedirs(p, exist_ok=True)  # create the output directory if it is missing
file_path = os.path.join(p, f"{process_name}_{fn}.pkl")
logging.info(f'{fn} getting saved to {p}')
torch.save(dl_train, file_path)  # originally `dl`; assuming it refers to dl_train above
logging.info(f'file saved to {file_path}')

Rather than saving the entire dataset inside the DataLoaders, is there a way to pop out the data, so that the saved object works like an sklearn pipeline and the steps above can be reused without the memory overhead and the cost of moving a large object around?

The code for Dataloader is not working.



I get the same error.

When I try to load the learner with learn.load() on a .pth file, I get an error. This time I have different data with the same image dimensions, but the number of classes in the DataLoaders has dropped from 3 to 2.
Can someone tell me how I can load my model for inference on a completely new dataset?

learn.load("/content/drive/MyDrive/trained_model",with_opt=True)

This is the error that I get:
RuntimeError Traceback (most recent call last)
in
----> 1 learn.load("/content/drive/MyDrive/trained_model",with_opt=True)

1 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1496         if len(error_msgs) > 0:
   1497             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1498                 self.__class__.__name__, "\n\t".join(error_msgs)))
   1499         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1500

RuntimeError: Error(s) in loading state_dict for RetinaNet:
size mismatch for classifier.3.weight: copying a param with shape torch.Size([3, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([2, 128, 3, 3]).
size mismatch for classifier.3.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).
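
The size mismatch says that the checkpoint's final classifier layer was trained for 3 classes while the current model expects 2, so those two tensors cannot be copied. One common workaround in plain PyTorch is to load only the tensors whose shapes match and leave the new head at its fresh initialization, to be fine-tuned on the new data. The sketch below illustrates this under assumptions: `make_model` is a tiny hypothetical stand-in, not the RetinaNet from the traceback, and the checkpoint is taken straight from a model in memory rather than from the .pth file.

```python
import torch
from torch import nn


def make_model(n_classes):
    # Hypothetical stand-in: a shared body plus a class-dependent head.
    return nn.Sequential(
        nn.Conv2d(3, 128, 3, padding=1),
        nn.ReLU(),
        nn.Conv2d(128, n_classes, 3, padding=1),
    )


old = make_model(3)           # plays the role of the 3-class checkpoint
new = make_model(2)           # the model built for the 2-class data
checkpoint = old.state_dict()
target = new.state_dict()

# Keep only the entries that exist in the new model with an identical shape.
compatible = {
    k: v for k, v in checkpoint.items()
    if k in target and v.shape == target[k].shape
}
skipped = sorted(set(checkpoint) - set(compatible))

# strict=False tolerates the head weights that were filtered out.
result = new.load_state_dict(compatible, strict=False)
print("skipped:", skipped)
```

A file written by learn.save may wrap the weights in a dictionary alongside the optimizer state, so it is worth inspecting what torch.load returns before filtering. The optimizer state from the old run is also unlikely to fit the new head, so loading with with_opt=False is probably the safer choice here.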