Tabular: categorical classes for out-of-memory datasets

Hi, I am training with a tabular dataset that is larger than RAM, so I reload the DataLoaders every n-th epoch.
As the initial setup, I create a TabularPandas object from the first load as normal, and then I load a new subset in a Callback's after_epoch.

This is what the callback does:

    train_ds, valid_ds = prep_sets()  # custom func that returns new datasets
    new_train_to = to.new(train_ds)
    new_valid_to = to.new(valid_ds)
    new_train_to.process()
    new_valid_to.process()

    trn_dl = TabDataLoader(new_train_to.train)
    val_dl = TabDataLoader(new_valid_to.valid)
    dls = DataLoaders(trn_dl, val_dl).cuda()
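
For context, here is a minimal sketch of how that logic could sit inside a fastai Callback and be handed back to the Learner. The class name ReloadChunk and the reload_every argument are made up for illustration; prep_sets and the original TabularPandas to are the ones from the setup above:

    from fastai.tabular.all import *

    class ReloadChunk(Callback):
        "Rebuild the DataLoaders from a fresh data chunk every reload_every epochs."
        def __init__(self, to, reload_every=1):
            self.to, self.reload_every = to, reload_every
        def after_epoch(self):
            if (self.epoch + 1) % self.reload_every: return
            train_ds, valid_ds = prep_sets()        # custom func that returns new datasets
            new_train_to = self.to.new(train_ds)    # reuse the procs fitted on the first load
            new_valid_to = self.to.new(valid_ds)
            new_train_to.process()
            new_valid_to.process()
            trn_dl = TabDataLoader(new_train_to.train)
            val_dl = TabDataLoader(new_valid_to.valid)
            self.learn.dls = DataLoaders(trn_dl, val_dl).cuda()

It would be attached with something like learn.fit(10, cbs=ReloadChunk(to)); self.epoch is the epoch counter the Learner exposes to callbacks.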

The question:

As a matter of fact, some categories vary from one load to another in my case.

If I understand correctly, the new method should apply the categories dictionary to the new subset, but this seems not to be the case: the new subset's columns contain previously unseen categories, yet the learning process does not fail after the reload.

Am I missing something? What is the right way to deal with categories in my case?

Is it enough if I initially create a setup dataset with all the unique categories from the main dataset?

Should I manually supply my own categories dictionary, and if so, how?

Thank you!

That would be the way to do it, even with just one example of each category. That way they're all preprocessed the same, and the model can be built the same way as well.
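
To make that concrete, here is one hedged way such a setup DataFrame could be built: stream the full dataset once and keep one real row for every category value not seen so far. The column names, the big.csv path, and the chunk size are placeholders:

    import pandas as pd

    cat_names = ['store', 'item', 'dow']   # placeholder categorical columns
    seen = {c: set() for c in cat_names}
    rows = []
    for chunk in pd.read_csv('big.csv', chunksize=1_000_000):
        for c in cat_names:
            new_vals = set(chunk[c].unique()) - seen[c]
            if new_vals:
                # keep one full row per category value we haven't seen yet
                rows.append(chunk[chunk[c].isin(new_vals)].drop_duplicates(subset=c))
                seen[c].update(new_vals)
    setup_df = pd.concat(rows).reset_index(drop=True)

Note that FillMissing and Normalize would then also compute their statistics from setup_df, so it may be worth sampling additional rows if those statistics matter for your continuous columns.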

How can I manually supply my categories dict?

There isn't a straightforward way right now from what I can see. I would recommend opening a feature request issue on GitHub: https://github.com/fastai/fastai/issues

In the meantime, a workaround would be to make a base DataFrame that contains all the unique values you expect, preprocess it to make a "base DataLoader", and work from there.
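
A rough sketch of that workaround, using setup_df from above as the base; the continuous columns, target name, batch size, and chunk_df (a later chunk read from disk) are all placeholders:

    from fastai.tabular.all import *

    cont_names = ['price', 'qty']           # placeholder continuous columns
    procs = [Categorify, FillMissing, Normalize]
    splits = RandomSplitter()(range_of(setup_df))
    base_to = TabularPandas(setup_df, procs=procs, cat_names=cat_names,
                            cont_names=cont_names, y_names='target', splits=splits)
    dls = base_to.dataloaders(bs=1024)      # the "base DataLoaders"
    learn = tabular_learner(dls, layers=[200, 100])

    # every later chunk is encoded with the vocabulary fitted on setup_df,
    # e.g. inside the reload callback shown earlier:
    chunk_to = base_to.new(chunk_df)
    chunk_to.process()

Because Categorify's vocabulary is fixed from setup_df, the categories in later chunks get consistent integer codes, and the embedding sizes in the model stay valid across reloads.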

Thank you, Zachary!