Hi, I am training on a tabular dataset that is larger than RAM, so I reload the DataLoaders
every n-th epoch.
As the initial setup, I create a TabularPandas object from the first load as usual, and then I load a new subset in the after_epoch event of a Callback.
This is what the callback does:
train_ds, valid_ds = prep_sets()  # custom func that returns the new datasets
new_train_to = to.new(train_ds)
new_valid_ds = to.new(valid_ds)
new_train_to.process()
new_valid_ds.process()
trn_dl = TabDataLoader(new_train_to.train)
val_dl = TabDataLoader(new_valid_ds.valid)
dls = DataLoaders(trn_dl, val_dl).cuda()
The question:
In my case, some categories vary from one load to another.
If I understand correctly, the new method should apply the existing category dictionary to the new subset, but this seems not to be the case: the new subset's columns contain previously unseen categories, and the learning process does not fail after the reload.
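To make the situation concrete, here is a small pandas-only sketch (not fastai code) of what encoding a new chunk against a fixed vocabulary could look like. The vocabulary, the column values, and the idea of reserving index 0 for unknown values are made up for illustration; whether Categorify does something equivalent internally is my assumption, not something I have confirmed.

```python
import pandas as pd

# Hypothetical vocabulary learned from the first load; index 0 is reserved for "unknown".
vocab = ["#na#", "red", "green"]

# Hypothetical new chunk containing a category ("blue") that the first load never saw.
new_chunk = pd.Series(["green", "blue", "red"])

# Encode against the fixed vocabulary: pandas gives unseen values the code -1,
# so adding 1 shifts them onto the reserved index 0 instead of raising an error.
codes = pd.Categorical(new_chunk, categories=vocab[1:]).codes + 1
print(codes.tolist())  # [2, 0, 1]
```

If something like this happens, an unseen category would not crash training; it would silently share one code with every other unknown value.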
Am I missing something? What is the right way to deal with categories in my case?
Would it be enough to initially create a setup dataset that contains all the unique categories from the main dataset?
Should I supply my own category dictionary manually, and if so, how?
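For the "setup dataset" idea, this is roughly what I have in mind, again as a pandas-only sketch. `collect_categories` is a hypothetical helper and the two small frames stand in for my real loads. I am assuming, without having verified it, that columns that already carry a categorical dtype would keep these categories when passed to TabularPandas.

```python
import pandas as pd

def collect_categories(chunks, cat_names):
    """Return {column: sorted list of unique non-null values} over all chunks."""
    seen = {name: set() for name in cat_names}
    for chunk in chunks:
        for name in cat_names:
            seen[name].update(chunk[name].dropna().unique())
    return {name: sorted(values) for name, values in seen.items()}

# Two hypothetical loads whose categories only partly overlap.
chunks = [
    pd.DataFrame({"color": ["red", "green"], "size": ["S", "M"]}),
    pd.DataFrame({"color": ["blue", "red"], "size": ["M", "L"]}),
]
all_cats = collect_categories(chunks, ["color", "size"])

# Cast every chunk to a categorical dtype built from the full category list,
# so a given integer code refers to the same category in every load.
fixed = [
    chunk.astype({n: pd.CategoricalDtype(c) for n, c in all_cats.items()})
    for chunk in chunks
]
print(all_cats["color"])                      # ['blue', 'green', 'red']
print(fixed[1]["color"].cat.codes.tolist())   # [0, 2]
```

This needs one extra pass over the whole dataset to gather the unique values, but only the category columns have to be read for that pass.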
Thank you!