This is a duplicate question and somewhat long, so I won't repost it if it's too much for this chat. I've already posted it to Zach on a forum, but I know he's in school and can get busy.
I'm currently in the middle of an attempt to get my colleagues to fall in love with fastai and nbdev, but it's an uphill battle.
My question about the DataLoaders:
- Can we save the DataLoaders' preprocessing without the data? With the data included, the object is very large, and I'm looking to be able to compete with an sklearn Pipeline.
One thing that's convenient, but also a pain, is that I can only save the DataLoader together with its data; until I have a trained model, I have no way to take just the preprocessing steps out of a recently saved TabularDataLoaders. (I think this is really why I'm asking.)
This might be possible in the DataBlock API, but I'm currently in tabular-project mode for work.
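To make concrete what I mean by "compete with the sklearn Pipeline": a fitted pipeline carries only its learned parameters (means, category maps, etc.), never the training rows, so it pickles to a few bytes. A minimal pure-Python sketch of that property (MeanNormalizer is a hypothetical stand-in for a fitted proc, not a fastai or sklearn class):

```python
import pickle

class MeanNormalizer:
    """Hypothetical fitted preprocessing step (stand-in for Normalize or
    StandardScaler): after fit() it keeps only the learned statistic,
    never the training rows."""
    def fit(self, rows):
        self.mean = sum(rows) / len(rows)  # learned state: one float
        return self

    def transform(self, rows):
        return [x - self.mean for x in rows]

train = list(range(1_000_000))        # large training set
proc = MeanNormalizer().fit(train)

blob = pickle.dumps(proc)             # only the fitted state is serialized
print(len(blob) < 1_000)              # → True: bytes, not megabytes

restored = pickle.loads(blob)
print(restored.transform([499999.5])) # → [0.0]
```

That is the artifact I'd like to get out of a TabularDataLoaders: the fitted procs without the million rows they were fitted on.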
dl_test = dl_train.test_dl(X_test, with_labels=False)  # could be True, doesn't matter
This is fine when you train and run inference in the same place and have enough RAM to hold both datasets. However, when using a tabular Learner loaded for inference, I don't believe the training data is still available (as I write this, maybe it is, but I don't think so):
learn_inf = load_learner(os.path.join(model_path, yaml.get('process_name') + yaml.get('dl_model_suffix')),
                         cpu=True)
test_dl = learn_inf.dls.test_dl(df_test, with_labels=False)
Even though the exported fastai model is a little bigger than a typical model such as an XGBoost model, that's completely fine given the functionality it gives me.
Do you know of a way, when building the DataLoaders like this:
dl_train = TabularDataLoaders.from_df(df_transform, procs=procs,
                                      cat_names=cat_vars, cont_names=cont_vars,
                                      y_names=0, y_block=y_block,
                                      valid_idx=splits[1], bs=bs)
and saving it like this:
if not os.path.exists(p):
    os.makedirs(p)
logging.info(f'{fn} getting saved to {p}')
file_path = os.path.join(p, f"{process_name}_{fn}.pkl")
logging.info(f'file saved to {file_path}')
torch.save(dl, file_path)
Rather than saving the entire dataset inside the DataLoaders, is there a way to pop the data out so that what remains behaves like an sklearn Pipeline, and then reuse the code above without the memory overhead and the movement of a large object?
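One lead I've been looking at (not tested end to end, so treat this as a hedged guess): `Learner.export` appears to empty the dls internally via `DataLoaders.new_empty()` before pickling, which is why `load_learner(...).dls.test_dl(df)` still works. If `new_empty()` behaves the same on a bare TabularDataLoaders, then `torch.save(dl_train.new_empty(), file_path)` might give exactly the small, pipeline-like artifact I want. The shape of that pattern, in plain Python (TinyDataLoaders and Scale are hypothetical stand-ins, not fastai classes):

```python
import copy
import pickle

class Scale:
    """Hypothetical fitted proc: multiplies by a learned factor."""
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, x):
        return x * self.factor

class TinyDataLoaders:
    """Toy model of a DataLoaders: big stored rows + small fitted procs."""
    def __init__(self, data, procs):
        self.data = data      # the big part (training rows)
        self.procs = procs    # fitted preprocessing state (small)

    def new_empty(self):
        """Copy that keeps the procs but drops the rows (the pattern
        Learner.export reportedly uses)."""
        empty = copy.copy(self)
        empty.data = None
        return empty

    def test_dl(self, rows):
        """Apply the saved procs to unseen rows, sklearn-Pipeline style."""
        for proc in self.procs:
            rows = [proc(x) for x in rows]
        return rows

dls = TinyDataLoaders(list(range(1_000_000)), [Scale(2)])
blob = pickle.dumps(dls.new_empty())       # serializes only the small shell
print(len(blob) < 1_000)                   # → True
print(pickle.loads(blob).test_dl([1, 2]))  # → [2, 4]
```

If `new_empty()` doesn't preserve the procs for tabular, the fallback I can think of is exporting a throwaway Learner just to get the emptied dls, which feels heavier than it should be.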