How to do fast inference on tabular data

Hi, I am having trouble with running inference efficiently on many new records.

After diligently reading docs.fast.ai, I was still unable to solve the following task:

The trouble is that my test set won’t fit into memory, so I have to load it in parts and feed each part to the learner.

The existing fast way requires attaching the test set to the tabular.data.TabularDataBunch (see usage below).

I can’t figure out how to replace the test dataset between iterations.

And predicting rows one by one is not efficient enough.

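from fastai.tabular import *  # fastai v1; provides TabularList, tabular_learner, DatasetType

# Unlabeled test items; attached to the data below via add_test
test = TabularList.from_df(df_test.copy(), path=PATH, cat_names=cat_names, cont_names=cont_names)

# Hold out 10% of the training rows for validation
idx_val = sorted(df_train.sample(frac=0.1, random_state=SEED).index)
data = (TabularList.from_df(df_train, path='.', cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(idx_val)
        .label_from_df(cols=dep_var)
        .add_test(test, label=0)
        .databunch())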

learn = tabular_learner(data, layers=[200, 100], emb_szs=emb_szs, metrics=[accuracy], path='.', emb_drop=0.1, ps=[0.5, 0.5])

learn.fit_one_cycle(cyc_len=CYC_LEN, max_lr=LR)

test_predictions = learn.get_preds(ds_type=DatasetType.Test)

If you have 10 test sets, you can just loop over the data creation (using the i-th test set in add_test), then replace the data object under your learner with learn.data = new_data before running:

test_predictions = learn.get_preds(ds_type=DatasetType.Test)
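
The loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not fastai's own API: predict_over_chunks, build_data and run_preds are hypothetical names introduced here, and the fastai-specific steps are passed in as callables so the control flow can be run with stand-in objects.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, Iterator


def predict_over_chunks(
    learn: Any,
    chunks: Iterable[Any],
    build_data: Callable[[Any], Any],
    run_preds: Callable[[Any], Any],
) -> Iterator[Any]:
    """Yield one prediction result per chunk of the test set.

    Hypothetical helper (not part of fastai). It is a generator, so chunks
    are consumed lazily, one at a time.
    """
    for chunk in chunks:
        # Rebuild the data object with this chunk as the test set, then swap
        # it in under the already-trained learner.
        learn.data = build_data(chunk)
        yield run_preds(learn)


# Stand-in objects so the sketch runs without fastai installed.
fake_learn = SimpleNamespace(data=None)
results = list(
    predict_over_chunks(
        fake_learn,
        chunks=[[1, 2], [3, 4], [5]],
        build_data=lambda chunk: {"test": chunk},
        run_preds=lambda learn: [x * 10 for x in learn.data["test"]],
    )
)
print(results)
```

With a real learner, build_data would repeat the TabularList pipeline from the first post with the current chunk passed to add_test, and run_preds would be lambda learn: learn.get_preds(ds_type=DatasetType.Test). If the test set happens to live in a CSV file (an assumption, the thread does not say), pandas.read_csv with the chunksize argument is one way to produce the DataFrame chunks.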

Sylvain, thanks a lot, it works now!

FYI, for others struggling with this: it did not work in version 1.0.39 (I got the error: RuntimeError: running_mean should contain 36 elements not 20).

When I upgraded to 1.0.40dev0 it worked.


According to the docs, you should be able to call .predict() on a trained learner, passing in a new dataframe containing your cat_names and cont_names variables. However, that doesn’t seem to be working:

    def dl(self, ds_type:DatasetType=DatasetType.Valid)->DeviceDataLoader:
        "Returns appropriate `Dataset` for validation, training, or test (`ds_type`)."
        #TODO: refactor
        return (self.train_dl if ds_type == DatasetType.Train else
                self.test_dl if ds_type == DatasetType.Test else
                self.valid_dl if ds_type == DatasetType.Valid else
                self.single_dl if ds_type == DatasetType.Single else
                self.fix_dl)

The return statement dies with the following error when you pass tabular learner.predict() a pandas dataframe:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

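Please ignore the above. I was calling .get_preds rather than .predict. Since .get_preds expects a DatasetType as its first argument, the pandas object I passed ended up in the ds_type comparisons shown above, hence the error.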