How to do fast inference on tabular data

Hi, I am having trouble running inference efficiently on many new records.

After diligently reading docs.fast.ai, I was still unable to solve the following task:

The trouble is that my test set won’t fit into memory, so I have to load it by parts and feed to the learner.

The existing fast way requires including the test set in the tabular.data.TabularDataBunch (see usage below).

I can’t figure out how to swap out the test dataset between inference iterations.

And predicting rows one by one is not efficient enough.

test = TabularList.from_df(df_test.copy(), path=PATH, cat_names=cat_names, cont_names=cont_names)


idx_val = sorted(df_train.sample(frac=0.1, random_state=SEED).index)
data = (TabularList.from_df(df_train, path='.', cat_names=cat_names, cont_names=cont_names, procs=procs)
                            .split_by_idx(idx_val)
                            .label_from_df(cols=dep_var)
                            .add_test(test, label=0)
                            .databunch())

learn = tabular_learner(data, layers=[200, 100], emb_szs=emb_szs, metrics=[accuracy], path='.', emb_drop=0.1, ps=[0.5, 0.5])

learn.fit_one_cycle(cyc_len=CYC_LEN, max_lr=LR)

test_predictions = learn.get_preds(ds_type=DatasetType.Test)

If you have 10 test sets, you can just make a loop over the data creation (using the i-th test set in add_test), then change the data object under your learner with learn.data = new_data before running

test_predictions = learn.get_preds(ds_type=DatasetType.Test)
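The recipe above can be sketched as a loop over chunks of the test DataFrame. This is a minimal sketch, not tested code: `iter_chunks`, `predict_chunked`, and the `chunk_size` default are my own names/assumptions, while the fastai v1 calls (`TabularList.from_df`, `add_test`, `learn.data = ...`, `learn.get_preds`) are exactly the ones used earlier in the thread:

```python
import pandas as pd

def iter_chunks(df, chunk_size):
    """Yield successive row-wise chunks of a DataFrame."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

def predict_chunked(learn, df_train, df_test, idx_val, dep_var,
                    cat_names, cont_names, procs, chunk_size=50_000):
    """Sketch: rebuild the DataBunch with each chunk as the test set,
    swap it under the learner, and collect predictions (fastai v1 API)."""
    # deferred imports so the pure helper above works without fastai
    from fastai.tabular import TabularList
    from fastai.basic_data import DatasetType
    import torch

    all_preds = []
    for chunk in iter_chunks(df_test, chunk_size):
        test = TabularList.from_df(chunk.copy(), path='.',
                                   cat_names=cat_names, cont_names=cont_names)
        new_data = (TabularList.from_df(df_train, path='.',
                                        cat_names=cat_names,
                                        cont_names=cont_names, procs=procs)
                    .split_by_idx(idx_val)
                    .label_from_df(cols=dep_var)
                    .add_test(test, label=0)
                    .databunch())
        learn.data = new_data                     # swap data under the learner
        preds, _ = learn.get_preds(ds_type=DatasetType.Test)
        all_preds.append(preds)
    return torch.cat(all_preds)
```

The chunk size is a memory/throughput trade-off: each iteration only holds one chunk's test DataLoader in memory, while batching within a chunk keeps the GPU busy.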
3 Likes

Sylvain, thanks a lot, it works now!

FYI, for others struggling with this: this did not work in version 1.0.39 (I got the error RuntimeError: running_mean should contain 36 elements not 20).

When I upgraded to 1.0.40dev0 it worked.


According to the docs, one should be able to call .predict() on a trained learner, passing in a novel dataframe row containing your cat_names and cont_names variables. However, that doesn’t seem to be working:

    def dl(self, ds_type:DatasetType=DatasetType.Valid)->DeviceDataLoader:
        "Returns appropriate `Dataset` for validation, training, or test (`ds_type`)."
        #TODO: refactor
        return (self.train_dl if ds_type == DatasetType.Train else
                self.test_dl if ds_type == DatasetType.Test else
                self.valid_dl if ds_type == DatasetType.Valid else
                self.single_dl if ds_type == DatasetType.Single else
                self.fix_dl)

The return statement dies with the following error when you pass a pandas DataFrame to the tabular learner’s .predict():

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Please ignore the above. I was using .get_preds rather than .predict
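For anyone landing here: single-row inference per the docs looks roughly like this. A minimal sketch — `predict_one` is a hypothetical helper name of mine; the assumption is fastai v1’s Learner.predict, which takes one pandas row and returns a (predicted class, class index, probabilities) triple:

```python
def predict_one(learn, row):
    """Single-row tabular inference (fastai v1 sketch).
    `row` is one pandas Series holding the model's cat_names/cont_names
    columns, e.g. df_test.iloc[0]."""
    pred_class, pred_idx, probs = learn.predict(row)
    return pred_class, probs

# usage (assumed names): pred, probs = predict_one(learn, df_test.iloc[0])
```

This is fine for a handful of rows, but as noted above it is far too slow for bulk scoring — use get_preds with a test set for that.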

Pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations and, or, and not, because it is not clear what the result should be.

Example:

5 == pd.Series([12,2,5,10])

The result you get is a Series of booleans, equal in size to the pd.Series on the right-hand side of the expression. Because you are comparing a pd.Series against a single value, you get multiple True and multiple False values, as in the case above. The condition as a whole is then neither True nor False, so it is ambiguous, and you get an error. You need to aggregate the result so that the operation yields a single boolean value. For that, use either any or all, depending on whether you want at least one value (any) or all values (all) to satisfy the condition.

(5 == pd.Series([12,2,5,10])).all()
# False

or

(5 == pd.Series([12,2,5,10])).any()
# True
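Putting it together, a short plain-pandas demo that actually triggers the error inside an if, then resolves it by aggregating:

```python
import pandas as pd

s = pd.Series([12, 2, 5, 10])

try:
    if 5 == s:  # the comparison yields a Series of 4 booleans, not one bool
        pass
except ValueError as err:
    # pandas refuses to pick a single truth value for you
    print("raised:", type(err).__name__)  # raised: ValueError

# aggregating first gives an unambiguous single boolean
print(bool((5 == s).any()))  # True: at least one element equals 5
print(bool((5 == s).all()))  # False: not every element equals 5
```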