Using another dataset for ClassificationInterpretation

Posting this here so others can find it if they run into this issue:

If you pull the latest dev library and want to run ClassificationInterpretation.from_learner() on another dataset, you must first replace your original DataBunch's valid_dl or test_dl (whichever you want, a labeled or unlabeled dataset) with the DataLoader from your new one. Otherwise you will get an error mentioning fix_ds.

For example, where data_test is my labeled test set:

Error:
cls = ClassificationInterpretation.from_learner(learn, data_test.train_dl)

No error:

learn.data.valid_dl = data_test.train_dl

cls = ClassificationInterpretation.from_learner(learn, ds_type=DatasetType.Valid)
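For anyone who wants the full pattern in one place, here is a minimal end-to-end sketch (assuming from fastai.tabular import *, a trained learn, and that data_test is a labeled DataBunch built with the same preprocessing as the training data, which is covered further down this thread). interp is just a variable name for illustration:

# Point the learner's validation DataLoader at the labeled test data
learn.data.valid_dl = data_test.train_dl

# ClassificationInterpretation now reads from the swapped-in DataLoader
interp = ClassificationInterpretation.from_learner(learn, ds_type=DatasetType.Valid)
interp.confusion_matrix()        # or interp.plot_confusion_matrix()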


Though actually I am having issues with this. I have one dataset where the error rate through learn.predict() is 1.58%, but through the above method it is 91.7%. This specifically happens when there are more than two classes; mine has 15. @sgugger do you have a suspicion as to what could be going on here? When I do it on the Titanic or the ADULT dataset it works perfectly fine.

If anyone has a dataset they know the outcomes of and could verify they see the same issue when there are more than two classes, I would appreciate it.

You very likely have a different mapping for your second data_test. Make sure to use your classes from the first data object.
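A quick way to check this (a sketch, assuming both DataBunches were labeled with the data block API so their y is a CategoryList exposing a classes attribute) is to print the two mappings side by side:

# The order of classes determines which integer index each label maps to,
# so any difference here means predictions are decoded against the wrong labels.
print(learn.data.train_ds.y.classes)
print(data_test.train_ds.y.classes)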

Would this be shown in the order of .unique() in the train and test sets?

Such as:
train[var].unique() vs test[var].unique() being in different orders?

Edit: ah, I see what you mean now, I believe. There were only 13 different outcomes in my test set vs my training set. Is there a way to override that mapping?

You can always pass classes=my_classes either in a DataBunch factory method or in the labeling step of the data block API.
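In case it helps anyone reading later, the classes list can simply be derived from the training dataframe and reused everywhere (a sketch; train and dep_var are the same names used in the code below):

# Build the label list once from the training data so the class-to-index
# mapping is identical in every DataBunch that reuses it.
classes = sorted(train[dep_var].unique().tolist())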


Thanks Sylvain, it is still not quite working; now it is 50% wrong.

data = (TabularList.from_df(train, path=nbPath, cat_names=cat_var, cont_names=cont_var,
                            procs=procs)
        .split_by_rand_pct(0.2)
        .label_from_df(dep_var, classes=classes)
        .databunch())

data_test = (TabularList.from_df(test, path=nbPath, cat_names=cat_var, cont_names=cont_var,
                                 procs=procs)
             .split_none()
             .label_from_df(dep_var, classes=classes)
             .databunch())

Am I generating these databunch objects correctly?

Oh you’re not applying the same preprocessing. You should pass processor=data.processor in the second call to make sure they’re treated the same way (otherwise you don’t map categorical variables to the same indices, and don’t normalize continuous variables the same way).
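To see why sharing the processor matters on the categorical side, here is a tiny illustration in plain pandas (not fastai internals, just the general idea): when the encoding is fitted separately on train and test, the same category can end up with a different integer code.

import pandas as pd

# Fitted on train: three categories get codes 0, 1, 2
train_col = pd.Categorical(['cat', 'dog', 'fish'])
# Fitted on test: 'cat' never appears, so 'dog' and 'fish' shift to 0 and 1
test_col = pd.Categorical(['dog', 'fish'])

print(dict(zip(train_col, train_col.codes.tolist())))  # {'cat': 0, 'dog': 1, 'fish': 2}
print(dict(zip(test_col, test_col.codes.tolist())))    # {'dog': 0, 'fish': 1}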


That worked perfectly! Thank you very much Sylvain!!!

If anyone is confused, this is how you pass it in:

data_test = (TabularList.from_df(test, path=nbPath, cat_names=cat_var, cont_names=cont_var,
                                 procs=procs, processor=data.processor)
             .split_none()
             .label_from_df(dep_var, classes=classes)
             .databunch())