How can I get the index of the original observation when loading from DataFrame?

georey · August 22, 2020, 5:47pm

For the lesson 1 homework, I have made a dataset to distinguish between bees, wasps, and other.

After training, I use:

losses,idxs = interp.top_losses()

and I discover that there is some mislabelled data in my dataset. I try:

idxs[4]

it returns:

tensor(2858)

however, when I go to the original DataFrame:

df_labels.iloc[2858]

I get an unexpected observation:

path                   bee2\P1300-16r.jpg
is_bee                                  1
is_wasp                                 0
is_other                                0
label                                 bee
Name: 2858, dtype: object

This is incorrect: according to “plot top losses”, this should be an “other” and not a “bee”. I have verified by hand that the offending “other” image is indeed in the source data, but I need a correct index to it so that I can clean it up.

I guess that the observations are shuffled during training, which is OK, but how do I get the original index of the observation?

I need that to clean up the source dataset - it indeed contains some mislabelled data.

PalaashAgrawal · August 22, 2020, 7:02pm

@georey
Hey! First off, you are doing a classification problem. You’re using the Interpretation class. You should be using the ClassificationInterpretation class.
Anyways! Once you use that class, you’ll probably use the plot_confusion_matrix method to visualize what your model isn’t getting right.
Do tell me if I misunderstood anything.
Hope this helps. Cheers!

georey · August 22, 2020, 7:27pm

OK, I got it. When calling interp = ClassificationInterpretation.from_learner(learn), the Interpretation() class initializer stores its own copy of the dataset in the .dl property. So I can now say:

losses,idxs = interp.top_losses()
interp.dl.items

and that gives me the shuffled DataFrame that was used to compute the losses when the ClassificationInterpretation was created, so I can take the 5 worst observations this way:

interp.dl.items.iloc[idxs[0:5]]
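
To go from those rows back to the original DataFrame, the index labels of the selected rows may be enough. The following is only a sketch using plain pandas, not fastai itself: valid_items is a hypothetical stand-in for interp.dl.items, the small df_labels and the values in idxs are made up, and it assumes the items DataFrame keeps the index labels of the original df_labels (pandas keeps them by default when rows are selected or shuffled).

```python
import pandas as pd

# Made-up stand-in for the original labels DataFrame (df_labels in the thread).
df_labels = pd.DataFrame({
    "path": [f"img_{i}.jpg" for i in range(10)],
    "label": ["bee", "wasp", "other", "bee", "wasp",
              "other", "bee", "wasp", "other", "bee"],
})

# Hypothetical stand-in for interp.dl.items: a reordered subset of the rows.
# Assumption: the original index labels (7, 2, 9, 4) are preserved.
valid_items = df_labels.iloc[[7, 2, 9, 4]]

# Stand-in for idxs from interp.top_losses(): positions within valid_items.
# With real tensors, idxs.tolist() would turn them into plain ints first.
idxs = [2, 0]

# Select by position in the evaluated subset, then read off the index labels.
worst = valid_items.iloc[idxs]
original_index = worst.index.tolist()

# Label-based lookup in the original frame returns the same rows.
same_rows = df_labels.loc[original_index].equals(worst)

# Position-based lookup in the original frame returns a different row,
# which is the mismatch seen with df_labels.iloc[2858].
wrong_row = df_labels.iloc[idxs[0]]
```

Here original_index holds the labels to use with df_labels.loc[...], and worst["path"] gives the file paths for a relabelling tool. If the items DataFrame had its index reset somewhere along the way, this would not hold, and matching on the path column would be the safer route.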

georey · August 22, 2020, 7:28pm

Yup, I did that, but the question is how to get the actual index and file name and the path of the item that I spot that is mislabelled so that I could make a nice-to-use relabelling tool.

Anyways, I went through the fast.ai source code and I figured it out.