Relate original file name to a specific sample

When using the Dataloader API, I often run into badly labeled data (for example when using the very convenient show_batch(), show_results(), or plot_top_losses()). When I see this, I want to know which source image file has the labelling problem, as it may indicate that other images in its vicinity are corrupted too.

I tried to modify those functions to do this, but got stuck, as I’m not sure the original file information is even kept. Does anybody have an idea how I would do that? How can I keep track of the original files that created the samples in the dataloader?

EDIT 1:
I found these answers from previous versions:

EDIT 2:
Using the answer above, I managed to relate the filename by taking the idxs of the top losses:
losses,idxs = interp.top_losses(10)
which gives this list of indices:
[493, 209, 711, 93, 862, 1082, 226, 708, 864, 111]
The interesting (badly labeled) image is the last one, so I use:
dls.valid_ds.items[111]
and get the filename.

The weird part is that when I open the file under the filename I obtained, I don’t see the same image as the last image in interp.plot_top_losses(10)! Actually, none of the images in the top losses plot correspond to the images obtained with this method.

EDIT 3:
My mistake! I used a different dataloader object to get the file names. So all is good and the method above solved my problem. Maybe I should leave this post here in case someone else needs to relate the file names to the results?
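
For anyone who lands here later, the lookup from EDIT 2 can be wrapped in a small helper. This is only a minimal sketch: files_for_indices is a name I made up for illustration (it is not part of fastai), and it assumes the items are an indexable list of file paths taken from the same dataloader the interpretation was built on. The lists below are stand-in data so the snippet runs on its own.

```python
def files_for_indices(items, idxs):
    """Return the entries of `items` at positions `idxs`, in the same order.

    Hypothetical helper, not a fastai function. `idxs` can be a plain list
    of ints or a 1-D tensor of indices such as the second value returned by
    interp.top_losses(10); int() converts each entry before indexing.
    """
    return [items[int(i)] for i in idxs]


# Stand-in data so the sketch runs without fastai. In a notebook you would
# pass the real items instead, e.g. files_for_indices(dls.valid_ds.items, idxs).
example_items = [f"images/img_{n}.jpg" for n in range(5)]
example_idxs = [3, 0, 4]
example_files = files_for_indices(example_items, example_idxs)
print(example_files)
```

The returned list is in the same order as the indices, so its last entry corresponds to the last index in idxs.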


Most folks use

dls.valid_ds.items

I spent a lot of time troubleshooting this too. You should use the data loaders from the learner instead:

learn.dls.valid_ds.items
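
If you are not sure whether a separate dls variable still matches the one inside the learner, a quick comparison of the two item lists can tell you. This is a rough sketch only: same_items is a hypothetical helper (not part of fastai), and it assumes both item collections are plain sequences of file paths. The lists here are stand-ins so it runs by itself.

```python
def same_items(items_a, items_b):
    """True when both sequences list the same files in the same order.

    Hypothetical helper, not part of fastai. Entries are compared as strings
    so that Path objects and plain strings can be mixed.
    """
    a = [str(x) for x in items_a]
    b = [str(y) for y in items_b]
    return a == b


# Stand-in lists so the sketch runs without fastai. In a notebook you would
# call same_items(dls.valid_ds.items, learn.dls.valid_ds.items).
from_learner = ["img_2.jpg", "img_0.jpg", "img_1.jpg"]
from_other = ["img_1.jpg", "img_2.jpg", "img_0.jpg"]
print(same_items(from_learner, from_learner))  # True
print(same_items(from_learner, from_other))    # False
```

If the check comes back False, the indices from top_losses will point at different files in the two lists, which is the kind of mismatch described in the original post.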