02 production - cannot use cleaner to clean data set

Hello,
I managed to train the ResNet18 as suggested and get roughly 2% classification error. I then try to use the cleaner

cleaner = ImageClassifierCleaner(learn)
cleaner

and retrieve the indices of the images to be moved/deleted.
I then run

for idx in cleaner.delete(): cleaner.fns[idx].unlink()
for idx, cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)

I then rerun

bears = bears.new(
item_tfms=RandomResizedCrop(224, min_scale=0.5),
batch_tfms=aug_transforms())
dls = bears.dataloaders(path)
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

And when inspecting the highest losses I find that the images I deleted/reclassified are still in the dataset, so my model does not improve. How can I properly move/delete the images in the dataset? The book just says ‘retrain the model and see if your accuracy improves’, but I did not succeed.
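To make this concrete, here is roughly how one could check whether the cleanup loops actually touch the files on disk (sketch only; path is the dataset folder from the book, with one sub-folder per class):

import shutil
from fastai.vision.all import get_image_files

# count the image files on disk before cleaning
before = len(get_image_files(path))

# the cleanup suggested in the book
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
for idx, cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)

# count again; the difference should match the number of deletions
after = len(get_image_files(path))
print(f"{before - after} files removed from disk")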

Hi @tonio,
I haven’t used this myself yet, but my guess is that cleaner.fns[idx].unlink() removes the link to the file in your bears object, and when you run bears.new(..) you list the unlinked ones again. Try it without bears.new(..) and see if this helps.
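Something like this, I mean (untested sketch, reusing the bears DataBlock exactly as it was defined before the cleaning step):

# rebuild the DataLoaders from the existing DataBlock, without calling bears.new(..)
dls = bears.dataloaders(path)
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)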

The changed ones are a mystery, though. They should not show up with the wrong class, since they were actually moved on disk.

Cheers

Hello @JackByte, thank you for your answer. I tried, but it did not work either. I moved on with the course and will come back to this when we use the cleaner again, to gain more insight. Thanks anyway.

I am getting an error when I run:

for idx in cleaner.delete(): cleaner.fns[idx].unlink()

I created a notebook on Kaggle. This is the error I’m getting:


OSError                                   Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 for idx in cleaner.delete(): cleaner.fns[idx].unlink()

/opt/conda/lib/python3.7/pathlib.py in unlink(self)
   1307         if self._closed:
   1308             self._raise_closed()
-> 1309         self._accessor.unlink(self)
   1310
   1311     def rmdir(self):

OSError: [Errno 30] Read-only file system: '/kaggle/input/bears-dataset-fastai/bears/black_bear/th (1).jpeg'

Any idea how to fix this?

Hi @tonio, I was having a slightly different problem with removing the unwanted pictures, and I think I figured out your issue instead.

To understand what is going on, the book shows us what get_image_files does by having us store its output in a variable to play with.

fns = get_image_files(path)

Later on, when we define the parameters of our DataBlock (bears), get_items is set to the function get_image_files, not to the variable fns, which would ultimately amount to the same thing.
That is, until we run these:

for idx in cleaner.delete(): cleaner.fns[idx].unlink()

for idx, cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)

I am not certain, because my own issue is preventing me from testing this, but I think you are removing the links to the pictures you want gone from the fns list, while your DataBlock is still calling get_image_files, which then grabs all the files in the folders. Try putting get_items=fns in your DataBlock (roughly as in the sketch below) and see if that helps.
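Something along these lines (untested sketch; the DataBlock arguments are copied from the chapter, and since get_items expects a callable, the pre-built list is wrapped in a small lambda):

from fastai.vision.all import *

# fns is assumed to be the cleaned-up file list (re-built or filtered after the cleaner step)
bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=lambda source: fns,   # ignore the source path and use the pre-built list
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms())
dls = bears.dataloaders(path)
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)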

Hi @Rezkin, that’s something that needs a workaround on Kaggle. The input folder is read-only, so you can’t remove anything there. Think of it as a shared folder that is mounted.

However, you can copy the content (with some size limitations) to your working directory. And in there you have full power :muscle: :muscle: :muscle:
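For example, something like this (just a sketch; the source path matches the one in your traceback, so adjust it to your dataset):

import shutil
from pathlib import Path

# /kaggle/input is read-only, /kaggle/working is writable
src = Path('/kaggle/input/bears-dataset-fastai/bears')
dst = Path('/kaggle/working/bears')

# copy the dataset into the writable working directory
shutil.copytree(src, dst)

# point the rest of the notebook (and the cleaner) at the writable copy
path = dst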
