Duplicate Widget

Hey, my question is: in cleaned.csv everything is included, rather than images of each category in separate folders. In that case, how can we recreate the ImageDataBunch from this cleaned.csv file?

Kaggle now seems to be using the latest version, so from_similars works fine. Thanks!

1 Like

You can create the ImageDataBunch from the csv like so (assuming your csv is in a folder named 'data'):
data = ImageDataBunch.from_csv('data',csv_labels='cleaned.csv',label_col=1,valid_pct=0.2,ds_tfms=get_transforms(), size=224, num_workers=0).normalize(imagenet_stats)
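If you're unsure what label_col=1 refers to, it helps to look at the file itself. A minimal sketch of the csv layout ImageCleaner writes (the column names and paths here are illustrative, not taken from an actual cleaned.csv):

```python
import csv, io

# cleaned.csv keeps one row per kept image:
# column 0 = path relative to the data folder, column 1 = label,
# which is why label_col=1 is passed to ImageDataBunch.from_csv.
sample = "name,label\nbears/black/001.jpg,black\nbears/teddy/002.jpg,teddy\n"
rows = list(csv.reader(io.StringIO(sample)))
header, body = rows[0], rows[1:]
print(header)      # ['name', 'label']
print(body[0][1])  # 'black' -> the value picked up via label_col=1
```

Opening your own cleaned.csv the same way is a quick sanity check before building the databunch.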

6 Likes

When I run the code ImageCleaner(ds, idxs, path) in Google Colab, it just keeps running and never ends, and I don’t see the images with the option to delete either. Is it only me, or is the problem that I am running it on Colab?

Google Colab will not run widgets, so you cannot use ImageCleaner within it.

2 Likes

I have implemented my code like this in Google Colab to clean my images. My error rate is only 1.7%, but I am new to data cleaning and want to see how to implement it.
db = (ImageList.from_folder(path)
.split_none()
.label_from_folder()
.transform(get_transforms(), size=224)
.databunch()
)
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)
learn_cln.load('stage-2');
ds, idxs = DatasetFormatter().from_toplosses(learn_cln)
All of the snippets above execute perfectly. But when I try to execute the statement below, it takes a very long time and doesn’t respond to any interrupts. Can anyone tell me whether the statement below is correct, or whether I’ve gone wrong somewhere? Do help me.
ImageCleaner(ds, idxs, path)
@muellerzr You posted that Colab will not run widgets, so we can’t do image cleaning in it. Is that correct?

Yes, it’s a security thing with google. They will not allow external widgets to be used.

1 Like

Ok thanks @muellerzr. Actually I don’t have a GPU. Where can I get a free online GPU? Can you suggest a good option like Colab that allows widgets?

I do not. I don’t usually clean my dataset post-training unless it was downloaded automatically, as having ‘too clean’ data can be bad: if you rely on the model being wrong, delete those images, then try again, you won’t actually improve much if the deleted images were actually good. Does that make sense? In regards to the GPU, no. I do believe you can find free credit codes for Paperspace here on the forum if you dig enough, though.

Ok thanks @muellerzr, Very glad for your reply.

Thanks for the solution.

1 Like

@muellerzr

I just started the course last week, and the material has been updated so that the video is now out of sync with this new widget. Since I’m using Google Colab, I can’t run it, right?

What should I do? Skip this section to deployment? Will I be missing a lot of things?

Thanks

Just skip the section on the widget in steps, but listen to the why. The concept of data cleaning is important. The widget is one tool to do it, but there are other ways, and it’s important to know why.
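To make "other ways" concrete: the core idea behind the widget is just ranking predictions by loss and reviewing the worst ones by hand. A widget-free sketch of that idea (the losses and filenames below are made up; in fastai v1 you would get the real ones from the learner's top losses):

```python
# Rank items by loss, highest first, and pick the worst few to review.
losses = [0.02, 3.1, 0.4, 2.7, 0.1]
paths  = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]

worst_first = sorted(zip(losses, paths), reverse=True)
to_review = [p for _, p in worst_first[:2]]
print(to_review)  # ['b.jpg', 'd.jpg'] -> inspect these, delete if mislabelled
```

This loop of rank, inspect, relabel-or-delete is all the widget automates.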

2 Likes

@muellerzr WOW the reply is fast.

So in the real world… we should actually clean after we download, right?

I check my emails often :wink: In the real world, yes, you have two options. Clean before download (preferable and almost always the best), but if it’s too much to go through, you can quickly use Paperspace or another provider, or even your own local machine if you can, and run ImageCleaner to analyze whether your top losses were real photos or not, and delete them that way. ImageCleaner helps make that process less, well, awful. But it’s certainly not the only way.
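One practical tip when deleting by hand: moving flagged files to a quarantine folder instead of deleting them outright keeps the step reversible. A small sketch, with purely illustrative paths and folder names:

```python
import pathlib, shutil, tempfile

# Set up a throwaway folder with one stand-in "flagged" image.
root = pathlib.Path(tempfile.mkdtemp())
(root / "bears").mkdir()
bad = root / "bears" / "blurry.jpg"
bad.write_bytes(b"")  # placeholder for a flagged image file

# Move it to a quarantine folder rather than deleting it.
quarantine = root / "removed"
quarantine.mkdir()
shutil.move(str(bad), str(quarantine / bad.name))
print(sorted(p.name for p in quarantine.iterdir()))  # ['blurry.jpg']
```

If retraining shows the flagged images were actually fine, you can move them back.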

1 Like

@muellerzr

Oh, I got you… meaning I use Jupyter to run it, since Colab can’t do so,
and upload accordingly later…

That makes sense.

Exactly! :slight_smile: No problem, glad I could help.

1 Like

I am using the widget to clean the images; I have all data as the training set (4107 items). When I ran DatasetFormatter().from_similars(learn), I still ended up with a list of indexes that’s twice as long as the dataset, just matching each image with itself:

ds, idxs = DatasetFormatter().from_similars(learn)
len(idxs) is giving me 8214 items in the list,
but the dataset (ds) has only 4107 items.

Are the indexes doubled? Can anyone please clarify? When I run the ImageCleaner, it keeps showing duplicate images, so I wonder whether the images are getting compared with themselves.

Hey, what version are you using? Are you using it as indicated in the docs?
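If the doubling is from each image being paired with itself, one quick check is to read the indexes as consecutive pairs and see how many are self-matches. Note the pairing convention below is an assumption about how from_similars lays out its output, so verify against your version's docs first:

```python
# Made-up example: 8 indexes read as 4 (image, nearest-neighbour) pairs.
idxs = [0, 0, 1, 5, 2, 2, 3, 7]

pairs = list(zip(idxs[0::2], idxs[1::2]))
# Pairs where an image matched itself carry no information; drop them.
real_pairs = [p for p in pairs if p[0] != p[1]]
print(real_pairs)  # [(1, 5), (3, 7)]
```

If nearly every pair is a self-match, that would explain both the 2x length and the widget showing each image "duplicated".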