Hey, my question is: in cleaned.csv everything is included, rather than the images of each category sitting in separate folders. In this case, how can we recreate the ImageDataBunch from this cleaned.csv file?
Kaggle now seems to be using the latest version, so from_similars works fine. Thanks!
You can create the ImageDataBunch from the csv like so (assuming your csv is in a folder named 'data'):
data = ImageDataBunch.from_csv('data', csv_labels='cleaned.csv', label_col=1, valid_pct=0.2, ds_tfms=get_transforms(), size=224, num_workers=0).normalize(imagenet_stats)
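Before building the databunch, it can help to sanity-check that cleaned.csv has the filename/label columns you expect. A minimal pure-Python sketch, assuming the fastai-style layout of column 0 = filename and column 1 = label (adjust to your actual file):

```python
import csv
from collections import Counter

def label_counts(csv_path):
    """Count images per label in a cleaned.csv with a header row,
    column 0 = filename and column 1 = label (an assumption)."""
    counts = Counter()
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            counts[row[1]] += 1
    return counts
```

If the counts per label look wrong, the `label_col` argument to `from_csv` probably needs adjusting.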
When I run the code ImageCleaner(ds, idxs, path) in Google Colab, it just keeps running and never finishes, and I don't see the images with the option to delete either. Is it only me, or is the problem that I am running it on Colab?
Google Colab will not run widgets, so you cannot use ImageCleaner within it.
I have implemented my code like this in Google Colab, to clean my images. Actually my error rate is only 1.7%, but I am new to data cleaning and want to see how to implement it.
db = (ImageList.from_folder(path)
.split_none()
.label_from_folder()
.transform(get_transforms(), size=224)
.databunch()
)
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)
learn_cln.load('stage-2');
ds, idxs = DatasetFormatter().from_toplosses(learn_cln)
All the snippets above execute perfectly, but when I try to execute the statement below, it takes a very long time and does not respond to any interrupts. Can anyone tell me whether the statement below is correct, or whether I am wrong somewhere? Do help me.
ImageCleaner(ds, idxs, path)
@muellerzr You posted that Colab will not run widgets, so we can't do image cleaning in it. Is that correct?
Yes, it's a security thing with Google. They will not allow external widgets to be used.
Ok, thanks @muellerzr. Actually I don't have any GPU. Where can I enable a free online GPU? Can you suggest any good option like Colab that allows widgets?
I do not. I don't usually clean my dataset post-training unless it was downloaded automatically, as having "too clean" data can be bad: if you rely on the model being wrong, delete those images, and then try again, you won't actually improve much if the deleted images were in fact good images. Does that make sense? In regards to the GPU, no, but I believe you can find free credit codes for Paperspace here on the forum if you do enough digging.
Thanks for the solution.
I just started the course last week, and the material has been updated so the video is now out of sync with this new widget. Since I'm using Google Colab, I'm not allowed to run it, right?
What should I do? Skip this section to deployment? Will I be missing a lot of things?
Thanks
Just skip the steps in the section on the widget, but listen to the why. The concept of data cleaning is important. The widget is one tool for doing it, but there are other ways, and it's important to know why.
@muellerzr Wow, the reply is fast.
So in the real world... we should actually clean after we download, right?
I check my emails often. In the real world, yes, you have two options. Clean before download (preferable and almost always the best), but if it's too much to go through, you can quickly use Paperspace or another provider, or even your own local machine if you can, and run ImageCleaner to analyze whether your top losses were real photos or not, and delete them that way. ImageCleaner helps make that process less, well, awful. But it's certainly not the only way.
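If widgets are off the table (as on Colab), the same top-loss review can be done by hand: rank items by loss and inspect the worst file paths yourself. A minimal widget-free sketch; the filenames and loss values below are placeholders, not real fastai output:

```python
def worst_items(filenames, losses, n=5):
    """Return the n (filename, loss) pairs with the highest loss,
    highest first -- candidates for manual inspection or deletion."""
    ranked = sorted(zip(filenames, losses), key=lambda p: p[1], reverse=True)
    return ranked[:n]

files = ['img0.jpg', 'img1.jpg', 'img2.jpg', 'img3.jpg']
losses = [0.02, 1.7, 0.4, 0.9]
print(worst_items(files, losses, n=2))  # [('img1.jpg', 1.7), ('img3.jpg', 0.9)]
```

With real data you would feed in the validation losses from `learn.get_preds(with_loss=True)` and the dataset's item paths, then open the flagged files in any viewer.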
Oh, I got you... meaning I use Jupyter to run it, since Colab can't do so, and upload accordingly later...
That makes sense.
Exactly! No problem, glad I could help.
I am using the widget to clean the images, and I have all data in the training set (4107 items). When I ran DatasetFormatter().from_similars(learn),
I still ended up with a list of indexes that's twice as long as the dataset, just matching each image with itself:
ds, idxs = DatasetFormatter().from_similars(learn)
len(idxs) is giving me 8214 items in the list,
but the dataset (ds) is giving me 4107 items only.
Are the indexes doubled? Can anyone please clarify? When I run ImageCleaner, it keeps showing duplicate images, so I am wondering whether each image is being compared with itself.
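I can't confirm from the thread alone why idxs is exactly twice the dataset size, but one quick way to test the self-match hypothesis is to treat the flat index list as consecutive (image, most-similar-image) pairs and see what survives once self-matches are dropped. This pairing layout is an assumption about what from_similars returns, so treat it as a diagnostic sketch:

```python
def similar_pairs(idxs):
    """Interpret a flat index list as consecutive (i, j) pairs,
    drop self-matches (i == j), and keep each unordered pair once."""
    pairs = set()
    for i, j in zip(idxs[0::2], idxs[1::2]):
        if i != j:
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

# 8 indexes over a 4-item dataset: two self-matches plus one real
# duplicate pair reported in both directions
print(similar_pairs([0, 0, 1, 3, 2, 2, 3, 1]))  # [(1, 3)]
```

If nearly everything vanishes after filtering, the doubled length really was each image paired with itself; if real pairs remain, those are the duplicates worth reviewing.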