Lesson 2 - Creating your own dataset from Google Images

Hey, your Kaggle really helped me. Thank you so much!!!

Suppose one used ImageCleaner once to remove top losses and created a new DataBunch using cleaned.csv, and then wants to use ImageCleaner again to remove duplicates. The question I have is: will the new cleaned.csv contain only the images that survive both the top-loss and the duplicate rounds?

Thanks!
Tim

Allow me to answer my own question: the new cleaned.csv contains only the images that survive successive rounds of ImageCleaner.
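For anyone following along, the round trip looks roughly like this in fastai v1 (a sketch based on the lesson notebook; `learn` and `path` are the Learner and dataset path from earlier cells, and the `stage-2` weights name is illustrative):

from fastai.vision import *
from fastai.widgets import *

# Round 1: flag and delete the top-loss images; this writes cleaned.csv
ds, idxs = DatasetFormatter().from_toplosses(learn)
ImageCleaner(ds, idxs, path)

# Rebuild the data from cleaned.csv before the next round
db = (ImageList.from_csv(path, 'cleaned.csv', folder='.')
        .split_none()
        .label_from_df()
        .transform(get_transforms(), size=224)
        .databunch())
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)
learn_cln.load('stage-2')

# Round 2: flag near-duplicates; cleaned.csv is rewritten again, so it
# only ever lists the survivors of every round so far
ds, idxs = DatasetFormatter().from_similars(learn_cln)
ImageCleaner(ds, idxs, path, duplicates=True)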

:slight_smile:

I created an image dataset of three kinds of metalworking machinery: lathes, surface grinders, and milling machines. After stage 2 there was a 21% error rate, which seemed pretty miserable. When I went through the ‘delete irrelevant images’ section of the notebook, there were surprisingly few errors. Amazing really. The confidence levels seem pessimistic.

When I reached the step where we remove duplicate images, there was some sort of combinatorial explosion where every surface grinder seemed to be flagged as a duplicate of every other surface grinder. Interestingly, none of the flagged duplicates was in fact a duplicate. I could tell the difference pretty trivially. They were certainly similar, but obviously, trivially different - at least to bipedal monkeys. :slight_smile:
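An aside on mechanics: fastai's DatasetFormatter().from_similars flags images whose model activations are close together, not byte-identical files, so a class of very similar-looking machines can all end up flagged. A toy sketch of the idea, with made-up embedding vectors:

import torch
import torch.nn.functional as F

# duplicate detection compares CNN embeddings, not raw pixels;
# two different surface grinders can embed almost identically
emb_a = torch.randn(512)                  # hypothetical features, image A
emb_b = emb_a + 0.05 * torch.randn(512)   # a visually similar image B
print(F.cosine_similarity(emb_a, emb_b, dim=0).item())  # ~0.999, flagged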

Can anyone explain to me what is going on?

Cheers…

Hi banksiaboy, hope all is well!

Do you have any image examples of the classes that you describe in your post?

How many images do you have in each class?

Do you have a copy of the confusion matrix?

Cheers mrfabulous1 :grinning: :grinning:

Hi @mrfabulous1, thanks for showing an interest - much appreciated…

Sample of the unfiltered downloaded pictures:

File counts in the subdirectories of ./data/, showing the number of images in each class:

$ find . -type f | cut -d/ -f2 | sort | uniq -c
      1 cleaned.csv
      1 export.pkl
    160 grinders
    160 lathes
    160 mills
      3 models
      1 urls-metal-lathes.txt
      1 urls-milling-machines.txt
      1 urls-surface-grinders.txt

Confusion matrix:

[confusion matrix screenshot omitted]

I must say I'm still unsure how to interpret the output of the training runs…

Cheers…
:slight_smile: :slight_smile: :slight_smile:

21% is not miserable when you train a model; data can be far worse than that. With some algorithms the dataset is such that the algorithm may never converge to produce a usable result.

I showed your images to 3 random people and none of them could identify any of them with any confidence. You are undoubtedly a domain expert, unlike us bipedal monkeys.

Your model is exhibiting the behavior I see in many models when the images look very similar across classes.

Although I used these machines at school and college, I can see no discernible differences between the three classes.

E.g. when you look at a cat or a dog, there is something cat-ish about a cat and dog-ish about a dog, which everyone can see even if they cannot articulate it.

Your images nearly all contain handles, a bed for the material, a control box, etc.

I had the same problem when building classifiers for certain datasets. E.g. when building a wristwatch classifier, the error rate was approx. 6% for two classes but unusable with 60 classes.

I think this is the point where, in Kaggle competitions, people start using different models, tweaking the code, and doing feature engineering to tune the model some more.

I have been wondering myself, when I have this problem, whether I need more data. E.g. would 1,600 or 16,000 images per class help?

Hope this helps

Cheers mrfabulous1 :smiley: :smiley:


Hi, Thanks for the wonderful course. I am new to machine learning / deep learning.

When trying to download images using the JavaScript code, I am only getting the Google link, like the one below, for all images.

https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSbV4MZb5rhAJVU8vBDdmCXtJ5SsLHaDYOsZg&usqp=CAU

Is there a way to resolve this, please?


Hi there,
I’m dabbling with a classifier for airliners. I downloaded ~400 images each of 13 aircraft types and used a pretrained resnet50 as the classifier. I’m currently reaching 75.9% accuracy.

I’m wondering about two things. First, when I look at the confusion matrix, I see that the errors are not evenly distributed:
[confusion matrix image omitted]
For the Airbus A319/A320 that totally makes sense; these aircraft look extremely similar and are hard to discern, even for me, in many pictures.
For the Airbus A330/A340, however, although they share a lot of features, it should be nearly impossible to confuse them, as one has two engines and the other four.
What could I do to fix this? Just throw more data/images at it? @mrfabulous1, you mentioned feature engineering, is that a viable approach for such a problem, and how would it work?

My second thought: when I plot the top losses, I get this result:


How can the loss be so high, if the predicted category is correct?
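A toy illustration of how that can happen: cross-entropy loss depends on the probability assigned to the correct class, not just on whether the argmax is right, so a correct but unconfident prediction still carries a large loss. With made-up logits for a 13-class problem:

import torch
import torch.nn.functional as F

# the correct class (index 0) only barely wins the argmax, so the
# prediction is "right" but the model is unconfident
logits = torch.zeros(1, 13)
logits[0, 0] = 1.5
target = torch.tensor([0])

print(F.softmax(logits, dim=1)[0, 0].item())   # ~0.27 probability
print(F.cross_entropy(logits, target).item())  # -log(0.27) is about 1.3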

Thank you so much,
Johannes

Hi @aditya_new,

I believe Google changed their search results page a while ago, which makes it a little harder to scrape the images. I don’t know how to fix this directly, but I can recommend a nice tool for scraping Google/Baidu/Bing/Flickr. It’s called fastclass.
This runs on your local machine, however, so you will have to zip the images and upload them to your fastai environment (if you’re running it on a cloud provider).
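If you go that route, unpacking the uploaded archive on the notebook side is just standard-library Python (the archive name and destination below are hypothetical):

import zipfile
from pathlib import Path

archive = Path('data/images.zip')      # hypothetical uploaded file
dest = Path('data/images')
dest.mkdir(parents=True, exist_ok=True)

# extract the class folders next to your notebook
with zipfile.ZipFile(archive) as zf:
    zf.extractall(dest)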

Have fun!
Johannes


Thank you Johannes. Appreciate your help.

Hey @mrfabulous1,
Ha - if you showed the pics to 3 machinists they'd know straight away - I'm not one, but grew up amongst them. It's funny - there were very few classification mistakes. It seems that the system struggled to differentiate individuals but was very good at classification.
Grinders all have a grinding wheel in line with the moving bed, and a tray with ends to catch the waste. Lathes have a horizontal axis; milling machines have a vertical axis and a tool-holder/chuck rather than a grinding disk, with no ends on the tray to catch sparks/waste. But the rest of the bits and bobs are very similar between the mills and grinders. And all cats and dogs have ears and eyes and fur, mostly :slight_smile:

It's very reassuring what you say. I feel encouraged to keep going!
Feature engineering must be the art.

Cheers, and many thanks!

–me


@mrfabulous1 - check this, I tried again with resnet50 on the filtered results. Pretty good, I think. Must have been too lo-res for all that confusing detail…
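For anyone reproducing this, the change is small in fastai v1. A sketch, with the input size and training schedule as assumptions:

from fastai.vision import *

# rebuild the data from the cleaned labels, at a larger input size
data = (ImageDataBunch.from_csv(path, csv_labels='cleaned.csv',
                                ds_tfms=get_transforms(), size=299,
                                num_workers=4)
        .normalize(imagenet_stats))

# swap in the deeper pretrained backbone and retrain
learn = cnn_learner(data, models.resnet50, metrics=error_rate)
learn.fit_one_cycle(5)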

Cheers…


Hi banksiaboy, well done!

It looks like trying a stronger/bigger model is a good step when one experiences this type of problem.

I’ll keep this in my repertoire.

Cheers mrfabulous1 :smiley: :smiley:


Hi johannesstutz, hope you're having a jolly day!!

For a human maybe, but not for an AI model :laughing: :laughing:
You could also try using an even bigger resnet model.

I would definitely use more images; I always use a minimum of 200 per class for my first benchmark model.

I have added two links on feature engineering which I found quite helpful.

Have a jolly day :smiley: :smiley:


Hi and thanks mrfabulous :slight_smile:

I’ll keep experimenting with different models, augmentations, resolutions, …
I have around 400 images per category; maybe I'll add some more and see if it helps.

Regarding feature engineering, thank you for the links. It makes sense to me on tabular data, but I'm not sure how to apply this in computer vision :smiley: For example, method 3 in the second link is edge detection. Isn't that something that a CNN should figure out for itself, via learnt (or pretrained) kernels?

Hi johannesstutz, hope all is well!

The paper below is one that Jeremy uses in one of the videos. I think it's wonderful.

The paper above gives an insight into what CNNs are doing. There may indeed be a layer that is highly correlated with the characteristics of lathes and grinders etc.

Currently, whatever one does to improve a network is a little like a dark art, in that unless the accuracy metric goes up or down, one really doesn't know much about which layers are being used the most. But I would postulate that if you could apply this paper to your network, then make changes and see what actually happens, the results would be highly interesting.

I have seen people choose to use a specific layer before they fine-tune.

I wonder what the inside of your model looks like.
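If you want to peek inside, here is a minimal sketch using a plain PyTorch forward hook on a pretrained backbone (the layer choice and dummy input are illustrative):

import torch
from torchvision.models import resnet34

# grab intermediate activations with a forward hook, in the spirit
# of the layer-visualization paper discussed above
model = resnet34(pretrained=True).eval()
acts = {}
handle = model.layer1.register_forward_hook(
    lambda m, inp, out: acts.update(first_block=out.detach()))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))    # dummy image batch
handle.remove()
print(acts['first_block'].shape)           # torch.Size([1, 64, 56, 56])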


Image from above paper.

Cheers mrfabulous1 :smiley: :smiley:


Hi,
I’m a novice. Need some help please.

I have downloaded bear images from Google and saved them in specific folders (black, grizzly & teddy), renaming the files as blackbear1.jpg, blackbear2.jpg, etc. in the respective folders.

When running the below code:

np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2, ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

I am getting 2 issues (IndexError and Exception) listed below.

IndexError: index 0 is out of bounds for axis 0 with size 0

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)

    1 np.random.seed(4) 

----> 2 data = ImageDataBunch.from_folder(path, train =".", valid_pct=0.2, ds_tfms = get_transforms (), size=224, num_workers=4).normalize(imagenet_stats)

Exception: Can’t infer the type of your targets. It’s either because your data source is empty or because your labelling function raised an error.

Kindly suggest what could be the problem.
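A common cause of this error (not confirmed here, but worth checking) is `path` not pointing at the class folders, or folders full of failed downloads, so the labelling step finds nothing. A quick sanity check in fastai v1, with a hypothetical dataset root:

from fastai.vision import *

path = Path('data/bears')              # hypothetical dataset root

# each class folder should actually contain readable images
for folder in ['black', 'grizzly', 'teddy']:
    print(folder, len(get_image_files(path/folder)))

# drop files that fail to open; broken downloads otherwise leave
# from_folder unable to infer targets
for folder in ['black', 'grizzly', 'teddy']:
    verify_images(path/folder, delete=True)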

Hey harsh, did you ever find a solution to this? I am having the same issue.

Hi all, finally got my first ML web app up and live and wanted to share some of my experience briefly. The app and model themselves aren’t really very special, but I’ll share them here.

(Resnet 34 model trained to identify if a mountain is Mt. Rainier [Washington State], Mt. Hood [Oregon], or Mt. Fuji [Japan]) - Give it a try!
https://mountainid-test.onrender.com/
https://web-app-early.wl.r.appspot.com/

For me, the model building was straightforward. I was able to work through the lesson 1 and 2 notebooks without issue to download images and train my model. The accuracy was ‘ok’ at 80% and I didn’t really focus much on improving it beyond the steps in the lesson. I then both saved (.pth) and exported (.pkl) my model - this becomes important later.

From there, I basically followed the ‘Production’ instructions for the Google App Engine and, again, found them mainly straightforward. I hit a big snag once I started trying to deploy the model, though. I kept getting a massive error with tons of lines of code as it tried to run script.py. I did a bunch of debugging, including:

  • Deploying on Render (didn’t work)
  • Trying to run the original example script.py (that worked)
  • Checking that my model_file_url worked by typing into my browser and seeing if it downloaded directly (it did)

I was starting to sweat a little bit and frantically search through the forums at this point. My eureka moment was when I downloaded the model for the original example from Dropbox and it came back as “stage-2.pth.” That is when I realized that I was using the export .pkl instead of the .pth files.

Did I miss something in the instructions? I was under the impression that we should be using the exported model instead of the saved one.
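For reference, the two artifacts load differently in fastai v1, which is likely where the mismatch bit. A sketch (`path` and `data` as in the lesson notebook):

from fastai.vision import *

# export.pkl (from learn.export()): the whole inference pipeline,
# i.e. transforms, class names and weights, reloadable in one call
learn = load_learner(path)             # expects path/export.pkl

# stage-2.pth (from learn.save('stage-2')): weights only; the same
# DataBunch and Learner must be rebuilt before loading
learn = cnn_learner(data, models.resnet34)
learn.load('stage-2')                  # expects path/models/stage-2.pth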

In any case, that cleared things right up and my web app was up and running. I know this was a long post/story but I’m hoping that my experience can help give someone a kickstart if needed.
