Why Data Quality is Important | A Brief Overview


I recently wrote a post about why data quality is important, as a high-level overview, after my experience building a classifier that classifies images of cars by brand.

Despite using a 101-layer neural network, 20 epochs, and 5,000 images, I still had an error rate of 17.4%.

You can have a read through my post here:

If you have any comments, questions, corrections, or suggestions, please do post them!

I would recommend adding a link to your post with the source code and data you used to reproduce those results. It would be hard for anyone to comment on the results otherwise.

Generally speaking, to narrow down where your model is struggling, you should do some error analysis. Select the images in the validation set that your model gets wrong and take note of what you think caused the model to mislabel each one: multiple cars, a car only partially in frame, rain, etc.

Depending on what you discover during error analysis, you might take different routes to improve your model.
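In fastai you can do this directly with `ClassificationInterpretation.from_learner(learn)` and `interp.plot_top_losses(9)`. The underlying idea is just a sort by per-example loss; a minimal plain-Python sketch (the losses and paths here are made up for illustration):

```python
# Minimal sketch of "look at the worst errors first": given per-example
# losses and file paths from a validation pass, sort descending by loss
# and review the top few by hand.
def top_losses(losses, paths, k=9):
    """Return the k (loss, path) pairs with the highest loss."""
    ranked = sorted(zip(losses, paths), reverse=True)
    return ranked[:k]

# Toy example: three validation images with their losses.
worst = top_losses([0.1, 2.3, 0.7], ["a.jpg", "b.jpg", "c.jpg"], k=2)
print(worst)  # [(2.3, 'b.jpg'), (0.7, 'c.jpg')]
```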

I also wouldn’t jump directly to ResNet101 with a dataset of that size. But again, it is hard to comment without knowing what modeling decisions you made (e.g., data augmentation, loss, architecture) and what the data look like.

Good effort, though! That’s an interesting problem. I would assume the model could just focus on the brand logo to separate different car manufacturers. Perhaps use Grad-CAM or a similar tool to visualize which parts of the image the model focuses on.


Hey! Thanks for the comment and feedback!

My model is based on what I have learned so far from Chapter 2 of the fastai course, which is why I jumped to ResNet101 :smile:. I did it incrementally, though: I started with ResNet18, then ResNet34, and so on up to ResNet101. With ResNet18, I had an error rate of about 24.4%, so the deeper models were not a substantial improvement.

I’m not able to share the dataset since it is quite large, but if you want to dabble in the same vein, you could scrape images off DuckDuckGo. You can view the code at the bottom of this reply. I should remember to link the code next time as well.

I classified the following car brands:

  • Mercedes-Benz
  • Audi
  • Aston Martin
  • Ford
  • Lamborghini
  • Ferrari
  • Chevrolet
  • Alfa Romeo
  • Renault
  • Jaguar
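As a rough sketch of how the scraping step could go, one search query per brand; the query-building part is plain Python, while the actual download step is only sketched in comments, since it depends on a third-party package whose API changes between versions:

```python
# Build one DuckDuckGo image-search query per car brand.
brands = [
    "Mercedes-Benz", "Audi", "Aston Martin", "Ford", "Lamborghini",
    "Ferrari", "Chevrolet", "Alfa Romeo", "Renault", "Jaguar",
]

def brand_queries(brands, suffix="car"):
    """One query string per brand, e.g. 'Audi car', to keep results on topic."""
    return [f"{b} {suffix}" for b in brands]

queries = brand_queries(brands)
print(queries[:2])  # ['Mercedes-Benz car', 'Audi car']

# Hypothetical download step (requires the third-party `duckduckgo_search`
# package; its API has changed across versions, so treat this as a sketch):
#
#   from duckduckgo_search import DDGS
#   with DDGS() as ddgs:
#       for q in queries:
#           urls = [r["image"] for r in ddgs.images(q, max_results=100)]
#           # ...download each URL into a folder named after the brand...
```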

That’s a good approach for error analysis: look through the photos with the highest loss and see what’s common in them. I’ll keep that in mind for next time.

I thought that the model would overfit by focusing on the car logos! But that was not the case. That said, I don’t think a model that recognizes car brands by the logo alone would be a good idea, as it would not generalize to the same car positioned differently. For example, the side profile of a car would not have the logo visible.

Is Grad-CAM easy to use? I have not delved too deeply into creating models yet, but I could give it a shot if it is simple enough and I’ve got the time.

Relevant portions of the model code:

from fastai.vision.all import *

# DataBlock.
cars = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,  # collect image files under `path`
    get_y=parent_label,         # label = parent folder name (one folder per brand)
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
)

# Dataloaders.
dataloaders = cars.dataloaders(path, bs=32)

# Learner.
learn = cnn_learner(dataloaders, resnet101, metrics=error_rate)

Hi @ForBo7, you can share the notebook and data (as a custom dataset) on Kaggle, I think.

Hey Mike!

Oh yes, I can! They have a surprisingly large free quota too: 107.37 GiB. I’ll try to upload the dataset there then.

Apologies for the late reply. I was on holiday and spent very little time checking the forum/internet.

I didn’t mean to suggest we should make the model focus on the car logo. As you mentioned, that might create issues in situations where the logo is not visible. What I was trying to explain is that error analysis and Grad-CAM can help us understand what the model is focusing on, which gives us a better idea of how to improve our model. For instance, if the model is focusing exclusively on the logo, maybe we could try to add more images where the logo is not visible, or even obscure it intentionally in some pictures.

pytorch-grad-cam is quite easy to use. Here is a Colab notebook showing how to use it with fastai.
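For intuition, here is what Grad-CAM computes under the hood, hand-rolled on a tiny random CNN. This is purely illustrative (not your trained model, and not the pytorch-grad-cam package itself, which wraps all of this up for you):

```python
import torch
import torch.nn as nn

# Tiny stand-in CNN with random weights, purely for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

# Capture the last conv layer's activations and their gradients via hooks.
feats = {}
def hook(module, inputs, output):
    feats["act"] = output
    output.register_hook(lambda g: feats.update(grad=g))
model[0].register_forward_hook(hook)

x = torch.randn(1, 3, 32, 32)
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the top class score

# Grad-CAM: channel weights = spatial mean of gradients; weighted sum + ReLU.
weights = feats["grad"].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((weights * feats["act"]).sum(dim=1)).squeeze(0)
cam = cam / (cam.max() + 1e-8)  # normalize to [0, 1] for overlaying
```

The resulting `cam` is a heatmap the same size as the conv feature map, which you would upsample and overlay on the input image to see where the model is looking.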

I would suggest looking at every misclassified image, not just those with a high loss. The reason is that what we care about is the metric, not the loss. The loss is just a surrogate objective we use to give the model feedback on how it is doing so that it can learn to improve.
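In other words, collect every wrong prediction, since the metric counts all mistakes equally. A minimal sketch (predictions, labels, and paths here are made up for illustration):

```python
# Collect every misclassified validation example, not only the high-loss ones.
def misclassified(preds, labels, paths):
    """Return (path, predicted, actual) for every wrong prediction."""
    return [(pth, p, y) for p, y, pth in zip(preds, labels, paths) if p != y]

# Toy example: two of four predictions are wrong.
errors = misclassified(
    preds=["audi", "ford", "jaguar", "ford"],
    labels=["audi", "ferrari", "jaguar", "renault"],
    paths=["1.jpg", "2.jpg", "3.jpg", "4.jpg"],
)
print(errors)  # [('2.jpg', 'ford', 'ferrari'), ('4.jpg', 'ford', 'renault')]
```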

Now, if we want to make a large impact on the metric, we need to understand the most frequent scenario in which the model is wrong. For instance, let’s say this is a summary table showing what we learned from our error analysis.

| count | scenario |
| ----- | -------- |
| 50 | foggy |
| 5 | multi-color |
| 2 | no background |
| 2 | person partially obscuring car |
In this case, it is quite clear that if we find a good way to improve the performance of our model on images taken in foggy weather, we could improve our metric by a wide margin. And that would be where we should focus first.
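Tallying the hand-assigned failure tags is a one-liner with the standard library; using the counts above:

```python
from collections import Counter

# Tally hand-assigned failure tags to find where fixing the model pays off most.
tags = (["foggy"] * 50 + ["multi-color"] * 5 + ["no background"] * 2
        + ["person partially obscuring car"] * 2)
counts = Counter(tags)
scenario, n = counts.most_common(1)[0]
print(scenario, n)  # foggy 50
```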

One situation where looking at pictures with high loss does make sense is discovering mislabeled data. Jeremy showed that in one of the lessons.


Thank you for the information!

You’ve shown an approach I had not considered too deeply: check which features the model is focusing on and which specific images it is having trouble with. I’ll be sure to use this mindset more often.

You also reminded me of the subtle, yet key difference between loss and metric :slight_smile: .

I’ll check out Grad-CAM too with the resources you’ve provided.

I’ve remade my blog in Quarto, so the link above will be broken.

You can view the same article with the following new link: