VGG: strengths and limitations?

I didn’t find any discussion on VGG’s history, strengths, and limitations, so I thought I’d start one in the hope that folks with more experience can share their insights.

First of all, I am astonished by how influential VGG is: the original paper, published in April 2015, has been cited 3,177 times. That’s on average 3+ papers per day citing it over the last two years!! These numbers seem to indicate that increased depth in network configurations is shaping how people think about deep learning and about building deep learning models. If that is the case, can we say the deeper the better? If not, is there such a thing as an optimal depth? Why stop at 19 layers (as in the original paper)? What are the other big ideas in deep learning besides going “very deep”?

The second part of my question is about VGG’s application in practice: when does it work really well, and when does it not work so well? Since most folks here have probably tested it on their own datasets, I am curious whether people would care to share their experience.

Finally, to set some context on training-time expectations: in their submission to the 2014 ImageNet challenge, Simonyan and Zisserman explained, “Our implementation is derived from the Caffe toolbox, but contains a number of significant modifications, including parallel training on multiple GPUs installed in a single system. Training a single ConvNet on 4 NVIDIA Titan GPUs took from 2 to 3 weeks (depending on the ConvNet configuration).” I for one am very grateful for Jeremy’s guidance on starting with sample data.

You can find the original paper here

ImageNet 2014 [results]


VGG was great for the results it attained back in 2014. It’s still taught and used because it’s simple to explain.

What are the cons?

  • It has a huge number of weight parameters, so the models are very heavy: 550+ MB of weights.
  • That also means long inference times.
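To see where that 550+ MB figure comes from, here’s a back-of-the-envelope parameter count for VGG-16 (configuration D in the paper: 13 conv layers plus 3 fully connected layers), assuming float32 weights at 4 bytes each:

```python
# VGG-16 conv layers as (in_channels, out_channels), all 3x3 kernels.
conv = [(3, 64), (64, 64),                    # block 1
        (64, 128), (128, 128),                # block 2
        (128, 256), (256, 256), (256, 256),   # block 3
        (256, 512), (512, 512), (512, 512),   # block 4
        (512, 512), (512, 512), (512, 512)]   # block 5

params = sum(3 * 3 * cin * cout + cout for cin, cout in conv)  # weights + biases

# Three fully connected layers; the first takes the flattened 7x7x512 feature map.
fc = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
params += sum(i * o + o for i, o in fc)

print(params)            # 138,357,544 parameters (~138 million)
print(params * 4 / 1e6)  # ~553 MB at 4 bytes per float32
```

Note that roughly 90% of those parameters live in the fully connected layers, which is why later architectures dropped them in favour of global pooling.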

Why not just make the model deeper?

  • An even heavier model
  • More training time
  • The vanishing gradient problem
  • And, this may surprise you: deeper networks can have higher test error and generalize worse if you simply stack more layers the way VGG does.
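As a toy illustration of the vanishing-gradient point (a deliberately simplified model, not the real backprop math): the gradient reaching the early layers of a plain network is roughly a product of per-layer factors, and if each factor is even slightly below 1, the product collapses with depth.

```python
# Toy model: gradient reaching layer 1 of a plain deep network is roughly
# a product of one factor per layer. A factor of 0.8 stands in for the
# combined effect of activation derivatives and weight scales.
def upstream_gradient(per_layer_factor, depth):
    g = 1.0
    for _ in range(depth):
        g *= per_layer_factor
    return g

for depth in (5, 19, 50, 152):
    print(depth, upstream_gradient(0.8, depth))
# By ~50 layers the signal is around 1e-5; at 152 layers it is ~1e-15,
# so the early layers receive essentially no learning signal.
```

Skip connections sidestep this by giving gradients a path that multiplies by 1 instead of by a small factor at every layer.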

Kaiming He and Jian Sun, two of the inventors of ResNet (Residual Networks), broke 5 records in 2015, including classification, localization, and others. The majority of CVPR 2016 papers used the ResNet architecture. Residual connections allowed them to train 152-layer neural networks.

ResNets take less memory, have faster inference times, and allow deeper networks to be trained. Based on your problem, you can decide how many layers you want for your accuracy and inference-time requirements. It’s rather simple to understand, too.
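To make the skip-connection idea concrete, here’s a minimal NumPy sketch of a residual block (a toy fully connected version, not the paper’s convolutional architecture): the block learns a residual F(x) and adds the input back, so it only has to learn a correction to the identity.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Toy residual block: y = relu(F(x) + x), with F(x) a 2-layer MLP."""
    h = np.maximum(0, x @ w1)       # first layer of F(x), with ReLU
    fx = h @ w2                     # second layer of F(x)
    return np.maximum(0, fx + x)    # skip connection: add the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))

# With zero weights, F(x) == 0 and the block reduces to (ReLU of) the
# identity. That's the key property: extra blocks can trivially "do
# nothing", so stacking many of them can't easily hurt training.
w1 = np.zeros((8, 8))
w2 = np.zeros((8, 8))
out = residual_block(x, w1, w2)
print(np.allclose(out, np.maximum(0, x)))  # True
```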

Here are the details and pretrained models. Most deep learning packages have pretrained models available in their model zoos.

I asked Kaiming why he trained 152 and not 153 layers. He said his GPU card ran out of memory :slight_smile:


Thank you, Anirudh! Your reply is packed with so much great information; it’s super helpful in accelerating my learning and in getting the big picture of AI research. When you mentioned that deeper networks can have higher test errors, can you expand on that? Why is that the case, and how are people fixing that problem currently?

Also, thank you so much for sharing your perspective on VGG. I’ve only used it on my test data, and my models are still very manageable. Now that I know about the bulky model size for full training, I can plan ahead and optimize my workflow accordingly.

So glad you mentioned ResNet. I just came across some amazing ResNet results recently, and now I am super curious to learn more about it. Have you personally used ResNet before? What do you like and not like about it? Forgive me if this is too many questions; I am pretty psyched about the rapid development from 19 to 152 layers in such a short time!

Btw, if anyone’s interested, here is Kaiming He’s ResNet tutorial at ICML along with the companion lecture notes.


On why training/testing error can be higher in a plain deep convolutional neural network than in a shallow one: see 32:00 in the video or slide 28. This is one of the phenomena that residual networks try to fix.

What do I like about it?
Because the architecture is simple, I can choose how many layers my network needs based on my use case. For example, to run it on mobile, I can benchmark how long inference takes at different depths and then choose the deepest model that fits within my constraints on runtime and accuracy. Of course, pretrained ResNet-18/34/50 networks are already available, so that’s a good start.

For example, on an iPhone 7 I might choose ResNet-34, on an iPhone 6 ResNet-18, and on my server ResNet-50.
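That benchmark-and-choose loop could be sketched roughly as follows; `pick_depth` and the candidate list are hypothetical placeholders, with sleeps standing in for real on-device forward passes through whatever framework you actually use:

```python
import time

def pick_depth(models, run, budget_ms):
    """Keep the deepest model whose measured latency fits the budget."""
    best = None
    for name, model in models:              # assumed ordered shallow -> deep
        start = time.perf_counter()
        run(model)                          # one forward pass on-device
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms <= budget_ms:
            best = name                     # this deeper model still fits
    return best

# Stand-in "models" whose runtime we fake with sleeps (in seconds), just
# to show the flow; replace with real model objects and an inference call.
candidates = [("resnet18", 0.001), ("resnet34", 0.005), ("resnet50", 0.2)]
print(pick_depth(candidates, time.sleep, budget_ms=50))  # prints resnet34
```

In practice you would average over several runs and a warm-up pass, since single timings on mobile hardware are noisy.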

Lastly, there have been papers discussing the power, parameter-count, and speed trade-offs involved. Microsoft’s Cognitive Services APIs had a ResNet-50 image tagger that ran inference in under 150 ms on a CPU for 2,000 tags. That was an eye-opening figure back when it was published.

With the pretrained ‘Inception’ architecture, shrinking or growing the network is a lot more work.


Your explanation of choosing the number of network layers for a customized use case is fantastic!! This is the first time I’ve learned about the practical concerns of deploying deep learning in an app production pipeline. Amazing stuff! (Or dare I say super cool!)

I’ve only covered the first 35 minutes of the talk so far; it’s very clear, and I really like the simple explanation of exploding/vanishing gradients.