Carvana Resnet34 vs. Vgg16

I’ve been working on comparing the performance of resnet34 to vgg16 on the Carvana image segmentation challenge.

Running out of memory

I ran into a few roadblocks getting the resnet34 model to train for complete epochs on the 512x512 and 1024x1024 images. It turned out that the way Python handles multi-threading changed between versions 3.5 and 3.6: batches that used to be loaded lazily were now loaded into memory greedily. The result was the consumption of over 70GB of system memory. Jeremy’s fix was to group the batches into chunks, limiting the size of each chunk based on the number of workers. After the fix, with 8 workers, training consumed about 10GB of memory.
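For anyone curious what that kind of chunking can look like, here is a minimal sketch. It is not the actual fastai fix; `chunked`, `iter_batches`, `load_batch` and `chunks_per_worker` are hypothetical names used only for illustration. The idea is that `Executor.map` submits all of its work items up front, so feeding it one bounded chunk at a time caps how many loaded batches can sit in memory.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def iter_batches(batch_ids, load_batch, num_workers=8, chunks_per_worker=10):
    """Yield loaded batches in order, one bounded chunk at a time.

    `load_batch` is a hypothetical function that loads a single batch.
    `chunks_per_worker` is an assumed tuning knob, not a value from the post.
    """
    chunk_size = num_workers * chunks_per_worker
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for chunk in chunked(batch_ids, chunk_size):
            # pool.map submits every item of `chunk` immediately, so at most
            # `chunk_size` batches are being loaded or held at any moment.
            yield from pool.map(load_batch, chunk)
```

Usage would be something like `for batch in iter_batches(range(n_batches), load_batch): ...`, with memory use then scaling with the chunk size rather than with the length of the epoch.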

Comparing model performance

I did a little refactoring of the carvana.ipynb notebook to make it easier to run the same model with different image sizes and to be able to change up the transfer model easily.

Having done that, I trained models with 128x128, 512x512 and 1024x1024 sized images using both Resnet34 and Vgg16. The hypothesis was that Vgg16 would perform better than Resnet34. Here is the notebook for the Vgg16 based model.

The table below shows the dice coefficient at the end of the training runs for each image size. Training for the larger image sizes started with the weights from the models trained on the smaller image sizes. The model was trained on the 128x128 images in batches of 64 for 25 epochs, then on the 512x512 images in batches of 16 for 13 epochs, and lastly on the 1024x1024 images in batches of 4 for 22 epochs.

Image size   Resnet34   Vgg16      % Error Improvement
128x128      0.963900   0.980969   47.3%
512x512      0.989929   0.993181   32.3%
1024x1024    0.983007   0.996000   76.5%
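For reference, here is a small sketch of how the numbers in the table can be computed. The dice function is the standard definition on flat binary masks. The error improvement formula is my assumption about how the last column was derived (relative reduction in 1 - dice going from Resnet34 to Vgg16), though it does reproduce the percentages in the table.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary (0/1) masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match.
    return 2.0 * intersection / total if total else 1.0


def error_improvement(baseline_dice, new_dice):
    """Relative reduction in error (1 - dice) from the baseline to the new model."""
    baseline_err = 1.0 - baseline_dice
    new_err = 1.0 - new_dice
    return (baseline_err - new_err) / baseline_err


# Dice coefficients from the table: (Resnet34, Vgg16) per image size.
results = {
    "128x128": (0.963900, 0.980969),
    "512x512": (0.989929, 0.993181),
    "1024x1024": (0.983007, 0.996000),
}

for size, (resnet34, vgg16) in results.items():
    print(f"{size}: {error_improvement(resnet34, vgg16):.1%}")
```

Looking at error rather than raw dice makes the gap easier to see, since all of the scores are already close to 1.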

As expected, the vgg16 based model performs much better than the resnet34 based one.

Next steps

My next step was going to be to dig into the architectures of the two models to try to identify why the vgg16 based network performed so much better. For starters, vgg16 is a much bigger network, with something on the order of 5x as many parameters.
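As a rough sanity check on the parameter count, here is a sketch that tallies the weights and biases of the standard ImageNet VGG16 (thirteen 3x3 conv layers plus three fully connected layers). This assumes the stock architecture with a 1000-class head; whether the notebook keeps the fully connected layers when it cuts the model for segmentation is not something I have verified.

```python
# Output channels of VGG16's 3x3 conv layers; "M" marks a 2x2 max pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def conv_params(cfg, in_channels=3, kernel=3):
    """Count weights plus biases of the conv layers described by `cfg`."""
    total = 0
    for out_channels in cfg:
        if out_channels == "M":
            continue  # pooling layers have no parameters
        total += in_channels * out_channels * kernel * kernel + out_channels
        in_channels = out_channels
    return total


def linear_params(sizes):
    """Count weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


conv_total = conv_params(VGG16_CFG)
# The classifier takes the flattened 512 x 7 x 7 feature map as input.
fc_total = linear_params([512 * 7 * 7, 4096, 4096, 1000])

print(f"conv layers: {conv_total:,}")
print(f"fc layers:   {fc_total:,}")
print(f"total:       {conv_total + fc_total:,}")
```

The total comes to roughly 138 million, against the roughly 21.8 million usually quoted for resnet34. Most of the vgg16 parameters sit in the fully connected layers, so how much of that size advantage carries over depends on which layers are kept as the encoder.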

This was an interesting short exploration of the difference in performance between two popular architectures.

Any suggestions on key areas to focus on in identifying why vgg16 performs better than resnet34 would be much appreciated.


One question I have about the comparison: how did you decide on the number of epochs to train for, and are you sure that both models are reaching (or getting close to) convergence? One thing that comes to mind is that, since resnet is a deeper model, it could take more epochs to train, but might be capable of getting better performance than vgg.
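One simple way to check this is to look at whether the validation dice has stopped improving over the last few epochs. Below is a minimal sketch; `has_plateaued`, `patience` and `min_delta` are hypothetical names and thresholds, not anything from the notebook.

```python
def has_plateaued(scores, patience=3, min_delta=1e-4):
    """Return True if the last `patience` epochs did not beat the earlier
    best validation score by at least `min_delta`.

    `scores` is a list of per-epoch validation scores (higher is better).
    """
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent - best_before < min_delta
```

Running the same check on the per-epoch scores of both models would show whether one of them was still improving when training stopped.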

FYI, some more progress has been made on this issue in the SF study group. If you have a chance to come by we can discuss in person. Otherwise we can share some notes.

I’m stuck down in Orange County, but would love to stay involved.

I was trying to use the same number of epochs for comparison purposes, but I get your point that the optimal number is likely different given the different architectures. I can keep cranking on them as is, but I noticed that I was still a good way off from a good result on the leaderboard and was thinking about adding in the unet architecture so as to shoot for the best possible end result.

I don’t think upsampling alone can get a good result. After Resnet/VGG, the information is quite compressed, and the upsampling layers simply don’t have enough information to reconstruct the mask.
I guess that because VGG has more parameters, it can remember the patterns of each part better, or it might simply have more features at the end of the VGG layers than Resnet does.

Anyway, without an architecture like UNET, there is not much point in benchmarking.
But I’m interested in using resnet, though my gut feeling is that the images are quite uniform and the goal is also simple, so the extra layers in resnet don’t help much.

BTW, you won’t get a good score without using full resolution. (I think somebody posted an analysis in the forum of the theoretical score limit with scaled-down images.) And to train on the full-resolution images, you’ll need extra tricks.
Check my earlier post for a summary of approaches from winning teams:


Yeah so we found that the VGG difference still appears in Unet. Another trick is to create a Unet cross-connection for the input pixels as well, not just the intermediate activations.

I believe that the cause of the issue is that VGG is trained with fully connected layers that have access to the 7x7 grid geometry. Resnet throws this away by averaging over it. So our next step is to try training Resnet on Imagenet, replacing the average pooling with flattening and then adding the same fully connected layers as VGG.
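To make the averaging point concrete, here is a toy sketch in plain Python (one channel, no deep learning library). The helper names are made up for illustration. Two 7x7 grids with a single activation in opposite corners look identical after global average pooling, but stay distinguishable after flattening.

```python
def one_hot_grid(row, col, size=7):
    """Build a size x size grid that is 1.0 at (row, col) and 0.0 elsewhere."""
    return [[1.0 if (r, c) == (row, col) else 0.0 for c in range(size)]
            for r in range(size)]


def global_avg_pool(grid):
    """Collapse one channel's grid into a single mean value."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)


def flatten(grid):
    """Keep every spatial position as its own feature."""
    return [v for row in grid for v in row]


top_left = one_hot_grid(0, 0)
bottom_right = one_hot_grid(6, 6)

# Averaging cannot tell where in the grid the activation was.
print(global_avg_pool(top_left) == global_avg_pool(bottom_right))  # True
# Flattening preserves the position, at the cost of 49x more features per channel.
print(flatten(top_left) == flatten(bottom_right))  # False
```

With 512 channels, that is 512 features after average pooling versus 512 x 7 x 7 = 25088 after flattening, which is the input size of VGG's first fully connected layer.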

Great, thank you both. That gives me lots to go on. Will take another turn and revert in the next day or so.

I was taking 25 minutes per epoch on the Paperspace P5000 machine. Is that in the ballpark?

The winners of the Carvana challenge wrote up a paper and published their code on GitHub, which includes vgg11, vgg16 and resnet34 based unet architectures.

TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation
MICCAI 2017 Robotic Instrument Segmentation


What would be the best way to try using these models with the fastai library?