Understanding DL Models and their Architectures (e.g. ResNeXt50, Inception, Cifar10)

Up to now, during the course and the Kaggle competitions, I have been experimenting with the different models available in the Fastai library on an ad-hoc basis. As we approach the last week of the course, it would be great to understand more about their architectures and their best use cases (CNNs, RNNs, structured data). The ImageNet-based models and their levels of complexity are easier to understand; others are not so clear-cut. Are you finding that some models perform better on certain projects than others, or are you building your own models from scratch for certain problems?


Here is some intuition I’ve gathered so far:


  • The ResNet-18 and ResNet-34 models have fewer parameters, so they’re less likely to overfit if you have fewer training examples. If you’ve got tens of thousands, feel free to use VGG-16, ResNet-101 and so on.
  • Generally for classification, ResNets are considered the state of the art. So always try one of the ResNet versions out.
  • The VGG-16 pretrained on ImageNet, strangely, is widely adopted for many other image tasks such as image segmentation, object localization etc.
  • Finally, the best model will almost always be an ensemble. So if it is sheer competition you gotta win, try them all out and average their predictions.
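
To make the ensemble point concrete, here is a minimal sketch of the simplest kind of ensemble: averaging the class probabilities predicted by several models and taking the argmax. The function names and the probability values are made up for illustration; they are not fastai APIs or real results.

```python
# Minimal sketch: average class probabilities from several models.
# All numbers below are made-up illustration values, not real model outputs.

def ensemble_average(prob_sets):
    """Average per-class probabilities across models.

    prob_sets: one entry per model; each entry is a list of samples,
    and each sample is a list of class probabilities.
    Returns one list of averaged probabilities per sample.
    """
    n_models = len(prob_sets)
    averaged = []
    for per_sample in zip(*prob_sets):  # the outputs of every model for one sample
        averaged.append([sum(ps) / n_models for ps in zip(*per_sample)])
    return averaged


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical outputs of three models on two images, three classes each.
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
model_c = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]

avg = ensemble_average([model_a, model_b, model_c])
predictions = [argmax(p) for p in avg]
print(predictions)
```

In this toy case model_a on its own would pick class 1 for the second image, while the averaged prediction is class 2. Weighted averages or voting are other options; plain averaging is just the easiest place to start.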

Language (NLP)
It seems that NLP is behind image tasks when it comes down to using NNs. However, here’s a bit:

  • If it seems like a problem could use RNNs, start with an LSTM. LSTMs have been (almost) universally used for advances in machine translation, text modelling etc.
  • In github/fastai, there’s an implementation of a more effective LSTM model, called the RNNEncoder. It applies dropout throughout the network in a way that is supposed to regularize the model much better than LSTMs with naive dropout. It is also based on this paper.
  • Google’s word2vec and GloVe seem to be for NLP what ImageNet-pretrained models are for CNNs, i.e. they represent a good starting point for embeddings for NLP tasks.
  • Finally, the different flavors of RNN architectures (many-to-one, one-to-many, one-to-one, and many-to-many) have enabled their usage in many different domains. You should check them out. I apologize because I can’t seem to find a good link to them, although I know that I’ve encountered these before.
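
As a rough illustration of using pretrained word vectors as a starting point, here is a sketch that parses vectors in the plain-text "word v1 v2 ..." layout used by the GloVe downloads and builds an initial embedding matrix for a task vocabulary. The helper names, the toy vectors and the random-init range for unknown words are all assumptions for the example, not part of any library.

```python
import random


def parse_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """One row per vocab word: the pretrained vector if available,
    otherwise a small random vector (an assumed fallback)."""
    rng = random.Random(seed)
    matrix, missing = [], []
    for word in vocab:
        if word in vectors:
            matrix.append(list(vectors[word]))
        else:
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
            missing.append(word)
    return matrix, missing


# Toy stand-in for a pretrained file; real files hold far more words
# and use much larger dimensions.
toy_file = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6", "sat -0.1 0.0 0.1"]
vectors = parse_vectors(toy_file)

vocab = ["the", "cat", "purred"]
matrix, missing = build_embedding_matrix(vocab, vectors, dim=3)
print(missing)
```

The resulting matrix would then be copied into the embedding layer of whatever model you train, much like loading ImageNet weights into a CNN before fine-tuning.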

Structured Data:
I really have no experience with this. I couldn’t even understand the Rossmann problem that was solved in the fastai lectures. There was so much feature engineering involved that I figured I’d have to watch the machine-learning videos.


My understanding is that Google’s word2vec and the GloVe vectors are machine learning techniques, not deep learning.

Word2Vec and GloVe are ways to create pre-trained embeddings for words, trained on Google News, Wikipedia or other corpora. You can think of these pre-trained embeddings as similar to pre-trained models for images.

Both Word2Vec and GloVe are based on neural nets (single layer). Hence they are shallow learning models, but still part of the deep learning family since we could use TensorFlow, Keras or PyTorch to create these embeddings :slight_smile:

We’ll be learning about resnet on Monday FYI.

That’s great, it would also be very helpful to understand how to create an ensemble of models using various methods. Thanks.

You might find this handy too. This paper compares various CNN architectures in terms of accuracy, performance, memory, speed of convergence etc. These become important considerations when targeting production.

An Analysis of Deep Neural Network Models for Practical Applications

Some researchers keep updating their line of architectures, e.g. the Inception and ResNet families keep evolving for classification, and the R-CNN and YOLO lines for object detection. Further architectures take inspiration from these and branch out.


Apropos of nothing: I just finished watching Andrew Ng’s Coursera Convnets week on object detection, and it seems like he has a (strong) preference for YOLO over R-CNN.


A trivial one

What’s the main purpose of the starting few layers in the ResNet architecture (conv-batchnorm-maxpool-relu)? I’m not sure, but are these called bottlenecks?

I couldn’t work out the reason, except that it’s to capture large-scale features?

Just regarding image classification problems, I personally try not to be overly “loyal” to any single model architecture. You may think you have the best one and end up ignoring others that could have performed better. So my general rule of thumb is to pretty much try them all out on each new dataset and empirically go with the one that performs best.

Sometimes there will be one model that clearly outperforms all the rest, but other times there is no clear “winner” and usually in that situation the best strategy is to take an ensemble of different models.

Another thing to consider is that not all architectures “like” the data in the same formats/representations. Just plugging in the same data to all models that are available won’t necessarily give you an accurate comparison. So you may need to test out many different image sizes, pre-processing steps, etc. before you can accurately come to any conclusion about which model is truly best for a specific problem.
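
A rough sketch of what that kind of comparison could look like: sweep every architecture over several image sizes and compare each architecture at its own best setting. `evaluate` is a placeholder for your own train-and-validate routine, and the scores in `fake_evaluate` are invented purely to show the mechanics.

```python
from itertools import product


def sweep(architectures, image_sizes, evaluate):
    """Score every (architecture, image size) combination."""
    return {
        (arch, size): evaluate(arch, size)
        for arch, size in product(architectures, image_sizes)
    }


def best_per_architecture(results):
    """For each architecture, keep the image size with the highest score."""
    best = {}
    for (arch, size), score in results.items():
        if arch not in best or score > best[arch][1]:
            best[arch] = (size, score)
    return best


# Placeholder scorer with invented validation accuracies (not real results).
# In practice this would train the model and return a validation metric.
def fake_evaluate(arch, size):
    made_up = {
        ("resnet34", 128): 0.90, ("resnet34", 224): 0.93,
        ("vgg16", 128): 0.91, ("vgg16", 224): 0.89,
    }
    return made_up[(arch, size)]


results = sweep(["resnet34", "vgg16"], [128, 224], fake_evaluate)
best = best_per_architecture(results)
for arch, (size, score) in best.items():
    print(arch, size, score)
```

With these invented numbers, comparing both models only at 128px would favour vgg16, while comparing each at its best size favours resnet34, which is exactly the trap described above. The same loop extends to pre-processing choices by adding another axis to the sweep.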


To clarify - they’re not the starting few layers of the network, but are blocks that are interspersed at regular intervals throughout the network.

We need them to allow us to have stride=2 layers, and also to force the network to find different types of features.
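
A small worked example of what the stride=2 layers do to the feature map size, using the usual conv output-size formula. The kernel, stride and padding values are the commonly used ImageNet ResNet settings and the "stage" names are just labels for this sketch; neither is stated in this thread, so treat them as assumptions.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output width/height of a conv or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


# (label, kernel, stride, padding): assumed standard ImageNet ResNet settings.
layers = [
    ("conv1 7x7, stride 2", 7, 2, 3),
    ("maxpool 3x3, stride 2", 3, 2, 1),
    ("stage 2 conv 3x3, stride 1", 3, 1, 1),
    ("stage 3 conv 3x3, stride 2", 3, 2, 1),
    ("stage 4 conv 3x3, stride 2", 3, 2, 1),
    ("stage 5 conv 3x3, stride 2", 3, 2, 1),
]

size = 224  # assumed input resolution
sizes = [size]
for label, kernel, stride, padding in layers:
    size = conv_out_size(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{label:28s} -> {size}x{size}")
```

Every stride=2 layer halves the width and height, so its output no longer has the same shape as its input. A plain identity shortcut (output = input + f(input)) cannot be added across that change, which is one way to see why the downsampling layers have to be handled differently from the ordinary residual blocks.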