Covering Capsule Networks in part 3

Hello @jeremy

I think it would be great to cover capsule networks in part 3. I know there are a lot of online resources, but none of them cover an implementation from scratch.

Also please discuss (live if possible)

Despite being better conceptually/in theory, why haven't capsule networks been widely adopted into new state-of-the-art models yet? Is it because of a lack of experimentation?

I tried incorporating capsule networks twice (in competitions), and they didn't work well compared to convolutional models (I am still learning, so it's possible I messed up somewhere).

(Everyone, please heart this topic if you want it answered on the live stream.)

Thank you so much.


Would love to hear Jeremy's thoughts on capsule networks.

Capsule networks are still an immature research idea and not very practical at this point. In theory, they are supposed to be more "biologically plausible" (i.e. more similar to our brain) and to have some desirable properties such as "robustness to affine transformations" (a fancy way of saying that if you rotate the image, rescale it, or have multiple overlapping objects, the model should still predict accurately even if those transformations were not covered in the training set). Also, instead of simply outputting a prediction distribution, a capsule outputs a vector (for each class) whose length encodes the prediction and which allows for reconstruction.
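To make "the output of a capsule is a vector" concrete, here is a minimal NumPy sketch of the squash nonlinearity from the original CapsNet paper (Sabour et al., 2017), which maps a capsule's raw vector to one whose length can be read as a probability. The function name and shapes are my own choices, not from any particular implementation.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink a capsule vector so its length lies in [0, 1).

    Long vectors are scaled to length ~1, short vectors to ~0,
    so the length acts as the probability that the entity the
    capsule represents is present; the direction encodes its pose.
    """
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

# A vector of length 5 keeps its direction; its length becomes 25/26 ~ 0.96.
v = squash(np.array([3.0, 4.0]))
print(np.linalg.norm(v))
```

Classification then just reads off the capsule with the longest output vector, and the vector itself can be fed to a decoder for reconstruction.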

I implemented the paper and tested it on many datasets, and I haven't found a single use case where ResNet or DenseNet wouldn't outperform the capsule network. That said, it's an interesting idea to watch in the future. I believe the authors summarised the state of capsule networks in the paper accurately:
“Research on capsules is now at a similar stage to research on recurrent neural networks for speech recognition at the beginning of this century. There are fundamental representational reasons for believing that it is a better approach but it probably requires a lot more small insights before it can out-perform a highly developed technology.”


Thanks for the great explanation.


great explanation

Great reply!

Anyone else curious about the question I asked here can read the paper titled Capsules for Object Segmentation. It's a very nice, recently published paper that explains how capsule networks are inefficient compared to convolutional models.

(pasting the gist here.)

The original capsule network architecture and dynamic routing algorithm is extremely computationally expensive, both in terms of memory and run-time. Additional intermediate representations are needed to store the output of "child" capsules in a given layer while the dynamic routing algorithm determines the coefficients by which these children are routed to the "parent" capsules in the next layer. This dynamic routing takes place between every parent and every possible child. One can think of the additional memory space required as a multiplicative increase of the batch size at a given layer by the number of capsule types at that layer. The number of parameters required quickly swells beyond control as well, even for trivially small inputs such as MNIST and CIFAR10. For example, given a set of 32 capsule types with 6 × 6, 8D-capsules per type, being routed to 10 × 1, 16D-capsules, the number of parameters for this layer alone is 10 × (6 × 6 × 32) × 16 × 8 = 1,474,560 parameters. This one layer contains, coincidentally, roughly the same number of parameters as our entire proposed deep convolutional-deconvolutional capsule network with locally-constrained dynamic routing which itself operates on 512 × 512 pixel inputs.
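The arithmetic in that excerpt is easy to verify: every (child, parent) pair gets its own transformation matrix mapping the 8-D child vector to a 16-D parent prediction. A quick sanity check (variable names are mine, not from the paper):

```python
# Parameter count for the fully-connected capsule layer in the excerpt:
# 32 child capsule types on a 6x6 grid (8-D each), each routed to
# all 10 parent capsules (16-D each) via its own 16x8 weight matrix.
num_parents  = 10
num_children = 6 * 6 * 32   # 1152 child capsules
parent_dim   = 16
child_dim    = 8

params = num_parents * num_children * parent_dim * child_dim
print(params)  # 1474560 -- matches the paper's 1,474,560
```

Note this counts only the transformation matrices; the routing coefficients themselves are computed on the fly, which is where the extra memory cost comes from.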

(Thanks to this forum post for posting about the paper. It's interesting to see whether it can really be useful after the proposed changes (SegCaps). Keras code is implemented here. AFAIK it hasn't been implemented in PyTorch yet.)


I implemented the paper in PyTorch 0.3.0 when it was published a year ago. I wanted a decent PyTorch implementation of CapsNet and couldn't find one at the point when I started, so I started this project as a random weekend hack to do just that.
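For anyone who wants the gist of what such an implementation has to do, here is a minimal NumPy sketch of the routing-by-agreement loop from the paper (dimensions chosen to match the MNIST DigitCaps layer; function and variable names are my own, not from my repo):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Scale a vector so its length lies in [0, 1), keeping its direction."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement over prediction vectors.

    u_hat: (num_children, num_parents, parent_dim) -- each child
           capsule's prediction for each parent capsule.
    Returns parent outputs of shape (num_parents, parent_dim).
    """
    n_child, n_parent, _ = u_hat.shape
    b = np.zeros((n_child, n_parent))              # routing logits
    for _ in range(num_iters):
        c = softmax(b, axis=1)                     # each child's couplings sum to 1
        s = np.einsum('ij,ijd->jd', c, u_hat)      # weighted sum per parent
        v = squash(s)                              # parent capsule outputs
        b = b + np.einsum('ijd,jd->ij', u_hat, v)  # reward child/parent agreement
    return v

# MNIST-sized example: 1152 PrimaryCaps children, 10 16-D DigitCaps parents.
u_hat = np.random.randn(1152, 10, 16)
v = dynamic_routing(u_hat)
print(v.shape)  # (10, 16)
```

The loop itself has no trainable parameters; the cost the SegCaps excerpt above complains about comes from materialising `u_hat` for every (child, parent) pair at every forward pass.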

The model was trained on the standard MNIST data. I have also tested it on other datasets such as CIFAR10.

The implementation comes with proper documentation and code commented using Python docstrings.

A pre-trained model and weights are available for download as well.

Total number of parameters (with the reconstruction network): 8,227,088 (about 8 million). Compared to ResNet, I think this is not very practical in terms of efficiency.

By reading the paper and digging deeper into the implementations, I arrived at similar insights to the way Daniel Havir put it.

Learned a ton of PyTorch along the way. The repo is showing its age now, though. Definitely needs some dusting soon :smiley: