Crazy Thoughts -- Residuals, Transfer, Capsules

TL;DR: the italics below


An idea popped up after reading a few papers and posts, specifically:

- the A.Saha post and A.Karpathy’s lecture on Residual Networks,
- the Ł.Kaiser paper on the domain–independence of Transfer Learning,

and after hearing about G.Hinton’s Capsule Networks via the blog post by Max Pechyonkin.

This is about rapid architecture development & transfer learning. There’s already work underway to use neural nets to design neural net architectures (Google Research Blog). That’s great, but even a perfectly tuned “AI Network Architect” faces one hard bottleneck: the new architecture has to be retrained on the entire base dataset every time a change is made.

This makes proven architectures, with their pretrained weights, especially valuable, and it pushes architecture development towards institutions with the resources to speed up training.

But every image classifier has to learn what a circle is, or what different angled lines and curves are… are there universal low–level features that just come with the territory?
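For context, here’s a rough PyTorch sketch of how those low–level features usually get reused today: take a pretrained model, keep its earliest layers, and freeze them. (torchvision’s resnet18 is just a stand-in example I’m assuming here, not something any of the sources above prescribe.)

```python
# Sketch: reuse and freeze the early, low-level conv layers of a pretrained model,
# so only the higher-level / task-specific parts need retraining.
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(pretrained=True)

# Keep only the earliest blocks -- roughly the filters that detect edges,
# corners, and simple curves.
low_level = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1,
)

# Freeze them: these "universal-ish" low-level features are not retrained.
for p in low_level.parameters():
    p.requires_grad = False
```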

Patching together what Residual Networks actually do (from the A.Saha post and A.Karpathy’s lecture), the apparent domain–independence of Transfer Learning (from the Ł.Kaiser paper), and what I’ve heard about G.Hinton’s Capsule Nets:

*Could we train a smaller/modular network on low–level Convolutional features, and then use that network as a Residual Module to place alongside the early layers of whatever architecture we’re prototyping, to speed up the training process?*

*I.e.: the network-module itself acting as a residual?*
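To make the wiring concrete, here’s a minimal PyTorch sketch of what I mean. Everything in it is hypothetical – the names, the 1x1 projection used to reconcile channel counts, and the assumption that the two branches produce feature maps of the same spatial size – it’s just the shape of the idea, not a tested design.

```python
import torch.nn as nn


class ResidualFeatureModule(nn.Module):
    """Hypothetical wrapper: a small, separately pretrained low-level feature
    network bolted on as a residual branch next to the early layers ("stem")
    of whatever architecture is being prototyped."""

    def __init__(self, pretrained_features: nn.Module, prototype_stem: nn.Module,
                 feat_channels: int, stem_channels: int):
        super().__init__()
        self.pretrained = pretrained_features  # trained once on low-level conv features
        for p in self.pretrained.parameters():
            p.requires_grad = False            # frozen -- not retrained for each prototype
        self.stem = prototype_stem             # the new architecture's own early layers
        # 1x1 conv (an assumption) to reconcile channel counts between the branches;
        # spatial sizes are assumed to already match.
        self.proj = nn.Conv2d(feat_channels, stem_channels, kernel_size=1)

    def forward(self, x):
        # The prototype's stem only has to learn what the pretrained module doesn't
        # already provide -- the pretrained module itself acts as the residual branch.
        return self.stem(x) + self.proj(self.pretrained(x))
```

The hope being that each architecture tweak starts from a stem that effectively already “knows” circles, edges and curves, so far less of the base dataset training has to be repeated.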


I haven’t read the G.Hinton papers yet, or the blog post by M.Pechyonkin, but I’ve heard they’re trying to capture some ‘universal–ish’ information about the world, instead of learning from scratch every time.

I may do a more in-depth writeup on this later, but for now I’m interested in what @jeremy & others may have to say.

I don’t immediately see why architecture design can’t be automated – it’s already moving in that direction, along with hyper–parameter tuning (look at fastai’s own learning-rate scheduler) – and seeing that AlphaGo’s architecture as of ~Jan 2016 wasn’t nearly as complicated as I expected (from the end of A.Karpathy’s lecture)… it feels like removing bottlenecks like this one is important.