In lesson 3 it was mentioned that we don’t want to specify the filters to use (like top edge detection etc), since the network does a better job learning the correct filters itself, but I’m having some trouble reconciling that with my naive intuition about it.
I can understand why you wouldn’t want to constrain the 2nd layer’s input to only the preconceived features, but it seems strange that there isn’t value in having a subset of the filters in the first layer be fixed to some “orthogonal”/linearly independent set of primitive filters to act as a basis, in addition to a randomly initialized, learned set in the same layer. For example, filters that pick out horizontal & vertical lines, and filters that detect curves, seem like pretty fundamental and useful building blocks for any network doing image recognition, regardless of whatever higher-level features you were looking for. You could still have the learned filters in the first layer greatly outnumber the fixed ones, so I wouldn’t think you’d be stifling the options for alternative low-level feature extraction.
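To make the idea concrete, here is a small NumPy sketch of the kind of fixed “building block” filters I mean: the standard Sobel kernels for vertical and horizontal edges, applied with a plain cross-correlation loop (the toy image and helper function are just for illustration).

```python
import numpy as np

# Standard Sobel kernels: sobel_x responds to vertical edges,
# sobel_y (its transpose) to horizontal edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation with an explicit loop."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image with one vertical edge: left half 0, right half 1.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

# sobel_x fires along the edge; sobel_y gives exactly zero everywhere.
vertical_response = correlate2d(img, sobel_x)
horizontal_response = correlate2d(img, sobel_y)
```

Each fixed kernel only responds to its own orientation, which is exactly the “basis of primitives” intuition above.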
In short, why wouldn’t it be better to give the network a solid set of those building blocks to use in tandem with whatever it comes up with on its own for the first layer?
I do not have a good theoretical answer to your question. One could hypothesize that learning works better than handcrafting because each filter learns what would be best for it to do given what all the other filters are doing, but then again that is not necessarily a property you always want, as it might lead to overfitting the training set.
One could also hypothesize that learning allows the model to come up with the filters most useful for the task at hand, rather than ones that are ideal in some objective sense. But for low-level filters, and given the complexity of images, I am not sure how strongly that argument holds either.
On the other hand, there is absolutely nothing preventing you from trying out hand-crafted filters. You could try them on their own - just slapping a set of fully connected layers on top - or you could try them in conjunction with filters that can learn. I am not yet 100% sure how to do it in Keras, but from a theoretical perspective such a model should be easy to build. Sounds like that would be a fun project!
Well, those are my two cents for what they are worth. I’m still relatively new to all this, so please take my words with a grain of salt!
I think it’s less that it’s strictly better to let the network learn the first-layer weights from scratch and more that it usually isn’t worth the effort to set them by default.
Most of the first-layer filters end up approximating Gabor filters, so you could try initializing with those. You would, however, need to freeze that first layer when you initially start training; otherwise your preset filters will get thrashed by the gradients from the random initialization of the later layers.
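For reference, a Gabor kernel is just a sinusoid under a Gaussian envelope, so building a small bank to initialize with takes a few lines of NumPy. This is a sketch using the standard real-valued Gabor formulation; the parameter defaults (`sigma`, `lambd`, etc.) are arbitrary choices for illustration.

```python
import numpy as np

def gabor_kernel(size, theta, sigma=2.0, lambd=4.0, gamma=0.5, psi=0.0):
    """Real-valued Gabor filter: a cosine carrier at orientation theta,
    modulated by a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates into the filter's orientation.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lambd + psi)
    return envelope * carrier

# A small bank at four orientations, e.g. to seed some first-layer filters.
bank = np.stack([gabor_kernel(7, theta)
                 for theta in np.linspace(0, np.pi, 4, endpoint=False)])
```

You would copy `bank` (reshaped to the conv layer’s expected kernel shape) into the layer’s weights before freezing it.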
I don’t think there’s a way to freeze part of a layer (at least not in vanilla Keras), so you would have to freeze the entire first layer even if some filters were random. After a few epochs you could then unfreeze that layer and train the whole thing.
@radek - you can get and set weights in Keras. The easiest way to see the right format is to call get_weights on the layer in question, then tweak those arrays and feed them back with set_weights.
Manually convolving/cross-correlating in NumPy works too, but then those features won’t be trainable.
Yeah, I was thinking it would be difficult to only partially freeze the first layer, while still letting the rest train. If I get some time I may try to figure out a way to make it work, but it would be good to know if there’s some theory out there that says it’s a fool’s errand.
One way to implement it might be to have the first layer consist of x hand-crafted filters and the remaining y filters just be no-ops / identity, so that they pass straight through to the next layer. I could then freeze that layer and have the subsequent layer have x+y filters and train from there. I don’t know if that could cause problems, since the 2nd layer would have a mixture of raw pixels + low-level features (so 1 abstraction step removed), but it might give something interesting.
If anyone else has insights into this approach (or whether it’s just a bad idea) I’d love to hear it.
Actually, just realized there’s an easy way to get a “partially trainable layer” - just split it into two convolutional layers with the same input and do a concat merge at the end.
You can then set one layer to be trainable, the other not, and after the merge it will be like one partially trainable layer.
(note: this requires the functional API)
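A minimal sketch of that two-branch construction with the functional API (written here against tf.keras; the filter counts are arbitrary, and the random `fixed_kernels` array is a stand-in for real hand-crafted filters such as the Gabors mentioned above):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(32, 32, 1))

# Branch 1: hand-crafted filters, frozen via trainable=False.
fixed_conv = layers.Conv2D(4, 3, padding="same", trainable=False)
# Branch 2: ordinary learnable filters on the same input.
learned_conv = layers.Conv2D(12, 3, padding="same")

# Concatenate along the channel axis: together the two branches
# behave like one partially trainable 16-filter layer.
merged = layers.Concatenate()([fixed_conv(inp), learned_conv(inp)])
out = layers.Conv2D(16, 3, padding="same")(merged)
model = keras.Model(inp, out)

# Overwrite the frozen branch with preset kernels.
# Kernel shape is (height, width, in_channels, n_filters).
fixed_kernels = np.random.randn(3, 3, 1, 4)  # stand-in for Gabors etc.
fixed_conv.set_weights([fixed_kernels, np.zeros(4)])
```

During training only the `learned_conv` branch (and everything downstream) gets gradient updates, while the preset filters stay fixed.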
Nice, I’ll need to check that out!