I have been reading up a little on parameter prediction networks (networks that predict the weight matrices for other networks) in this paper and this paper. A very simplified example of this would include two networks:
- a base network with a single weight matrix containing a very large number of parameters (call it W).
- an auxiliary network whose output is a matrix of the same shape as W.
The output from the auxiliary network replaces the weight matrix in the base network. Both networks are trained at the same time. In Keras this would look like one network, potentially with multiple inputs and outputs.
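Stripped of any framework, the setup above is just two function compositions trained together. Here is a minimal numpy sketch of the forward pass (all names and sizes are mine, chosen for illustration; the auxiliary network is a single linear map for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the base layer's kernel W is (d_in, d_out);
# the auxiliary network predicts it from some conditioning input z.
d_in, d_out, z_dim = 6, 3, 4

# The auxiliary network's own (small) parameters: a linear map
# from z to the flattened entries of W.
V = rng.standard_normal((z_dim, d_in * d_out)) * 0.1

def predict_weights(z):
    """Auxiliary network: map conditioning vector z to a (d_in, d_out) matrix."""
    return (z @ V).reshape(d_in, d_out)

def base_forward(x, W):
    """Base network: one linear layer whose kernel W is predicted, not stored."""
    return np.tanh(x @ W)

z = rng.standard_normal(z_dim)
x = rng.standard_normal((5, d_in))   # a batch of 5 examples
W = predict_weights(z)               # W replaces the base layer's weight matrix
y = base_forward(x, W)
print(y.shape)  # (5, 3)
```

Since W is a differentiable function of V, gradients from the base network's loss flow back into the auxiliary network's parameters, which is what lets both be trained at the same time.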
Just for fun, I wanted to see if I could write up a parameter prediction network in Keras. Assuming I train the base network and auxiliary network at the same time, I wasn't sure how to tie them together. More specifically, I can grab the output tensor of the auxiliary network; I then need to use that tensor as the weight matrix in a Keras layer. I looked around at Keras, and while it appears possible to set the initial values of a weight matrix, it doesn't appear possible to wrap a layer around an existing tensor and use that tensor as the layer's weight matrix. I'm also unsure whether the shape of the output tensor will match the shape required for the weight matrix used in the base network.
Assuming end-to-end training, is it possible to grab the output tensor from one layer or model in Keras and use it as a weight matrix in another layer? Assuming I can somehow load a tensor into a layer, will I run into tensor shape issues? Any help with this would be much appreciated.
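One way to express this, as a sketch rather than a definitive answer: skip built-in kernels entirely and do the matrix multiply yourself in a `Lambda` layer, feeding it both the data and the predicted weights. Everything below assumes `tf.keras`; the layer names and dimensions are mine:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical dimensions: the base layer maps d_in -> d_out,
# so the auxiliary network must emit d_in * d_out values.
d_in, d_out, aux_dim = 8, 4, 16

# Auxiliary network: predicts a flat weight vector, reshaped to (d_in, d_out).
aux_in = layers.Input(shape=(aux_dim,), name="aux_in")
w_flat = layers.Dense(d_in * d_out, name="w_flat")(aux_in)
We = layers.Reshape((d_in, d_out), name="We")(w_flat)

# Base network: a Lambda layer multiplies its input by the predicted matrix,
# playing the role of a Dense layer whose kernel is We. Note this version
# predicts one matrix per example in the batch ("bio" has a batch axis).
base_in = layers.Input(shape=(d_in,), name="base_in")
out = layers.Lambda(
    lambda t: tf.einsum("bi,bio->bo", t[0], t[1]), name="predicted_dense"
)([base_in, We])

model = Model(inputs=[base_in, aux_in], outputs=out)
model.compile(optimizer="adam", loss="mse")

x = np.random.randn(3, d_in).astype("float32")
z = np.random.randn(3, aux_dim).astype("float32")
y = model.predict([x, z], verbose=0)
print(y.shape)  # (3, d_out)
```

Because `We` is an ordinary output tensor of the auxiliary sub-graph, gradients flow through the `Lambda` into the `Dense` that produced it, so both sub-networks train end-to-end. Shape issues reduce to making the auxiliary network's last layer emit exactly `d_in * d_out` units before the `Reshape`.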
My first read through the gokceneraslan/dietnet repo didn't suggest they had implemented the shared weight matrix in a way that actually reduces the parameter count, though I have to admit I read it rather quickly.
In the other repo, the code by the author of the Diet Networks paper felt closer to the implementation described in the HyperNetworks paper than to the simplified architecture she described in her own paper.
These models compile but I haven’t had a chance to run them yet.
Some notes:
I don’t know how you’d train this in any sort of mini-batch fashion.
I wasn’t totally confident in how I handled reshaping the Embedding layer outputs in the auxiliary network. I’d love feedback.
Obviously, I used a Lambda layer to use the output of the auxiliary network, We, as a weight matrix in the base network. I believe this works, but again, I'd love a second opinion.
If you do full-batch learning (which the current code does), you likely don't need two different inputs. I believe you can just reuse and transpose the original input in doDot.
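To make that last note concrete, here is a numpy sketch of the full-batch, Diet-Networks-style trick (illustrative only, not the paper's code; all names are mine). The auxiliary network consumes the transposed input, one row per *feature*, and its stacked per-feature embeddings form We, which is exactly the kernel the base network's first layer needs:

```python
import numpy as np

rng = np.random.default_rng(1)

# X is (n_samples, n_features), with far more features than samples,
# the regime Diet Networks targets.
n, p, d = 20, 100, 8                    # samples, features, hidden units

X = rng.standard_normal((n, p))
U = rng.standard_normal((n, d)) * 0.1   # auxiliary net's own parameters

# doDot-style step: reuse the transposed input. Each row of X.T is one
# feature's profile across all samples; embedding it gives one row of We.
We = X.T @ U                            # (p, d): the predicted kernel
h = X @ We                              # (n, d): base net's first hidden layer
print(We.shape, h.shape)
```

The parameter saving is the whole point: the auxiliary net stores n * d values instead of the p * d a plain Dense kernel would need, a big win whenever p >> n. It also shows why mini-batching is awkward here: We depends on every sample in X, so a mini-batch would predict a different We than the full batch does.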