I am experimenting with a convolutional network, and I noticed that if I make the big dense layer at the end too large, my loss doesn’t go down as fast as with a smaller layer. I get much better performance from a model with 256 dense nodes than from one with 1024. Is there any guideline for how many nodes to put in that final dense layer?
There is a famous result in DL (the universal approximation theorem) that a network with a single hidden layer can approximate any continuous function, provided the layer is large enough… To your question, a large dense layer has two effects:
- It acts like a small learning rate: you can get better results in the end, but at the cost of slower training, because there are many more parameters to fit.
- It under-regularizes the approximating function, since you allow the parameters of this layer to move in a much higher-dimensional space. To compensate, apply dropout with a fairly large rate and/or L1-L2 regularization.
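The dropout idea above can be sketched in plain NumPy (the 50% rate, batch size, and 1024-unit width are just illustrative values, not anything from your model):

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training, and rescale the survivors so the expected activation
    stays the same at inference time."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
acts = np.ones((4, 1024))   # pretend output of a 1024-node dense layer
dropped = dropout(acts, rate=0.5, rng=rng)
```

Roughly half the units end up zeroed, while the surviving ones are scaled by 2, so the mean activation is preserved; at larger layer widths this kind of regularization becomes increasingly important.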
Quick update on this issue. I didn’t find any guidelines, so I ran some tests with 64, 128, 256, and 512 nodes in the dense layer, comparing accuracy, validation accuracy, and training time. I was surprised by how accurate the 64-node version was: within very few epochs it reached accuracies I had hoped adding more nodes would get me to. It’s doing so well that I’m going to experiment with removing the dense layer altogether.
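For what it’s worth, the parameter count of that final dense layer grows linearly with its width, which is part of why the bigger versions train more slowly. A quick sketch (the 4096-feature flattened conv output is an assumed example, not your actual network):

```python
def dense_params(n_in, n_out):
    """Parameter count of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

n_features = 4096  # hypothetical size of the flattened conv output
for width in (64, 128, 256, 512, 1024):
    print(width, dense_params(n_features, width))
```

Going from 64 to 1024 nodes multiplies this one layer’s parameters by 16 (262,208 vs 4,195,328 here), so the smaller layer doing well isn’t too shocking in hindsight.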
I’ll post back if I have any more interesting results.