The second step here defines and initializes a CNN. However, can someone help me understand how the output for the first conv2D layer is 64? Likewise, I don’t understand the source for the (9216, 128) numbers used to define the first fully connected layer either. Can someone explain these?
I recommend spending some time studying CNNs - for example, the deeplearning.ai videos on YouTube explain the shapes very well imo.
The number of output channels of any conv layer is whatever you want it to be; it's a design choice, not something derived from the input. This is a pretty standard "old-school" CNN, where you double the number of output channels from one conv layer to the next (32 -> 64) and do maxpool between layers.
The input size of the first FC layer has to match the flattened output of the layer right before it - in this case a 12x12 feature map with 64 channels, so 12x12x64=9216.
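If it helps, here is a rough sketch of where the 12x12 comes from, using the standard conv/pool output-size formula in plain Python. The layer settings are my assumptions about the tutorial (28x28 MNIST input, two 3x3 convs with stride 1 and no padding, then a single 2x2 maxpool), and `conv_out` is just a helper name made up for this post, so check them against the actual code.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed settings, not stated in this thread: 28x28 input,
# two unpadded 3x3 convs with stride 1, then one 2x2 max pool.
size = 28
size = conv_out(size, kernel=3)            # conv1: 28 -> 26
size = conv_out(size, kernel=3)            # conv2: 26 -> 24
size = conv_out(size, kernel=2, stride=2)  # max pool: 24 -> 12

channels = 64  # output channels of the second conv layer
fc_in = size * size * channels  # flattened size fed to the first FC layer
print(size, fc_in)  # 12 9216
```

The 128 in (9216, 128) is not derived from anything: like the conv channel counts, it is just the chosen width of the hidden FC layer.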
Thanks, this helped a lot. I was confused even then until I realized the tutorial code does not have a maxpool after the first conv2D layer, only after the second. Is this commonplace too?
Maxpool is basically a way of reducing computational load: a 2x2 maxpool cuts the number of activations in each feature map by 4x. On small inputs it's therefore much less important to downsize aggressively (224x224, the standard ImageNet input size, has 64x as many pixels as the 28x28 MNIST input here). I think maxpool isn't very common these days, but feel free to experiment with maxpool (and with conv stride) and see what works out best for you.
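To put rough numbers on that, here is a small plain-Python sketch comparing a 2x2 maxpool with a stride-2 conv on a 24x24 feature map. The 24x24 size and the 3x3 kernel with padding 1 are illustrative assumptions on my part, and `out_size` is a made-up helper, not a library function.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

h = 24  # assumed feature map height/width before downsampling

pooled = out_size(h, kernel=2, stride=2)              # 2x2 max pool
strided = out_size(h, kernel=3, stride=2, padding=1)  # 3x3 conv, stride 2
print(pooled, strided)  # 12 12

# Either option leaves 4x fewer activations per channel.
reduction = (h * h) / (pooled * pooled)
print(reduction)  # 4.0

# Pixel-count ratio between an ImageNet-sized and an MNIST-sized input.
ratio = (224 * 224) // (28 * 28)
print(ratio)  # 64
```

Both routes shrink the map to the same size here; the difference is that the strided conv has learned weights while the maxpool just keeps the largest value in each window, which is why it's worth trying both.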