I have gone over all the non-Swift videos, and I am working on creating a model to train on images (ImageNet). But I am stuck on one part of my conv layer.
I will resize the images to 241x241x3, and my first layer will be an nn.Conv2d(3, 32, 3, stride=1) (no padding anywhere) and possibly an activation (I want to experiment with tanh), so that I then have a 239x239x32 output.
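For concreteness, a minimal sketch of that stem (the `stem` name is mine, just for illustration), showing the shape change from 241x241 to 239x239:

```python
import torch
import torch.nn as nn

# 3x3 conv, no padding, stride 1, followed by the tanh I want to try
stem = nn.Sequential(nn.Conv2d(3, 32, 3, stride=1), nn.Tanh())

x = torch.randn(1, 3, 241, 241)   # one resized image, NCHW
print(stem(x).shape)              # torch.Size([1, 32, 239, 239])
```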
Then I plan on using steps that consist of an nn.Conv2d(32, 32, 1, stride=1), followed by my trouble layer, then an activation.
I want to take the HxWx32 input and use (3,3,3) kernels with stride 1 that cover the whole HxW but only see three channels at a time. I want to use 32 such kernels so that each kernel produces exactly one output channel (this is similar to groups, but I want overlap). Also, the 32 channels would connect in a ring, so the top and bottom of the channel stack don't need padding. The idea is that the 1x1 conv would find the interesting features across the channels and essentially create a feature map; the second conv would then only look at three of these at a time while also expanding the HxW view. These two would be followed by a tanh activation. Each such step would reduce H and W by 2, since no padding is used. I could really use help implementing this conv layer. I would use 4 of these steps, then go to my next stage.
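As far as I know there is no built-in layer that does exactly this, but here is one way it could be sketched: wrap the channel axis into a ring with torch.cat, then apply one independent (3,3,3) kernel per output channel via F.conv2d in a loop (slow but clear; the class name, init, and loop are my own choices, not a standard API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RingConv2d(nn.Module):
    """One (3,3,3) kernel per output channel, each seeing only three
    adjacent channels, with the channel axis closed into a ring."""
    def __init__(self, channels=32, k=3):
        super().__init__()
        self.channels = channels
        # one independent kernel per output channel: (C, 3, k, k)
        self.weight = nn.Parameter(torch.empty(channels, 3, k, k))
        self.bias = nn.Parameter(torch.zeros(channels))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)

    def forward(self, x):                                  # x: (N, C, H, W)
        # circular pad along the channel axis so the ring closes
        xp = torch.cat([x[:, -1:], x, x[:, :1]], dim=1)    # (N, C+2, H, W)
        outs = [F.conv2d(xp[:, c:c + 3],                   # 3 adjacent channels
                         self.weight[c:c + 1],             # this channel's kernel
                         self.bias[c:c + 1])
                for c in range(self.channels)]
        return torch.cat(outs, dim=1)                      # (N, C, H-2, W-2)

x = torch.randn(2, 32, 9, 9)
print(RingConv2d(32)(x).shape)  # torch.Size([2, 32, 7, 7])
```

If you were happy to share one kernel across all 32 channels instead of having 32 distinct kernels, you could instead unsqueeze a depth dim and use nn.Conv3d(1, 1, 3) with circular padding on the channel axis; the per-channel version above matches the "32 kernels, one output each" description.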
Next comes nn.Conv2d(32, 32, 3, stride=3). I use stride 3 because I don't like how a 3x3 kernel with stride 2 covers center pixels once, edge pixels twice, and corner pixels four times; you are adding extra importance to arbitrary pixel locations. This cuts my height and width by a factor of 3. I will call this step 2. So my architecture would be the initial conv followed by (step1 x4, step2) x3, which brings the input from 241x241x3 down to 5x5x32, at which point I go into a fully connected section, which I can describe in more detail if anyone is interested.
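The spatial-size arithmetic for that plan can be checked with a few lines (each step1 loses 2 pixels from the unpadded 3x3 conv, the 1x1 conv changes nothing, and step2 divides by 3):

```python
size = 241
size -= 2                # initial 3x3 conv, no padding: 241 -> 239
for block in range(3):   # (step1 x4, step2) x3
    for step in range(4):
        size -= 2        # 1x1 conv (no change) + ring 3x3 conv (-2)
    size //= 3           # step 2: 3x3 conv, stride 3, no padding
print(size)              # 5, i.e. a 5x5x32 map going into the FC section
```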