I am trying to use the custom regression head (`head_reg`) for a ResNet34 model, as shown below, where pictures of 384 by 288 pixels were used. I do not understand where the input size of the linear layer (64 * 12 * 9) and its output size (6144) come from.
In that example, the number of keypoints was 12. How do I reflect this in my model if I am going to detect only two keypoints and resize my images to 224 x 224 pixels?
Link to the example: https://towardsdatascience.com/hand-keypoints-detection-ec2dca27973e
head_reg = nn.Sequential(
    nn.Linear(64 * 12 * 9, 6144),
    # ... (remaining layers of the head, omitted here)
)
learn = create_cnn(data, arch, metrics=[my_acc, my_accHD],
                   loss_func=F.l1_loss, custom_head=head_reg)
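One thing I tried, to check the 12 * 9 part: my assumption (not stated in the article) is that a ResNet34 backbone downsamples the input by a total stride of 32, so 12 x 9 would simply be 384/32 x 288/32. A quick sketch of that guess:

```python
# My assumption: a ResNet34 backbone reduces spatial size by a total stride
# of 32 (conv1 /2, maxpool /2, then three stages with stride 2), so the last
# feature map would be H/32 x W/32.
def feature_map_size(h, w, total_stride=32):
    """Spatial size of the backbone's final feature map, assuming total_stride."""
    return h // total_stride, w // total_stride

print(feature_map_size(384, 288))  # -> (12, 9), matching the 12 * 9 in the head
print(feature_map_size(224, 224))  # -> (7, 7), what I'd expect for my images
```

If this guess is right, the 12 x 9 is determined by the input resolution, not by the number of keypoints.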
As I understood it, a filter of 12 x 9 was assumed, and the output size was taken to be the product 64 * (K = 96). I am not sure, but the number 96 may be the convolutional "volume depth" described here: https://cs231n.github.io/convolutional-networks/#pool.
Can K = 96 be treated as a default value that applies to other cases, such as mine where only 2 keypoints are to be detected, or should a different value be assumed?
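To make my question concrete, here is how I would guess the head should be adapted for my case (2 keypoints, 224 x 224 images). The `nn.Flatten`, the hidden width 1024, and the final size 2 * 2 = 4 are all my assumptions, not taken from the article. Is this the right way to think about it?

```python
import torch.nn as nn

n_keypoints = 2  # my case: only 2 keypoints instead of 12

# Hypothetical adaptation for 224 x 224 inputs (my guess, not from the article):
# 224 / 32 = 7, so the flattened feature size becomes 64 * 7 * 7.
head_reg = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 1024),       # hidden width 1024 is an arbitrary assumption
    nn.ReLU(),
    nn.Linear(1024, n_keypoints * 2),  # (x, y) per keypoint -> 4 outputs
)
```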