Keypoint Detection

I am trying to use the reg_head for a resnet34 model, as included below, where pictures of 384 by 288 pixels were used, but I do not understand where the input size (64 * 12 * 9) and output size (6144) come from.

In this example the number of keypoints was 12; how do I reflect this in my model if I am going to detect only two keypoints, resizing my images to 224 x 224 pixels?

Link to the example:

head_reg = nn.Sequential(
    nn.Linear(64 * 12 * 9, 6144),
    nn.Linear(6144, 24),
)
learn = create_cnn(data, arch, metrics=[my_acc, my_accHD],
                   loss_func=F.l1_loss, custom_head=head_reg)
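To check my understanding of where these sizes might come from, I wrote out the arithmetic, assuming (and this is only my assumption, which I would like confirmed) that the 12 x 9 spatial size comes from dividing the 384 x 288 input by the network's overall stride of 32:

```python
# Assumed: the final feature map is the input downsampled by a stride of 32
h, w = 384 // 32, 288 // 32   # 12, 9

in_features = 64 * h * w      # 64 * 12 * 9 = 6912, the head's input size
hidden = 6144                 # the hidden width; note 6144 = 64 * 96
out_features = 12 * 2         # 12 keypoints, an (x, y) pair each = 24

print(h, w, in_features, hidden, out_features)
```

This is where my confusion comes from: 64 * 12 * 9 = 6912, which is not 6144, so the 6144 does not seem to follow directly from the feature-map size.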

As I understood it, a spatial map of 12 x 9 was assumed. The output size was assumed to be the product 64 * (k = 96). I am not sure, but the number 96 may be the depth of the convolutional volume assumed here.

Could the value k = 96 be treated as a default that applies to other cases, such as mine, where only 2 keypoints are to be detected? Or should different values be assumed?
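For comparison, this is the head I would naively write for my own case, assuming the feature map for a 224 x 224 input is 64 x 7 x 7 (since 224 // 32 = 7) and keeping the hidden width 6144 unchanged, both of which are assumptions I am unsure about:

```python
import torch
from torch import nn

# My naive adaptation for 224 x 224 images and 2 keypoints.
# Assumptions (unverified): final feature map is 64 x 7 x 7,
# and the hidden width 6144 can be kept as-is.
head_reg_2kpts = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 6144),
    nn.Linear(6144, 2 * 2),   # 2 keypoints x (x, y) = 4 outputs
)

x = torch.randn(1, 64, 7, 7)  # dummy feature map with the assumed shape
out = head_reg_2kpts(x)
print(out.shape)              # torch.Size([1, 4])
```

Is this the right way to adapt the head, or should the 6144 also be scaled when the input resolution and keypoint count change?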