I am reading the TagSpace paper and I am having some difficulty understanding `Section 3 - Convolutional Embedding Model` and the figure provided for it.

> A convolution layer is then applied to the `l × d` input matrix, which considers all successive windows of text of size `K`, sliding over the document from position `1 to l`. This requires a further `Kd × H` weights and `H` biases to be learned. To account for words at the two boundaries of the document we also apply a special padding vector at both ends. In our experiments `K` was set to 5 and `H` was set to `1000`. After the convolutional step, a `tanh` nonlinearity followed by a max operation over the `l × H` features extracts a fixed-size (`H`-dimensional) global feature vector, which is independent of document size. Finally, another `tanh` non-linearity followed by a fully connected linear layer of size `H × d` is applied to represent the entire document in the original embedding space of `d` dimensions.
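To check my reading of the dimensions, I tried sketching the whole section in NumPy (variable names and the toy sizes `l`/`d` are my own; only `K = 5` and `H = 1000` come from the paper, and I shrank `H` here to keep it fast):

```python
import numpy as np

l, d, K, H = 10, 8, 5, 16  # toy sizes; the paper uses K = 5, H = 1000

x = np.random.randn(l, d)             # document: l word embeddings of dimension d
pad = np.zeros(((K - 1) // 2, d))     # "special padding vector" at both ends
xp = np.vstack([pad, x, pad])         # padded input, shape (l + K - 1, d)

W = np.random.randn(K * d, H)         # the Kd × H convolution weights
b = np.random.randn(H)                # the H biases

# each window of K consecutive rows is flattened to length Kd
# and mapped to H features, giving l windows in total
conv = np.stack([xp[i:i + K].reshape(-1) @ W + b for i in range(l)])
print(conv.shape)                     # (l, H)

feat = np.tanh(conv).max(axis=0)      # max over positions -> H-dim global feature
Wo = np.random.randn(H, d)            # final H × d fully connected layer
doc = np.tanh(feat) @ Wo              # d-dim document embedding
print(doc.shape)                      # (d,)
```

If this sketch is right, the shapes line up as `(l + K - 1) × d` → `l × H` → `H` → `d`, but I'm not sure it matches what the figure shows.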

I don’t understand what **H** stands for, or how in the figure a matrix of dimensions `(l + K - 1) × d` is transformed into a matrix of dimensions `l × H`. Since it is a convolution operation, I was expecting an output matrix of dimensions `l × d`.