Need help understanding the #TagSpace paper

I'm reading the TagSpace paper and having some difficulty understanding Section 3 (Convolutional Embedding Model) and the accompanying figure. The paper says:

> A convolution layer is then applied to the l × d input matrix, which considers all successive windows of text of size K, sliding over the document from position 1 to l. This requires a further Kd × H weights and H biases to be learned. To account for words at the two boundaries of the document we also apply a special padding vector at both ends. In our experiments K was set to 5 and H was set to 1000. After the convolutional step, a tanh nonlinearity followed by a max operation over the l × H features extracts a fixed-size (H-dimensional) global feature vector, which is independent of document size. Finally, another tanh non-linearity followed by a fully connected linear layer of size H × d is applied to represent the entire document in the original embedding space of d-dimensions.

I don't understand what H stands for, or how in the figure a matrix of dimensions (l + K - 1) × d gets transformed into a matrix of dimensions l × H. Since it's a convolution operation, I was expecting an output matrix of dimensions l × d.

NVM, I got confused. H is the number of filters in the conv layer, so each of the l sliding windows produces H feature values, giving the l × H output. The math works now :sweat_smile:.
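For anyone else who trips over this, here's a minimal NumPy sketch of the shape flow as I understand it from the quoted paragraph (variable names are mine, not from the paper, and the convolution is written as an explicit flattened-window matmul just to make the Kd × H weight shape visible):

```python
import numpy as np

l, d, K, H = 20, 64, 5, 1000  # doc length, embed dim, window size, num filters

# Padded word embeddings: K - 1 extra padding rows so that sliding a
# window of K over the rows yields exactly l positions.
X = np.random.randn(l + K - 1, d)

# Conv layer: each window of K rows is flattened to K*d values and
# mapped to H features -> Kd x H weights plus H biases, as in the paper.
W = np.random.randn(K * d, H)
b = np.random.randn(H)

windows = np.stack([X[i:i + K].ravel() for i in range(l)])  # (l, K*d)
C = np.tanh(windows @ W + b)                                # (l, H)

# Max over the l positions gives a fixed-size global feature vector,
# independent of document length.
g = C.max(axis=0)                                           # (H,)

# Another tanh, then a fully connected H x d layer projects back into
# the original d-dimensional embedding space.
W2 = np.random.randn(H, d)
doc_vec = np.tanh(g) @ W2                                    # (d,)

print(C.shape, g.shape, doc_vec.shape)
```

So the l × H matrix in the figure is just "one H-dimensional filter response per window position", and the final document representation is back in d dimensions.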