Some vague questions about Transformers

Hi all. Can anyone clarify or comment on some confusions I have about Transformers?

The “Attention Is All You Need” paper describes multiple attention heads. As I read it, it divides the encoding of each word into subsets of features, and trains each head on one of the subsets. The heads then learn different aspects of the relationship between words. The results of the attention heads are concatenated to form the input to the decoder.
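For reference, here is a minimal sketch of multi-head attention as the paper formulates it: each head applies its own learned query, key and value projections to the full input vector rather than receiving a fixed slice of it, and the concatenated head outputs pass through one more output projection. It is pure Python for illustration only. The helper names, the tiny dimensions, and the random weights standing in for trained ones are all assumptions of this sketch, not anything from the paper.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(cols)] for _ in range(rows)]

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one head.

    x is (seq_len, d_model); each projection is (d_model, d_k),
    so the head reads every input feature, not a subset.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """Run every head on the same x, concatenate, then apply w_o."""
    outputs = [attention_head(x, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    concat = [[value for out in outputs for value in out[t]] for t in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
seq_len, d_model, n_heads = 5, 8, 2   # toy sizes chosen for this sketch
d_k = d_model // n_heads
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(n_heads)]
w_o = random_matrix(n_heads * d_k, d_model, rng)

y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))  # prints: 5 8
```

In this formulation the heads differ only in their projection weights, so any specialization has to come from how those weights are initialized and trained.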

Do these different heads learn different things because they see different subsets of the encoding, or because they are initialized differently? Presumably both play a part, but what is the primary reason that the architecture’s eight attention heads learn different things?

I’m asking because I am experimenting with features derived from a time series rather than word encodings as input. However I slice and dice it, multiple attention heads do not improve the loss. I have tried partitioning the features into subsets, one per head, as well as giving all features to each (differently initialized) attention head. Neither works better than giving all features to a single attention head.
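The two setups can be related with a little algebra, sketched here in the paper's notation. The index set $S_i$, the selection matrix $P_i$, and the tilde weights are notation introduced for this sketch, and it assumes each partitioned head still applies a learned linear projection to its slice.

```latex
% Full-input head: every head projects the whole feature vector.
\mathrm{head}_i = \mathrm{Attention}\!\left(X W_i^Q,\; X W_i^K,\; X W_i^V\right),
\qquad W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}

% Partitioned head: head i sees only the feature columns in S_i,
% selected by P_i \in \{0,1\}^{d_{\text{model}} \times |S_i|}.
\mathrm{head}_i^{\text{part}} = \mathrm{Attention}\!\left(X P_i \tilde{W}_i^Q,\; X P_i \tilde{W}_i^K,\; X P_i \tilde{W}_i^V\right),
\qquad \tilde{W}_i^Q, \tilde{W}_i^K, \tilde{W}_i^V \in \mathbb{R}^{|S_i| \times d_k}

% Since X P_i \tilde{W}_i^Q = X \left(P_i \tilde{W}_i^Q\right), the partitioned head
% equals a full-input head whose weight rows outside S_i are fixed at zero:
W_i^Q = P_i \tilde{W}_i^Q, \quad W_i^K = P_i \tilde{W}_i^K, \quad W_i^V = P_i \tilde{W}_i^V
```

Under that assumption, partitioning is a constrained special case of giving every head all the features, so it restricts what each head can represent rather than adding capacity.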

Any comments and insights are appreciated!