RNN Decoder Structures

I’m looking at a number of encoder/decoder structures for text generation (e.g. image captioning, seq2seq). In all of these models, the encoder takes some input (an image, text, etc.) and maps it down to a vector. That vector is then passed to the decoder, which runs an iterative generation process.

One big difference I’ve seen between models is how that encoder vector is used. In some models, the encoder vector initializes the hidden state of the decoder (which requires its dimension to match the hidden size, or a learned projection). In others, the decoder hidden state is initialized to zero, and the encoder vector is fed as the first input to the decoder (i.e. in place of a BOS token embedding).
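To make the two schemes concrete, here is a toy sketch with a vanilla RNN cell in NumPy (random placeholder weights, sizes chosen so the dimensions line up; not from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)
H, E = 8, 8  # hidden size and encoder-vector size (toy values; equal so scheme 1 needs no projection)

# Placeholder decoder weights for a vanilla RNN cell.
W_xh = rng.normal(size=(E, H)) * 0.1
W_hh = rng.normal(size=(H, H)) * 0.1

def rnn_step(x, h):
    """One decoder step: h' = tanh(x @ W_xh + h @ W_hh)."""
    return np.tanh(x @ W_xh + h @ W_hh)

enc_vec = rng.normal(size=E)  # the vector produced by the encoder
bos = np.zeros(E)             # stand-in for a BOS token embedding

# Scheme 1: encoder vector initializes the hidden state;
# the first input is the BOS embedding.
h1 = rnn_step(bos, enc_vec)

# Scheme 2: hidden state starts at zero; the encoder vector
# is fed as the first input, in place of the BOS embedding.
h2 = rnn_step(enc_vec, np.zeros(H))

print(h1.shape, h2.shape)
```

Note that the two schemes route the encoder vector through different weight matrices (`W_hh` vs `W_xh`), so they are not equivalent even for this simple cell.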

Does anyone know of any literature comparing these approaches?