Hi! I’ve fine-tuned a TransformerXL model starting from a Spanish pretrained model I found on GitHub.
I’d like to use the encoder to extract embeddings and build vector representations of documents. I’ve tried doing so, but I have a couple of questions.
I see the model has an encoderlayer at the beginning, and using just that works fine (it outputs a 410-dimensional vector for each word of the input), but that way I’m not using any of the other layers, the ones where the attention mechanism is implemented, which sit inside DecoderLayers. Should I be using those too? Are those layers part of the encoder, despite the name?
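For what it’s worth, here’s a toy sketch (plain NumPy, not the actual TransformerXL code) of why skipping the attention layers matters: an embedding lookup alone gives a token the same vector regardless of context, while self-attention makes each token’s vector depend on its neighbours. The table `E` and the single-head attention here are made-up stand-ins, just to illustrate the difference.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 100, 8
E = rng.normal(size=(vocab, d))   # toy embedding table (the lookup layer)

def self_attention(x):
    # minimal single-head self-attention; identity Q/K/V projections for brevity
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

sent_a = E[[5, 7, 9]]   # token 7 between tokens 5 and 9
sent_b = E[[1, 7, 2]]   # token 7 between tokens 1 and 2

# the embedding lookup gives token 7 the same vector in both contexts
print(np.allclose(sent_a[1], sent_b[1]))

# after self-attention, token 7's vector depends on its neighbours
print(np.allclose(self_attention(sent_a)[1], self_attention(sent_b)[1]))
```

So if you only run the first layer, you are effectively getting static (context-free) word vectors rather than contextual ones.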
The way I turn a whole string into a single 1230-dimensional vector is: I take all the 410-d vectors from the encoder’s output (one per input word) and concatenate their mean, their element-wise max, and the last word’s embedding (I saw that last one suggested online and I’m not sure why it helps). Is there a better way to summarize the information the encoder spits out, one that keeps as much information as possible about the original string?
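In case it helps, the mean/max/last pooling I’m describing looks like this (dummy random vectors standing in for the encoder’s per-word output, and my sequence length of 12 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(12, 410))   # stand-in: 12 tokens, one 410-d vector each

doc_vec = np.concatenate([
    H.mean(axis=0),   # mean over tokens        -> 410 dims
    H.max(axis=0),    # element-wise max        -> 410 dims
    H[-1],            # last token's embedding  -> 410 dims
])
print(doc_vec.shape)  # (1230,)
```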