In addition to the regular encoder-decoder training, the decoder gets trained separately on a large text corpus, which is a new, separate dataset. This dataset is usually generated from the original one using data-augmentation techniques, for example producing new valid SQL queries such as `SELECT * FROM table`. This essentially gives the decoder's loss function more signal to work with.
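A minimal sketch of what such template-based augmentation could look like. The schema, table names, and templates below are made up for illustration; a real pipeline would derive them from the original dataset:

```python
# Hypothetical schema; a real augmentation step would read this
# from the original dataset instead of hard-coding it.
TABLES = {
    "users": ["id", "name", "age"],
    "orders": ["id", "user_id", "total"],
}

def generate_queries():
    """Yield new, syntactically valid SQL queries from simple templates."""
    for table, columns in TABLES.items():
        yield f"SELECT * FROM {table}"
        for col in columns:
            yield f"SELECT {col} FROM {table}"
            yield f"SELECT COUNT({col}) FROM {table}"

queries = list(generate_queries())
```

Every generated string is valid SQL by construction, so the decoder sees many more examples of well-formed output than the original dataset contains.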
A slightly longer answer:
To make sure I don't miss anything: the basic sequence-to-sequence model that generates an output consists of:
- an encoder, which encodes the input into some intermediate representation;
- a decoder, which takes the encoder's output while keeping its previous states in memory.
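A toy illustration of that split, in plain Python. There is no real neural network here; the functions just stand in for the data flow between the two halves:

```python
def encode(tokens):
    """Encoder: compress the whole input into one fixed representation.
    A tuple summary stands in for a hidden-state vector."""
    return (len(tokens), sum(len(t) for t in tokens))

def decode(context, max_len=3):
    """Decoder: emit output tokens one at a time, feeding its own
    previous state back in at each step."""
    state = context
    output = []
    for step in range(max_len):
        token = f"tok{step}_{state[0]}"  # stand-in for an argmax over a vocabulary
        output.append(token)
        state = (state[0], state[1] + step)  # updated hidden state
    return output

context = encode(["show", "all", "users"])
generated = decode(context)
```

The key point is that the decoder only ever sees `context`, the encoder's compressed output, plus its own previous state.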
This works well for natural-language models, in terms of dimensionality reduction and capturing the genuinely important parts of the input, but it struggles to construct properly structured output.
The struggle happens in the decoder, where the mapping from the encoder's input to the final output takes place: the decoder does not have enough ground truth to compute a useful loss.
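Concretely, the decoder's per-step loss compares its predicted distribution against the ground-truth token at that step, so with little ground truth the training signal is weak. A minimal cross-entropy sketch, with a made-up three-token vocabulary and illustrative probabilities:

```python
import math

def decoder_loss(predicted_probs, target_ids):
    """Average per-step cross-entropy: -log p(correct token).

    predicted_probs: per-step probability distributions over the vocabulary.
    target_ids: ground-truth token ids, one per decoding step.
    """
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# Two decoding steps over a 3-token vocabulary (illustrative numbers).
preds = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [0, 1]
loss = decoder_loss(preds, targets)
```

More valid target sequences means more steps at which this loss is well-defined, which is exactly what the augmented dataset provides.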
Among all the other techniques, such as adding attention, RL agents, and combined loss functions for the decoder, training the decoder on more valid data has done considerably better in terms of performance.