Part 2 lesson 11 wiki

(Aza Raskin) #114

Could we try a meta-residual style architecture to improve this model?

That is: train a secondary seq2seq model that takes the output of this first seq2seq translator and then use that as a new input to predict the original sentence.

For example: the first seq2seq model goes

  • From: qu’ est -ce qui ne peut pas changer ? _eos_
  • To: what can not change change ? _eos_

And then train the second seq2seq model to go

  • From: what can not change change ? _eos_
  • To: what can not change ? _eos_
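A rough sketch of the training setup being proposed (all names here are hypothetical; translate_fr_en stands in for the already-trained first model): the second model's inputs are the first model's imperfect outputs, and its targets are the ground-truth sentences.

```python
# Build a training set for a second, "corrector" seq2seq model:
# inputs = first model's (imperfect) translations, targets = ground truth.

def build_corrector_dataset(pairs, translate_fr_en):
    # pairs: list of (french_source, english_target) strings
    # translate_fr_en: the trained first-stage seq2seq model (any callable)
    return [(translate_fr_en(src), tgt) for src, tgt in pairs]

# Toy stand-in for the first model that duplicates a word,
# mimicking the "what can not change change ?" failure above.
fake_translate = lambda src: "what can not change change ? _eos_"

pairs = [("qu' est -ce qui ne peut pas changer ? _eos_",
          "what can not change ? _eos_")]
dataset = build_corrector_dataset(pairs, fake_translate)
print(dataset[0])
# ('what can not change change ? _eos_', 'what can not change ? _eos_')
```

The second seq2seq would then be trained on these (noisy output, clean target) pairs, much like automatic post-editing in machine translation.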

(Alex) #115

So the concept of attention has some similarities to the capsule nets concept?

(Pavel Surmenok) #116

The previous one was end-to-end as well.

(Emil) #117

From what I’ve seen, the term “end-to-end” is used when someone discovers an “ordinary” neural-network way to do something that previously required hand-crafted features or a pipeline of separate components.

(Debashish Panigrahi) #118

OK, then it makes sense.

(Ananda Seelan) #119

Yes, because of the nifty automatic differentiation done by PyTorch.

(Sneha Nagpaul) #120

Yeah! Let’s write a compiler.

(Kaitlin Duck Sherwood) #121

I want to automatically generate tests!

(Even Oldridge) #122

Is there anything preventing attention from being used in language models? Could we have added the same weighted combination of the previous hidden outputs?

(Mike Kunz ) #123

Other use cases for sequence-to-sequence models: multi-class categorization.

Has anyone seen papers or approaches for the above in a supervised learning context?

(Kaitlin Duck Sherwood) #124

How long does the DeViSE model take to train?

(Ananda Seelan) #125

You really could do that. There is a family of networks called Transformers, which use nothing but attention (not even RNNs). Language models and seq2seq models trained with this architecture seem to be very competitive with traditional RNN-based language models.

(Ananda Seelan) #126

Could multi-class categorization really be modelled as seq2seq? I reckon seq2seq could be used when the output is a sequence of tokens (one after the other), whereas a multi-class output doesn’t really form a sequence (there is no dependency between classes). Please correct me if I’m wrong.

(Mike Kunz ) #127

Depends. The categories could be highly correlated. For example, the games category on Amazon could be ‘games - board’, ‘games - computer’, ‘games - strategy’, etc., and we might want to distill classes of goods into ‘games-computer-strategy’.

You could also model this just like a softmax classifier but replace the softmax with a sigmoid function.
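A minimal sketch of that last point (the category names are hypothetical): a softmax forces the class probabilities to compete and sum to 1, while a per-class sigmoid gives each label an independent probability, so several correlated labels can fire at once.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical logits for three correlated categories:
# games-board, games-computer, games-strategy
logits = np.array([-1.0, 2.0, 1.5])

print(softmax(logits))  # sums to 1 -> only one clear winner
print(sigmoid(logits))  # independent -> both "computer" and "strategy" > 0.5
```

With the sigmoid version you would train against binary cross-entropy per label, which is the standard multi-label setup.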

(Gerardo Garcia) #128

These two are mutually exclusive (xor):
is_multi=False, is_reg=True


(yinterian) #130

is_reg is probably enough. Look at the code.

(Mike Kunz ) #131

Paper on non-Euclidean (spherical) measures of similarity for word vectors. I’m like, “whoa”.

(Mike Kunz ) #132

Why KNN and not K-Means?

(Pavel Surmenok) #133

K-means is a clustering algorithm; here we just want to look up the nearest existing word vectors to a query, not invent new cluster centers.
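To make the distinction concrete, a tiny NumPy sketch with hypothetical toy vectors (not the actual DeViSE embeddings): KNN ranks the existing labelled vectors by distance to a query and returns the closest ones.

```python
import numpy as np

# Toy "word vectors" (hypothetical, 2-D for readability)
vectors = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
labels = ["cat", "kitten", "car", "truck"]

def knn(query, k=2):
    # KNN: rank the *existing* vectors by distance to the query
    dists = np.linalg.norm(vectors - query, axis=1)
    return [labels[i] for i in np.argsort(dists)[:k]]

print(knn(np.array([0.02, 0.0])))  # ['cat', 'kitten']
```

K-means, by contrast, would throw the labels away and compute new centroids, which is not what the image-to-word-vector lookup needs.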

(Gerardo Garcia) #134

The last portion was really good