Need help with awd-lstm translator model

Hello!

I’m trying to build a translator from Swedish to English that uses awd-lstm as both encoder and decoder, but I’m having some issues. Some are regarding the fastai implementation of awd-lstm, some are more seq2seq related. Anyway, here’s my current attempt:

p = dict(enc_vocab_len=len(vocab_sv.itos), dec_vocab_len=len(vocab_en.itos),
         emb_sz=300, am_hidn=1152, max_len=33, am_layers=3) #params

class Seq2Seq_AWD_LSTM_v0(nn.Module):
    def __init__(self, p):
        super().__init__()
        
        #awd-lstm has a built-in embedding, so I train that instead of using a premade one.
        #Unsure if this works, I think so though
        self.encoder = AWD_LSTM(p["enc_vocab_len"], p["emb_sz"], p["am_hidn"], p["am_layers"])
        self.decoder = AWD_LSTM(p["dec_vocab_len"], p["emb_sz"], p["am_hidn"], p["am_layers"])
        
        self.out = nn.Linear(p["emb_sz"], p["dec_vocab_len"])
        
        self.pad_idx = 1
        self.max_len = p["max_len"]
    
    def forward(self, inp):
        self.encoder.reset() #reset states
        self.decoder.reset() #reset states

        
        #returns (raw_outputs, outputs). raw_outputs=without dropout, outputs=with dropout(except last layer)
        enc_states_nodp, enc_states_dp = self.encoder(inp)
        
        #last_hidden is of shape <list>[am_layer][2]<torch-tensor>[1,16, emb_sz]
        #I don't know what the 2, 1, 16 represent. The bs = 33
        last_hidden = self.encoder.hidden
        
        dec_inp = something #[cell_state, hidden_state] ?
        
        output_sentence = []
        for i in range(self.max_len):
            # How to feed it into decoder awd-lstm?
            dec_states_nodp, dec_states_dp = self.decoder(dec_inp) 
            
            #predict word on last cell state (?) from decoder
            pred_word = self.out(dec_states_dp[-1])
            
            dec_inp = again_something #[cell_state, hidden_state] ?
                        
            #not sure how to get the softmaxed prediction for each batch here, max is a guess
            #append predicted word (index) to the translated sentence
            output_sentence.append(pred_word.max()) 

            #if all batch-iterations produce padding, break
            if (dec_inp==self.pad_idx).all(): break 
                
        return torch.stack(output_sentence, dim=1) #return sentence(s)

fastai-related questions:

  • What do raw_outputs and outputs mean? My current understanding is that raw_outputs is the cell states for every layer without dropout, and outputs is the cell states with dropout (except last layer). Is this correct?
  • How do I get the hidden states? Also, can someone explain encoder.hidden?
  • How can I feed cell_state and hid_state into the decoder awd-lstm (the default input is word representations, I think, not cell states)?
  • Can I use the awd_lstm embedding layer instead of an external one?

General seq2seq questions:

  • Do I predict a translated word using the lstm hidden state or the cell/context state? Meaning: what tensor do I actually pass into the linear layer to produce a prediction?

Any answer to any of my questions would be much appreciated, thanks!

raw_outputs and outputs are the outputs of every layer (so you probably want to keep the last one only), without/with dropout applied, as you said. outputs aren’t the cell states but the hidden states for each word.

encoder.hidden contains the final hidden states of each layer. It’s a list with n_layers elements, each of them being a tuple (hidden/cell) of size ndir,bs,nhid, where ndir is 1 or 2 depending on whether you chose a bidirectional model (should be 1 from what I see).
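
For concreteness, a minimal sketch of what those return values and encoder.hidden look like after one forward pass, assuming the fastai v1 AWD_LSTM and the names from the class above (non-bidirectional):

raw_outputs, outputs = self.encoder(inp)   # two lists with one tensor per layer, batch first in fastai v1
enc_seq = outputs[-1]                      # top layer (no hidden dropout is applied to the last layer): shape (bs, seq_len, emb_sz)

for h, c in self.encoder.hidden:           # one (hidden, cell) tuple per layer
    print(h.shape, c.shape)                # (ndir, bs, n_hid); the top layer uses emb_sz instead of n_hid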

Should be as simple as decoder.hidden = encoder.hidden, but it depends on what you want to pass. One thing that will make life complicated for you is that the hidden state is detached from its history at the end of the forward method of AWD_LSTM, and you don’t want that for your encoder, so you should remove that line from the fastai code.
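
As a hedged sketch of that hand-off inside forward(), relying on the fastai v1 AWD_LSTM internals (its bs attribute and the reset-on-batch-size-change check) and a hypothetical bos_idx start token for the decoder:

bs = inp.size(0)
self.encoder.reset()
raw_enc, enc_out = self.encoder(inp)         # fills self.encoder.hidden (detached at the end of forward,
                                             # so remove that line if gradients should reach the encoder)

self.decoder.bs = bs                         # stops the decoder's forward() from resetting the copied state
self.decoder.hidden = self.encoder.hidden    # per-layer (hidden, cell) tuples; shapes match since both models share sizes

dec_inp = inp.new_full((bs, 1), bos_idx)     # bos_idx: assumed <bos> token id; the decoder still takes token indices
raw_dec, dec_out = self.decoder(dec_inp)     # dec_out[-1] has shape (bs, 1, emb_sz)
pred = self.out(dec_out[-1][:, -1])          # (bs, dec_vocab_len); argmax this and feed it back in for the next step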

Probably, they are standard embeddings.
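
For reference, in the fastai v1 implementation the embedding table lives on the AWD_LSTM itself (an attribute confusingly also named encoder), so inside Seq2Seq_AWD_LSTM_v0 it can be reached as:

emb = self.encoder.encoder        # nn.Embedding(enc_vocab_len, emb_sz, padding_idx=pad_token)
print(emb.weight.shape)           # (enc_vocab_len, emb_sz), trained along with the rest of the model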


@sgugger, thanks a bunch!

I’m trying to incorporate AWD_LSTM into the encoder of a seq2seq model as well. Did you make any progress on this front and, if so, would you be willing to share it?

Well, both yes and no.

I did this as part of my bachelor’s thesis, where I tried to make a text summarizer using transfer learning and pretrained language models designed for text classification. The translator I attempted to make in this thread was mostly to learn how to work with fastai/PyTorch and also to make sure that I was on the right track (if I got a translator working, I could later plug in a summarization dataset).

But I needed some extra functionality from AWD_LSTM that wasn’t implemented in the fastai version, so I ended up writing my own using single-layer LSTMs from PyTorch (using the fastai AWD_LSTM as a guide and template). I could share that one with you if you’d like, but I doubt it would be of any more use than looking at the source code for the fastai AWD_LSTM.

Ah I see. Yeah I’ve found myself needing to write new functionality (here if anyone is interested) in order to get beyond text classification stuff. Currently I’m struggling to get the load_encoder() method to work though.

A better approach than mine might be to start with a general Learner and build up the functionality to do both language modelling and seq2seq, rather than trying to duplicate functions like text_classifier_learner() and bend them to do things they weren’t designed to do. I’ll probably look into that next.