Configuring a stateful LSTM cell in the language model

Are the LSTM cells in the language model unrolled or stateful? If they are unrolled, how do I configure them to be stateful?

That concept is only really relevant to static computation graph frameworks like TensorFlow. I can’t think of a reason why any RNNs in PyTorch wouldn’t be stateful.

I see. I’m asking because I’m trying to reproduce the text generation in the lang_model-arxiv notebook with a music lyrics dataset, and I’m getting a lot of the same repeated sentences:

.n't be no one , i 'm a been a one , i 'm a been a one , i 'm a been a one , i 'm a been a one , i 'm a been a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one , i 'm a one ,

I’ve tried increasing the bptt, but that didn’t seem to help. Maybe I need to train the model more, since it might be underfitting? Or, since lyrics have a lot of repeated choruses, maybe they’re more prone to this kind of repetition?

Repeating text during generation isn’t unexpected, especially, as you say, when repetition is common in the input. You’ll need to come up with a different way to generate the next word, rather than always picking the highest-probability prediction. E.g. if you’ve picked that word recently, pick the 2nd highest instead.
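One minimal way to implement that suggestion (this is just a sketch; `pick_next` and `recent` are my own names, not anything from the notebooks):

```python
import torch

def pick_next(logits, recent, k=2):
    # Take the top-k token indices by score and return the best one
    # that hasn't been generated recently.
    topk = torch.topk(logits, k)[1]
    for idx in topk.tolist():
        if idx not in recent:
            return idx
    # If every top-k candidate was used recently, fall back to the argmax.
    return topk[0].item()
```

You’d call it with something like `pick_next(p[-1], recent=set(last_n_tokens))`, updating `recent` as you generate.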


I had that problem a few times with various datasets when I only did one epoch of training.

Maybe more data or training would help. How many words are in your dataset?

As Jeremy suggested, you can try sampling from the predictions based on their probabilities. I use torch.multinomial to do that:

# Get the predictions for the last position
pred_last = pred[-1]
# Compute probabilities using softmax
pred_last_prob = pred_last.exp() / pred_last.exp().sum()
# Randomly sample one word index from that distribution
pred_last_choice = torch.multinomial(pred_last_prob, 1)


Hi @jeremy, I just wanted to make sure that the output of the language model decoder is not fed into a softmax layer, because when I initially tried to just take an exp of it, the values didn’t add up to 1. I also didn’t see a softmax operation in the source code.

Very well spotted! The trick here is that RnnLearner defines crit as cross_entropy, which in PyTorch actually includes the (log-)softmax. It’s a little numerical-stability trick, IIRC.
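To see concretely what “cross_entropy includes the softmax” means, here’s a small check (not from the notebooks): PyTorch’s `F.cross_entropy` on raw decoder outputs matches applying `log_softmax` yourself and then `nll_loss`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # raw decoder outputs, no softmax applied
targets = torch.tensor([1, 0, 3, 9])

# cross_entropy applies log_softmax internally, so these two are equal:
a = F.cross_entropy(logits, targets)
b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
assert torch.allclose(a, b)
```

That’s why the decoder can output unnormalized scores during training, even though they don’t sum to 1.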

I was working through the stateful LSTM model for generating text when I stumbled on the multinomial step in the test stage. Are we using torch.multinomial to introduce some randomness into the selection process, so as to avoid always picking the word with the highest probability?

def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1) # <----- ?????
    return TEXT.vocab.itos[to_np(r)[0]]

def get_next1(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0, 1)))
    r = torch.max(p[-1], 0)[1]
    #r = torch.topk(p[-1], 1)[1] # Both .max(x, 0) and .topk(x, 1) return the same result
    return TEXT.vocab.itos[to_np(r)[0]]

get_next1 is the approach we took in the basic RNN model. I was comparing the results of get_next1 to get_next, and they are certainly different. So multinomial is clearly doing something that I’m unable to wrap my head around.

Also, any idea why we take an exponent inside the multinomial call? Numerical reasons?

Hello, maybe you have already solved your problem, but I found an explanation here about using the exponent inside multinomial.
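The short version, assuming the stateful model’s forward pass ends with `F.log_softmax` (as in the lesson notebook): the model emits log-probabilities, so `.exp()` converts them back to probabilities before sampling. A minimal illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5)                    # raw scores from a decoder
log_probs = F.log_softmax(logits, dim=0)   # what a log_softmax output layer returns
probs = log_probs.exp()                    # .exp() recovers probabilities summing to 1

# torch.multinomial only needs nonnegative weights, so it samples the
# same distribution from probs (or any positive rescaling of them);
# passing the raw log_probs would be wrong since they are negative.
sample = torch.multinomial(probs, 1)
```

So the exponent isn’t a numerical trick here; it just undoes the log in the model’s log-softmax output.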