Language model text generation giving weird results in a consistent manner


(Theodoros Galanos) #1

Hello everyone,

So I have managed to train a reasonably good Bengali language model for research I am conducting. I used the typical workflow described in many notebooks, including the fast.ai ones, along with the Wikipedia dump for the Bengali language. My loss is around 4.25 (66 perplexity), which is ‘good’, at least compared to other results I have found online for this specific language and such a small dataset.
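
For readers comparing the two numbers quoted above: perplexity is the exponential of the per-token cross-entropy loss, assuming the loss is measured in nats (natural logarithm), which is the usual convention for PyTorch's cross-entropy. The helper names below are illustrative only and do not come from fast.ai. Under that assumption, a loss of exactly 4.25 corresponds to a perplexity of about 70, and a perplexity of 66 to a loss of about 4.19, so the two figures agree only approximately.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity for a per-token cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)

def loss_from_perplexity(ppl):
    """Inverse mapping: the loss (in nats) that yields a given perplexity."""
    return math.log(ppl)

print(round(perplexity(4.25), 1))          # 70.1
print(round(loss_from_perplexity(66), 2))  # 4.19
```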

However, when I try to generate text from the model, the results are rather poor. That in itself isn’t why I am posting. The perplexity above isn’t the best and, as Jeremy mentions in the class, even a minor change in the loss (e.g. taking it below 4) can turn a model that generates nonsense into one that generates something with decent structure. The problem is that no matter how much I train, even on different datasets, I get exactly the same pattern in my generated text: a certain number of words (as expected), followed by a series of tokens that keeps repeating if I increase the length of the generated text.

I have seen this in some other repos and implementations, but I don’t think it has been discussed. Since it happens consistently across many models, I thought I would ask whether the code I am using to generate text is the issue (although I simply copied that code from other notebooks). Below are the three scripts I use. I would appreciate it if someone could point out a glaring mistake that I can’t find, or share whether you’ve had a similar experience when training language models.

The first two scripts, taken directly (I believe) from fast.ai notebooks:

def gen_text(ss,topk):
    s = word_tokenize(ss)
    t = LongTensor([stoi[i] for i in s]).view(-1,1).cuda()
    t = Variable(t,volatile=False)
    m.reset()
    pred,*_ = m(t)
    pred_i = torch.topk(pred[-1], topk)[1]
    return [itos[o] for o in to_np(pred_i)]

def gen_sentences(ss,nb_words):
    result = []
    s = word_tokenize(ss)
    t = LongTensor([stoi[i] for i in s]).view(-1,1).cuda()
    t = Variable(t,volatile=False)
    m.reset()
    pred,*_ = m(t)
    for i in range(nb_words):
        pred_i = pred[-1].topk(2)[1]
        pred_i = pred_i[1] if pred_i.data[0] < 2 else pred_i[0]
        result.append(itos[pred_i.data[0]])
        pred,*_ = m(pred_i[0].unsqueeze(0))
    return result

Another script taken from other posts in the forum:

def sample_model(m, s, l=50):
    s = word_tokenize(s) 
    t = LongTensor([stoi[i] for i in s]).view(-1,1).cuda()
    t = Variable(t,volatile=False)
    m[0].bs=1
    m.eval()
    m.reset()
    res,*_ = m(t)
    print('...', end='')

    for i in range(l):
        n=res[-1].topk(2)[1]
        n = n[1] if n.data[0]==0 else n[0]
        word = itos[n.data[0]]
        print(word, end=' ')
        if word=='<eos>': break
        res,*_ = m(n[0].unsqueeze(0))

    m[0].bs=bs

Thanks in advance for any help!

Kind regards,
Theodore.


#2

There are several things to check (based on the last function, ‘sample_model’):
1, check that the input ‘t’ can be converted back to the original sentence ‘s’. If itos or stoi has problems, the model will not produce any meaningful sentence;
2, check that the model output ‘res’ has vocab_size entries for each position, i.e. one score for every word in the vocabulary;
3, manually inspect the output ‘res’ for a simple short sentence, especially one that appears in the training corpus.
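
The first two checks can be sketched without the model at all. This is a minimal illustration with a hypothetical five-word vocabulary standing in for the real itos/stoi; the function names `round_trip` and `output_width_ok` are made up for this example, and whitespace splitting stands in for `word_tokenize`.

```python
# Hypothetical toy vocabulary; in the thread these would be the real itos/stoi.
itos = ["<unk>", "<pad>", "he", "can", "be"]
stoi = {word: idx for idx, word in enumerate(itos)}

def round_trip(sentence, stoi, itos, unk_idx=0):
    """Map tokens to ids and back; unknown tokens come back as itos[unk_idx]."""
    ids = [stoi.get(tok, unk_idx) for tok in sentence.split()]
    return " ".join(itos[i] for i in ids)

def output_width_ok(scores_last_step, itos):
    """Check 2: the scores for one position should have one entry per word."""
    return len(scores_last_step) == len(itos)

print(round_trip("he can be", stoi, itos))   # he can be
print(round_trip("he can fly", stoi, itos))  # he can <unk>
print(output_width_ok([0.0] * len(itos), itos))
```

If the round trip turns most of the prompt into `<unk>`, the tokenizer and the vocabulary do not match, and the generated text will not be meaningful regardless of the loss.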


(Marcin Kardas) #3

Which encoder are you using, MultiBatchRNN? Bear in mind that it resets its hidden state before every inference. You could therefore use another encoder, or simply provide the full context on every call.
For Polish, using sample_model() (I actually had to add another .unsqueeze(0)), I got degenerate results:
On może... być może być może być... (He can… be can be can be…). By providing the full context in every iteration:

for i in range(l):
    res,*_ = m(t)
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    word = itos[n.item()]
    print(word, end=' ')
    if word == '<eos>': break
    t = torch.cat((t, n.unsqueeze(0).unsqueeze(0)))

I get more natural sentences: On może... być celem leczenia . <eos> (He can be… a target of treatment).
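
Why re-feeding the full context matters can be shown with a toy stand-in for the model. The sketch below is not the fast.ai encoder: `RULES` and `next_token` are hypothetical and simply mimic a model that keeps no state between calls, so it only ever sees what is passed in. Feeding back just the last word makes it behave like a bigram model and loop, while feeding back the growing sequence lets it finish the sentence.

```python
# Hypothetical lookup table: the longest matching suffix of the input decides
# the next word. This imitates a model whose hidden state is reset per call.
RULES = {
    ("he", "can"): "be",
    ("can", "be"): "a",
    ("be", "a"): "target",
    ("a", "target"): "<eos>",
    ("be",): "can",   # with only "be" visible, the toy model guesses "can"
    ("can",): "be",
}

def next_token(context):
    """Stateless prediction from the last two words, else the last word."""
    for k in (2, 1):
        key = tuple(context[-k:])
        if len(key) == k and key in RULES:
            return RULES[key]
    return "<eos>"

def generate(prompt, max_words, full_context):
    context = list(prompt)
    out = []
    word = next_token(context)
    for _ in range(max_words):
        out.append(word)
        if word == "<eos>":
            break
        # Either grow the context or keep only the newest word.
        context = (context + [word]) if full_context else [word]
        word = next_token(context)
    return out

print(generate(["he", "can"], 6, full_context=False))
# ['be', 'can', 'be', 'can', 'be', 'can']
print(generate(["he", "can"], 6, full_context=True))
# ['be', 'a', 'target', '<eos>']
```

The cost of the full-context approach is that every step reprocesses the whole sequence, which is acceptable for short samples.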


(Theodoros Galanos) #4

Thanks, I tried the suggested code with some success. I think adding the context does help! However, I ran into errors in a few places that I had to adjust. For example, n.item() doesn’t work for me (probably a version issue?), even though it should, since n has only one value. Additionally, the double unsqueeze gives me a shape mismatch between the two Variables. The code works and produces OK results without it, but I’ll try to make it work out of curiosity.

Kind regards,
Theodore.


(Marcin Kardas) #5

Yeah, I forgot that fast.ai uses pytorch<0.4 by default, and pytorch 0.4 (which I’m using) treats scalars differently:

import torch
print(torch.__version__)
print(type(torch.FloatTensor([1])[0]))

0.3.1.post2
<class 'float'>

vs

0.4.1
<class 'torch.Tensor'>
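
One way to write sampling code that runs on both versions is to read the index through a small helper. This is only a sketch: `scalar_value` is a made-up name, and the two stand-in classes below are hypothetical substitutes for a one-element tensor, used so the example runs without PyTorch installed. It relies on what the thread reports: `.item()` works on 0.4 and not on the older version, where `n.data[0]` is used instead.

```python
def scalar_value(x):
    """Return the Python number held by a one-element tensor-like object."""
    if hasattr(x, "item"):   # the style that works on pytorch 0.4
        return x.item()
    return x.data[0]         # the older style used in the original functions

# Hypothetical stand-ins for one-element tensors in each style.
class NewStyle:
    def item(self):
        return 7

class OldStyle:
    data = [7]

print(scalar_value(NewStyle()), scalar_value(OldStyle()))  # 7 7
```

With such a helper, `word = itos[scalar_value(n)]` would replace both `itos[n.item()]` and `itos[n.data[0]]`.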


(Karl) #6

I have found that the issue is the use of torch.topk to generate predictions. Using torch.multinomial instead yields more variety in the predictions.
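
The difference between the two strategies can be sketched without PyTorch. Always picking the top-scoring index is deterministic, so the same context always yields the same next word, which makes repetition loops easy to fall into. Drawing an index from the softmax distribution is the behaviour `torch.multinomial` provides when given a probability vector. The functions below are a dependency-free illustration, with `random.choices` standing in for that call; the `temperature` parameter is an addition for illustration and is not mentioned in the thread.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities (max subtracted for stability)."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(scores):
    """Index of the highest score: what taking the top-1 amounts to."""
    return max(range(len(scores)), key=scores.__getitem__)

def sample_pick(scores, rng, temperature=1.0):
    """Draw one index with probability proportional to its softmax weight."""
    probs = softmax(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

scores = [2.0, 1.5, 0.5]   # made-up scores for a three-word vocabulary
rng = random.Random(0)
print({greedy_pick(scores) for _ in range(200)})       # always {0}
print({sample_pick(scores, rng) for _ in range(200)})  # several indices
```

In PyTorch the sampling step would look roughly like `torch.multinomial(probs, 1)` applied to the softmax of the last position's scores.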


(Theodoros Galanos) #7

Thanks! I will try that one.