Hello everyone,
So I have managed to train a reasonably good Bengali model for some research I am conducting. I used the typical workflow described in many notebooks, including the fast.ai ones, with the Wikipedia dump of the Bengali language. My loss is around 4.25 (66 perplexity), which is ‘good’, at least compared to other results I have found online for this language and such a small dataset.
However, when I try to generate text from the model, the output is rather poor. That in itself is not really why I am posting. The perplexity above is not great, and as Jeremy mentions in class, even a small change in the loss (e.g. getting it below 4) can take a model from generating nonsense to generating something with decent structure. The problem is that no matter how much I train, even on different datasets, I get the exact same pattern in the generated text: a certain number of words (as expected), followed by a short series of tokens that simply repeats if I increase the length of the generated text.
I have seen this in some other repos and implementations, but I don’t think it has been discussed. Since it happens so consistently across many models, I wanted to ask whether the code I am using to generate text is the issue (although I have simply copied it from other notebooks). I am posting below the three scripts I use. I would appreciate it if someone could point out a glaring mistake I have made but cannot find, or share whether you have had a similar experience when training language models.
The first two scripts are taken directly (I believe) from the fast.ai notebooks:
# assumes the usual notebook context: the trained model m, the vocab mappings stoi/itos,
# word_tokenize, and the old-fastai helpers (LongTensor, Variable, to_np)
def gen_text(ss, topk):
    s = word_tokenize(ss)                                    # tokenize the seed string
    t = LongTensor([stoi[i] for i in s]).view(-1,1).cuda()   # map tokens to ids, shape (seq_len, 1)
    t = Variable(t, volatile=False)
    m.reset()                                                # clear the hidden state
    pred,*_ = m(t)
    pred_i = torch.topk(pred[-1], topk)[1]                   # indices of the topk next-word candidates
    return [itos[o] for o in to_np(pred_i)]
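For reference, this is roughly how I call it; the seed string below is just a placeholder for my actual Bengali seed text:

print(gen_text('some seed text', 10))   # prints the top-10 candidate next words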
def gen_sentences(ss, nb_words):
    result = []
    s = word_tokenize(ss)
    t = LongTensor([stoi[i] for i in s]).view(-1,1).cuda()
    t = Variable(t, volatile=False)
    m.reset()
    pred,*_ = m(t)
    for i in range(nb_words):
        pred_i = pred[-1].topk(2)[1]
        # skip the first two vocab indices (special tokens) by taking the second-best prediction
        pred_i = pred_i[1] if pred_i.data[0] < 2 else pred_i[0]
        result.append(itos[pred_i.data[0]])
        pred,*_ = m(pred_i[0].unsqueeze(0))   # feed the chosen token back into the model
    return result
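I call this one in much the same way, joining the returned tokens back into a string (again, the seed is just a placeholder):

print(' '.join(gen_sentences('some seed text', 50)))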
The third script is taken from other posts on the forum:
def sample_model(m, s, l=50):
    s = word_tokenize(s)
    t = LongTensor([stoi[i] for i in s]).view(-1,1).cuda()
    t = Variable(t, volatile=False)
    m[0].bs = 1        # run the encoder with batch size 1
    m.eval()
    m.reset()
    res,*_ = m(t)
    print('...', end='')
    for i in range(l):
        n = res[-1].topk(2)[1]
        n = n[1] if n.data[0] == 0 else n[0]   # skip index 0 (a special token) by taking the second-best prediction
        word = itos[n.data[0]]
        print(word, end=' ')
        if word == '<eos>': break
        res,*_ = m(n[0].unsqueeze(0))          # feed the chosen token back into the model
    m[0].bs = bs       # restore the original batch size
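One thing I notice is that all three scripts decode greedily (always taking the top or second-best prediction), and I wonder whether that alone could produce the looping. Below is a rough, untested sketch of the sampling-based alternative I have in mind; it reuses the same m, stoi, itos, word_tokenize and bs as above, and the temperature parameter temp is just my own addition, not something from the original notebooks:

def sample_model_multinomial(m, s, l=50, temp=0.75):
    # same setup as sample_model above
    s = word_tokenize(s)
    t = LongTensor([stoi[i] for i in s]).view(-1,1).cuda()
    t = Variable(t, volatile=False)
    m[0].bs = 1
    m.eval()
    m.reset()
    res,*_ = m(t)
    result = []
    for i in range(l):
        # turn the last prediction into probabilities and sample, instead of taking the argmax
        probs = F.softmax(res[-1] / temp, dim=0).data        # F is torch.nn.functional, as imported in the notebooks
        n = torch.multinomial(probs, 1)                      # sample one token index
        word = itos[n[0]]
        if word == '<eos>': break
        result.append(word)
        res,*_ = m(Variable(n.unsqueeze(0)))                 # feed the sampled token back in
    m[0].bs = bs
    return ' '.join(result)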
Thanks in advance for any help!
Kind regards,
Theodore.