I have a question about the next word predicted by the language models we are building. It seems I cannot achieve the same results Jeremy achieved with the Arxiv notebook. I created a Portuguese language model and did indeed get better results when fine-tuning it on a law text corpus and using it for classification. The Portuguese LM achieved the same level of perplexity that others have reported here for other languages.
But I am intrigued by the way the language model works.
In the Arxiv notebook, when the context changed (e.g. category csni vs. category cscv), the model came up with different predictions for the next word for the same seed sequence, depending on the category context.
When I try to emulate that using my Portuguese LM, or even the Law LM fine-tuned from it, I always get the same prediction based only on the last word of the seed sentence; it doesn't matter what came before. If the last token of my seed sentence is "to", I always get the same predictions for the next word (exactly the same probabilities), regardless of the previous context. I used the same prediction code as the Arxiv notebook, and when I ask it to predict the next 50 words the sentences are coherent, so I guess the code is working OK. Still, the predictions depend only on the last word of the seed, never on the earlier context.
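To illustrate what I mean, here is a minimal sketch of the kind of prediction loop I would expect to be context-sensitive. It is plain PyTorch, not the actual notebook code: `model`, `stoi`, and `itos` are placeholder names, and I am assuming a model that takes `(input, hidden)` and returns `(logits, hidden)`:

```python
import torch
import torch.nn.functional as F

def next_word_probs(model, seed_words, stoi, itos, topk=5):
    """Feed the whole seed through the model, carrying the hidden state,
    then return the top-k candidates for the next word."""
    model.eval()
    hidden = None                                 # fresh state once, before the seed
                                                  # (assumes the model initializes
                                                  # zeros when hidden is None)
    with torch.no_grad():
        for w in seed_words:                      # feed tokens one at a time...
            x = torch.tensor([[stoi[w]]])         # shape (batch=1, seq=1)
            logits, hidden = model(x, hidden)     # ...threading the state forward
        probs = F.softmax(logits[0, -1], dim=-1)  # distribution over the vocab
    p, idx = probs.topk(topk)
    return [(itos[int(i)], float(pi)) for pi, i in zip(p, idx)]

# Two seeds ending in the same word should give different top candidates
# if the hidden state really carries the earlier context:
# next_word_probs(model, "the lawyer appealed to".split(), stoi, itos)
# next_word_probs(model, "she flew from Lisbon to".split(), stoi, itos)
```

If I run something like this with two seeds that end in the same word, I would expect different probabilities; getting exactly the same ones makes me suspect the hidden state is being reset just before the final prediction.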
Is that expected? What is different? Could it be the 1cycle policy? In the Arxiv notebook Jeremy took many epochs to converge, while with the 1cycle policy my model converged (and started overfitting) in 15 epochs. Could that have something to do with the different results?
Thanks in advance.
Best regards to all,