This may be a very basic question, but I am a bit confused about it.
When we create a language model (LM), we give it some input representation (of, say, 3 or 8 characters) and the model predicts the next character (or next word). We then compare the predicted character/word with the actual one, calculate the loss, and run BPTT to compute the gradients that are used to update the weights.
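To make that loop concrete, here is a minimal sketch of it as I understand it, in plain Python. Everything in it is an assumption made for illustration: the toy text, the names `W`, `predict` and `train_epoch`, and the learning rate. It is also deliberately simplified: the "model" is just a table of scores conditioned on the single previous character, so there is no recurrence and the update is ordinary backprop rather than BPTT. A real RNN would replace the table lookup with a hidden state carried across time steps.

```python
import math

# Toy corpus and vocabulary (hypothetical, for illustration only).
text = "hello hello hello"
vocab = sorted(set(text))
idx = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

# W[i][j] is the score (logit) for "character j follows character i".
# All zeros at the start, so every prediction begins as a uniform distribution.
W = [[0.0] * V for _ in range(V)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(ch):
    """Conditional distribution over the next character given `ch`."""
    return softmax(W[idx[ch]])

def train_epoch(lr=0.5):
    """One pass over the text with per-example gradient descent."""
    total_loss = 0.0
    for prev, nxt in zip(text, text[1:]):
        i, t = idx[prev], idx[nxt]
        probs = softmax(W[i])
        total_loss += -math.log(probs[t])  # cross-entropy against the actual next char
        # Gradient of cross-entropy w.r.t. the logits is probs - one_hot(target).
        for j in range(V):
            grad = probs[j] - (1.0 if j == t else 0.0)
            W[i][j] -= lr * grad
    return total_loss / (len(text) - 1)

before = predict("h")[idx["e"]]
losses = [train_epoch() for _ in range(50)]
after = predict("h")[idx["e"]]

print(f"P('e' | 'h') before training: {before:.3f}, after: {after:.3f}")
print(f"mean loss, first epoch: {losses[0]:.3f}, last epoch: {losses[-1]:.3f}")
```

In this sketch the prediction is simply the softmax of the scores for the current input, and each update raises the score of the character that actually came next while lowering the others, so the probability of 'e' after 'h' climbs from the uniform starting value.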
Now my question is: how exactly does the model come up with a prediction based on the input, and how does this prediction change after optimisation (i.e. how exactly is the model learning)?
I understand it's a conditional probability based on the input, but can anyone help with an example or point to a resource?