Conceptual Doubt (Input and Output for Language Model)

Hello,

Sorry if this question is very obvious.

I have a document:

[I ate the food]
[I am eating today]
[I cannot do it]
[What are you doing]

During the fine-tuning step, what does one training example look like?

That is, are the tokens [“I”, “ate”, “the”] the input, with “food” as the target to be predicted?
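
To make my guess concrete, here is a small sketch of what I imagine. The function name `prefix_target_pairs` is made up for illustration, I split on whole words rather than using a real tokenizer, and I do not know whether this is what fine-tuning actually does:

```python
def prefix_target_pairs(tokens):
    """Pair every prefix of a sentence with the token that follows it.

    Hypothetical helper: this only illustrates the guess in my question,
    it is not taken from any library.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# First sentence of my document, split into whole words (an assumption).
examples = prefix_target_pairs(["I", "ate", "the", "food"])

for prefix, target in examples:
    print(prefix, "->", target)
```

The last pair produced is the one I am asking about: the input “I”, “ate”, “the” with the target “food”.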

Is there a window size here? I am confused.

Please let me know how the actual training works.

For example, in word2vec there is a sliding window that decides which neighbouring words each word is trained with.
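
This is roughly what I mean by the word2vec sliding window. It is a minimal sketch of skip-gram style pair generation as I understand it; the function name `skipgram_pairs` and the `window` parameter are my own choices, not the API of any library:

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs using a symmetric sliding window.

    For each position i, every token at most `window` positions to the
    left or right of i is paired with the token at i.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["I", "ate", "the", "food"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

With `window=1`, each word is only paired with its immediate neighbours. My question is whether language model fine-tuning has a comparable window, or whether the examples are built differently.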