Data preparation for a language model

(Nate) #1

I’m trying to understand how the data preparation for a language model works. The get_batch function in LanguageModelLoader returns

self.data[i:i+seq_len], self.data[i+1:i+1+seq_len].contiguous().view(-1)

I know the purpose of the model is to predict the next word given the preceding sequence, so I was expecting a sample to be a sequence and the label to be the single word following that sequence, say data[0:50] and data[50]. However, it seems that the sample and label have the same length, just shifted over by one, so something like data[0:50] and data[1:51]. I can’t quite wrap my mind around how this works.
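To make the shapes concrete, here is a toy version of that slicing (my own sketch with made-up token ids, not the library code):

import torch

data = torch.arange(10)          # pretend these are encoded word ids
i, seq_len = 0, 5
x = data[i:i + seq_len]          # tensor([0, 1, 2, 3, 4])
y = data[i + 1:i + 1 + seq_len]  # tensor([1, 2, 3, 4, 5])

Every element of y is the word that follows the corresponding element of x, which is the part I was expecting to be a single label.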


(Michael) #2

The language model tries to predict the next word after each word in the sequence, not just after the last one, so the target is simply the input sequence shifted over by one position.

You’ll find a nice illustration of the language model setup on the left side of the figure here: https://twitter.com/thom_wolf/status/1186225108282757120?s=21
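In other words, every one of the 50 input positions produces its own prediction, and the shifted sequence supplies the 50 targets for them. A minimal PyTorch sketch of that idea (illustrative shapes only, not the fastai internals):

import torch
import torch.nn.functional as F

data = torch.randint(100, (51,))   # hypothetical corpus of 51 token ids from a 100-word vocab
x, y = data[:50], data[1:51]       # same shift-by-one pairing as get_batch
logits = torch.randn(50, 100)      # stand-in for the model's output: one score per vocab word at each position
loss = F.cross_entropy(logits, y)  # all 50 predictions are scored, not just the final one

Because each position needs its own target, get_batch returns a whole shifted sequence rather than a single label; as far as I can tell, the .contiguous().view(-1) just flattens the batched targets so they line up with the flattened predictions in the loss.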
