Batch and bptt in language model data matrix

(Divyansh Jha) #1

This question is from the 4th video at this time.
It may seem like a very beginner question, but I seriously need help with this.

In the above image, Jeremy says we segment the reviews into 64 words each and stack them one on top of another. This way the matrix becomes 10 million x 64. Then we take bptt, i.e. 70 rows, to feed into the neural network. Seen this way, the first 64 words of the first review are in the first row, and the second 64 are in the second row of the 70 x 64 matrix.

But later in the same video Jeremy says,
that the first column is the first 75 words of the first review.

What we learned in the first example is that 64 was the length/window of words. That seems totally contradictory. Please help me clear this up: where am I going wrong?

Thanks in advance.


Deep Learning Brasília - Lesson 4
(Jeremy Howard (Admin)) #2

I said that we create 64 equal-sized batches, not batches of size 64! :slight_smile: So the batches are of size nwords/64.


(Divyansh Jha) #3

@jeremy But if we break up the sentences and stack them, then the next word will be in the next column, not in the next row.


(Jeremy Howard (Admin)) #4

True, but if we have a million rows and we split them into 64 groups, then this happens only 64 times out of 1,000,000, which isn't really a problem.



Searching for this question, I reached this post. Despite Jeremy's answer, I'm still not confident in my understanding.

Let’s take a basic example:

Review 1:
Beautifully shot and amazing action sequences. Directed brilliantly with amazing acting. Makes up what it lacks in a plot and is really spectacular

Review 2:
Am I the only one who thinks this is a bit overrated? The set design and cinematography are phenomenal, but the rest of it didn’t seem all that special to me.

beautifully shot and amazing action sequences . directed brilliantly with amazing acting . makes up what it lacks in a plot and is really spectacular am i the only one who thinks this is a bit overrated ? the set design and cinematography are phenomenal , but the rest of it did n’t seem all that special to me .

We have 60 words. Let’s create 6 batches of size 10:

Then, if bptt were 4, we would take the first 4 rows. Am I wrong?
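The arrangement above can be sketched in plain Python, using token indices in place of the 60 words (the variable names here are illustrative, not from the fastai source):

```python
# Concatenate the reviews into one stream, cut it into 6 equal segments,
# and stack the segments side by side as the COLUMNS of the matrix.
tokens = list(range(60))                 # stand-ins for the 60 tokens above

n_segments = 6                           # "batch size" in the lesson's terminology
seg_len = len(tokens) // n_segments      # 10 tokens per segment

segments = [tokens[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

# matrix[row][col] reads each segment downwards: shape 10 rows x 6 columns.
matrix = [[segments[col][row] for col in range(n_segments)]
          for row in range(seg_len)]

bptt = 4
first_batch = matrix[:bptt]              # the first 4 rows, shape 4 x 6
```

Reading any single column top to bottom recovers 10 consecutive words of the original stream, which is why taking the first bptt rows preserves word order within each column.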



(Divyansh Jha) #6

No, you are not wrong.

I was also confused after reading Jeremy's answer, but then I watched the video again and everything became clear.


(Abhiram MV) #7

The terms batch_size and number of batches are used interchangeably. I watched the video and understood it clearly after referring to this thread. :slight_smile: I usually take "size of a batch" as my definition of batch_size. In this case, however, the size of each batch (nwords/batch_size) is calculated from the batch_size.


(Paul) #8

Maybe this is beating a dead horse, but I really wanted to make sure I understand this too.

It seems that in the example Jeremy used, we took all the reviews and concatenated them together, getting a single array 64 million words long. Since we are using a vocabulary, we can then replace the words with integer values (words whose frequency is <10 are replaced by the integer for "UNKNOWN"), which allows us to keep the data in matrix format and not rely on characters.

Since we are creating 64 equal-sized batches, this is equivalent to dividing the original 64-million-long array into 64 separate 1-million-long arrays. Let's label these B1, B2, …, B64 (each a row vector 1 million long). The new matrix we've created is [B1^T B2^T … B64^T]: we transpose the row vectors (making them column vectors) and then concatenate them (maybe not the correct word to use?) side by side, so that we have a matrix of size 1,000,000 x 64.
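This splitting-and-transposing can be sketched in NumPy (a minimal sketch; `batchify` is a hypothetical helper name, not the fastai/torchtext source, and the stream is shortened to 6,400 stand-in tokens):

```python
import numpy as np

def batchify(ids, n_batches):
    # Trim the ragged tail so the stream divides evenly into n_batches
    # segments B1..Bn, then lay those segments out as the columns of
    # the matrix [B1^T B2^T ... Bn^T], as described above.
    seg_len = len(ids) // n_batches
    ids = ids[: seg_len * n_batches]
    return ids.reshape(n_batches, seg_len).T   # shape: seg_len x n_batches

stream = np.arange(6_400)     # stand-in for the 64-million-word stream
mat = batchify(stream, 64)    # 100 x 64: column j holds words 100*j .. 100*j + 99
```

The `reshape` lays each segment out as a row, and the transpose turns it into a column, matching the [B1^T B2^T … B64^T] layout.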

So there is no problem with BPTT, because if you look at the diagram, we are taking a chunk of data that is BPTT x BS in size. Each column (of length BPTT) preserves the word order of the sentences in a review, and we are theoretically processing sentences from 64 reviews at a time (maybe more if we are at an overlap between reviews). We hit the problem only at the very tops and bottoms of the columns, which happens just BS times however big the data is.

Just to clarify: I'm guessing we actually run into this problem more often, since our data is a concatenation of reviews, so there will be cases where our BPTT x BS chunk contains sentences from different reviews within a column. With such large data, though, this becomes less of a problem.


(José Fernández Portal) #9

Since the data matrix size is 10,000,000 x 64, every time I call next(iter(md.trn_dl)) I expect to get the next bptt rows out of the 10,000,000. However, I always get the FIRST bptt rows (only the bptt changes, randomly). It seems to me that only the first rows are being used for training, while the other 10,000,000 are being thrown away… I would appreciate a clarification about this.
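One plain-Python detail that may explain this behaviour (a guess, not a fastai-specific answer): every call to `iter()` on an iterable builds a fresh iterator, so `next(iter(md.trn_dl))` restarts from the beginning each time and always yields the first batch. Keeping a single iterator and calling `next()` on it advances through successive chunks:

```python
# Stand-in list playing the role of md.trn_dl's sequence of batches.
batches = [f"batch_{i}" for i in range(5)]

first_a = next(iter(batches))   # a brand-new iterator: yields the first item
first_b = next(iter(batches))   # another brand-new iterator: the first item again

it = iter(batches)              # keep ONE iterator to walk through the batches
step1 = next(it)                # first item
step2 = next(it)                # second item -- the iterator has advanced
```

The same pattern applies to any DataLoader-style iterable: store the iterator once, then call `next()` on it repeatedly.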



Just in case anyone comes to this thread still not understanding what's going on: I feel @jeremy should distinguish the wording between "batch size" and "segments". He uses "segment" once when describing it, but then switches to "batch size", and I think that is the source of the confusion.

The process is: you concatenate the reviews into one long stream of tokens. Then you cut that stream, left to right, into 64 equal segments (with 10M tokens per segment, that gives a matrix of 10M rows and 64 columns). Now, the bptt part is the actual batch size: the batch is ~70 x 64, and in the lesson it is 75 x 64 (the 75 chosen randomly by torchtext). So the batch is the first 75 of the 10M rows, which is what you see there. Again, @jeremy says "batch size 64", where I feel it's more "segment size 64", batch size 75. The next thing shown is a rank-1 tensor of length 4800 (75 x 64), which is basically the same 75 x 64 matrix but starting from the 2nd row, flattened. @martinpella did a nice job visualizing it.
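The process described in this post can be sketched at toy scale (a hypothetical `get_batch` helper, with 4 segments instead of 64 and a 40-token stand-in stream): the stream is cut into segments that become columns, and each batch pairs bptt rows of inputs with the same rows shifted down by one, flattened.

```python
n_seg, bptt = 4, 3
tokens = list(range(40))                  # stand-in for the long token stream
seg_len = len(tokens) // n_seg            # 10 rows

# Row r, column c holds token c*seg_len + r: each segment reads downwards.
matrix = [[tokens[c * seg_len + r] for c in range(n_seg)]
          for r in range(seg_len)]

def get_batch(i):
    # Input: bptt rows starting at row i (shape bptt x n_seg).
    # Target: the same rows shifted down by one, flattened (length bptt*n_seg),
    # i.e. "the same matrix but starting from the 2nd row".
    x = matrix[i:i + bptt]
    y = [tok for row in matrix[i + 1:i + 1 + bptt] for tok in row]
    return x, y

x0, y0 = get_batch(0)   # x0 is 3 x 4; y0 is a flat list of 12 targets
```

Each target element is simply the token that follows the corresponding input element in its column, which is what a language model is trained to predict.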



What is the target variable for training this model?