Maybe this is beating a dead horse, but I really wanted to make sure I understand this too.
It seems that with the example Jeremy used, we took all the reviews, concatenated them together, and got a single array 64 million words long. Since we are using a vocabulary, we can then replace each word with an integer id (words whose frequency is below 10 get replaced by the id for "UNKNOWN"), which lets us keep the data in matrix form rather than working at the character level.
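Here is a minimal sketch of that numericalization step (the helper name `numericalize` and the toy corpus are my own; I've lowered the frequency cutoff to 2 so the UNKNOWN replacement actually shows up on a tiny example):

```python
from collections import Counter

def numericalize(tokens, min_freq):
    """Map each token to an int id; tokens rarer than min_freq become UNK (id 0)."""
    counts = Counter(tokens)
    vocab = {}
    for w, c in counts.items():
        if c >= min_freq:
            vocab[w] = len(vocab) + 1  # ids start at 1; 0 is reserved for UNKNOWN
    return [vocab.get(w, 0) for w in tokens], vocab

# toy corpus; with min_freq=2 the one-off words "plot" and "terrible" become UNK
toks = "great movie great plot terrible movie".split()
ids, vocab = numericalize(toks, min_freq=2)
print(ids)  # [1, 2, 1, 0, 0, 2]
```

The real pipeline would of course use the 64M-word stream and a cutoff of 10, but the mapping is the same idea.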
Since we are creating 64 equal-sized batches, this is equivalent to dividing the original 64-million-long array into 64 separate arrays of 1 million words each. Let's say we label these B1, B2, …, B64 (each a row vector 1 million long). Then the new matrix we've created is [B1^T B2^T … B64^T], i.e. we transpose the row vectors into column vectors and stack them side by side, giving a matrix of size 1,000,000 x 64.
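That split-and-transpose is just a reshape. A tiny NumPy sketch with stand-in numbers (24 tokens and BS=4 instead of 64 million and 64, so it's easy to eyeball):

```python
import numpy as np

stream = np.arange(24)   # stand-in for the 64M-token id stream
bs = 4                   # batch size (64 in the lecture)
seq_len = len(stream) // bs

# split into bs equal contiguous rows (B1..B4), then transpose so each
# becomes a column: result has shape (seq_len, bs)
batched = stream[: seq_len * bs].reshape(bs, seq_len).T
print(batched.shape)     # (6, 4)
print(batched[:, 0])     # first column = B1, the first contiguous chunk: [0 1 2 3 4 5]
```

Reading down any column gives you consecutive words from the original stream, which is exactly the property the columns-preserve-word-order argument below relies on.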
So there is no problem with BPTT, because if you look at the diagram, we are taking a chunk of data that is BPTT x BS in size. Each column (of length BPTT) preserves the word order of the sentences in a review, so in effect we are processing sentences from 64 reviews at once (maybe more, if a column happens to straddle the boundary between two reviews). The word order only breaks at the very top and bottom of the matrix, where the end of one 1-million-word slice meets the start of the next, and that only happens on the order of BS times no matter how big the dataset is.
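To make the BPTT x BS chunking concrete, here's the same toy matrix sliced the way I understand the training loop to slice it (toy sizes again; this is my sketch of the idea, not fastai's actual loader code):

```python
import numpy as np

stream = np.arange(24)                    # stand-in for the token id stream
bs, bptt = 4, 2                           # tiny stand-ins for 64 and ~70
seq_len = len(stream) // bs
batched = stream.reshape(bs, seq_len).T   # shape (seq_len, bs) = (6, 4)

# one BPTT x BS chunk off the top of the matrix:
x = batched[0:bptt]                       # inputs, shape (bptt, bs)
y = batched[1:bptt + 1]                   # targets: the next word in each column
print(x[:, 0], y[:, 0])                   # column 0 stays in original word order
```

Each column of `x` is a run of consecutive words, and `y` is the same run shifted by one word, which is what the language model is trained to predict.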
Just to clarify, I'm guessing we actually run into this problem more often than that, since our data is a concatenation of reviews: some BPTT x BS chunks will have a column containing sentences from two different reviews. I guess with such a large dataset this becomes less of a problem.