Overlapping windows that cross documents/examples in RNN / NLP training data? (Lesson 4/6 question)

In lectures 6 (CharNN) and 4 (Language Model) we take windows of text input and use them to train a series of embeddings and a network to predict the next character/word. When we’re constructing the input data it seems commonplace to concatenate one example to the next. This leads to situations where windows of the training data look like:

This is the end of example 1. This is the beginning of example 2.

which taken to extremes could be:

This is a sentence about dogs and cats. Deep learning is the topic of this sentence.

And now our model is associating dogs and cats with deep learning. Sometimes there’s an EOF or EOL marker, but in many cases there isn’t any separation.

With this in mind I have a few questions:

  • Is the assumption here that we’re going to be training the model on such a large corpus that these cross-example windows don’t matter? Presumably because the order of examples is random and the ratio of in-example windows to cross-example windows is high? If so, is this something we need to check for?
  • If some of the examples in our training data are short, does that guide our selection of window size?
  • Do we ever want to put special spacer characters in between our examples? And if so, what impact does that have on the embeddings and models that are built?
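
To make the "ratio" in the first two questions concrete, here is a rough sketch of how one might check it. It assumes stride-1 sliding windows over the concatenated corpus and counts a window as cross-example if it contains tokens from more than one example. The function name `cross_boundary_fraction` and its arguments are made up for illustration; they are not from the course code.

```python
def cross_boundary_fraction(doc_lengths, window):
    """Fraction of stride-1 sliding windows over the concatenated corpus
    that contain tokens from more than one example.

    doc_lengths: length (in characters or tokens) of each example.
    window:      window size in the same units.
    """
    total_len = sum(doc_lengths)
    total_windows = total_len - window + 1
    if total_windows <= 0:
        return 0.0  # corpus is shorter than a single window
    # An example of length n fully contains max(n - window + 1, 0) windows.
    inside = sum(max(n - window + 1, 0) for n in doc_lengths)
    return 1 - inside / total_windows


# Two examples of 10 tokens each, window of 5: 4 of the 16 windows cross.
print(cross_boundary_fraction([10, 10], 5))  # 0.25
# Examples shorter than the window: every window crosses a boundary.
print(cross_boundary_fraction([3, 3], 5))    # 1.0
```

The second call shows why short examples matter for window size: once examples are shorter than the window, every window spans a boundary.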

I get the sense that this isn’t well studied and is taken for granted in NLP as the way to do things, but thinking across domains, it seems to me that the computer vision equivalent would be padding each image with the previous input image, which doesn’t sound like a very good idea.

Does anyone have any good references on this or any thoughts?

I’m not sure if this is explicitly written down anywhere, but a common approach is to only put end of line characters at the end of a document. That way, the model can learn this is a “change of subject”. But you can use any marker you like. In IMDB, IIRC, every review is on a single line, so this works fine.
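
As a rough sketch of that idea (not the course's actual preprocessing), one could collapse any newlines inside each document and then append a single newline as the end-of-document marker before concatenating. The names `EOD`, `build_stream`, and `make_windows` are hypothetical, chosen just for this example.

```python
EOD = "\n"  # assumed end-of-document marker; any unused character would do


def build_stream(docs, marker=EOD):
    """Concatenate documents so the marker appears only at document ends."""
    # Collapse all internal whitespace (including newlines) to single spaces,
    # so a newline in the stream always means "document boundary".
    cleaned = [" ".join(d.split()) for d in docs]
    return "".join(c + marker for c in cleaned)


def make_windows(stream, window):
    """(input, target) pairs for next-character prediction, stride 1."""
    return [(stream[i:i + window], stream[i + window])
            for i in range(len(stream) - window)]


docs = ["This is a sentence about dogs\nand cats.",
        "Deep learning is the topic of this sentence."]
stream = build_stream(docs)
pairs = make_windows(stream, 8)
```

Windows still cross from one document into the next, but each such window now contains the marker, which gives the model something to learn the "change of subject" from.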
