Lesson 4 IMDB data, sequence of words fed into RNN

anuclearbomb · February 23, 2018, 2:05am

Hi all,

A question on the sequence of words fed into the RNN.

In the image below, the words (the corpus of IMDB reviews) are first arranged in continuous columns. However, when the words are flattened into a 1d vector, they are flattened horizontally (i.e. 35, 7, 33 … 22, 3885, 21587). If I am not wrong, this is not a comprehensible sequence of words to feed into the RNN. Could I know why this is so?


davecazz · February 23, 2018, 2:32am

Hi Nuke,

The words are tokenized before being sent into the RNN, so each number represents a word, space, or other language element.

Not sure how it tokenized your example, but 35 could be a space, 12 could be the word ‘and’, etc.

anuclearbomb · February 23, 2018, 2:38am

Hi Dave,

Thanks for the tip.

This is a screenshot from Jeremy’s notebook. In the video lecture, he said that the corpus was arranged continuously in vertical columns, i.e. column 1 is one long run of text and column 2 continues where column 1 ends. Hopefully I am not wrong about this.
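To make the column layout concrete, here is a minimal sketch in plain Python. The `batchify` helper and the toy stream are hypothetical names for illustration only, not the notebook's actual code; the assumption is simply that the corpus is one long list of token ids that gets cut into equal contiguous chunks, one chunk per column.

```python
def batchify(stream, n_cols):
    """Cut a token stream into n_cols contiguous chunks, one chunk per column."""
    col_len = len(stream) // n_cols  # tokens per column; leftover tokens are dropped
    cols = [stream[c * col_len:(c + 1) * col_len] for c in range(n_cols)]
    # Transpose so that each row holds one time step across all columns.
    return [[cols[c][r] for c in range(n_cols)] for r in range(col_len)]


stream = list(range(12))       # stand-in for the token ids of the whole corpus
matrix = batchify(stream, 3)   # 4 rows x 3 columns
for row in matrix:
    print(row)
```

Reading down any single column gives consecutive tokens of the stream, and the next column picks up where the previous one stopped, which matches the description from the lecture.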

However, the lower part of the image shows that the 77x64 array is flattened into a size 4928 vector, in the order 35, 7, 33 … 22, 2885, 21587. My question is that this doesn’t seem to be a comprehensible (at least to a human) way of arranging the words fed into the RNN. Is there a reason for this?

Thanks.

davecazz · February 23, 2018, 3:03am

So they have the same size; the second one is just flattened. 77x64 = 4928.

In the first matrix, sequences are ordered from top (first word) to bottom (77th word), while each column is one sequence in the batch, and the columns get computed in parallel.

The second matrix represents the first matrix shifted by one element in each sequence, so the top row of the first matrix is shifted out.

If you look at the sequence of numbers in the second matrix, 35, 7, 33…, and then look at the numbers in the second row of the first matrix, 35, 7, 33…, you’ll see they match. When the second matrix was flattened, the elements were ordered one row after the other.
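The shift-and-flatten step can be sketched in plain Python. This is a toy illustration under assumptions: `make_batch` and the small `data` matrix are made up for the example, and the real notebook works with tensors rather than nested lists.

```python
def make_batch(data, start, seq_len):
    """Return (inputs, targets) for one batch taken from a rows-by-columns token matrix."""
    x = data[start:start + seq_len]                 # inputs: seq_len rows
    shifted = data[start + 1:start + 1 + seq_len]   # same window moved down by one row
    y = [tok for row in shifted for tok in row]     # flattened one row after the other
    return x, y


# Toy data: 5 time steps (rows) x 3 parallel sequences (columns).
data = [[10, 20, 30],
        [11, 21, 31],
        [12, 22, 32],
        [13, 23, 33],
        [14, 24, 34]]

x, y = make_batch(data, 0, 4)
print(x[1])    # second row of the inputs
print(y[:3])   # first three flattened targets: the same values
print(y[0::3]) # every third target: the next-word sequence for column 0
```

The flattened vector is not meant to be read left to right as a sentence. Each entry is the "next word" for the input at the same row and column, and stepping through it with a stride equal to the number of columns recovers each column's sequence. Flattening presumably just lines the targets up with the model's flattened predictions for the loss.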

anuclearbomb · February 23, 2018, 3:08am

Guess I can’t describe my problem clearly without actually running the notebook to get an index-to-words demonstration.

I will try to phrase my question in a better way once I can run the notebook. Thanks a lot Dave!

davecazz · February 23, 2018, 3:31am

If you’re saying that looking at those numbers is like looking at the matrix, and that it’s impossible to easily see what they are, then yes. It’s not really human readable.

But you can look at the TEXT.vocab.itos and stoi mappings (index to string and string to index) to figure out what words are actually represented in the array. There are a couple of samples in the notebook for converting arrays to strings.
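As a rough sketch of how such a lookup works, here is a toy version in plain Python. The vocabulary below is invented for illustration and the `encode`/`decode` helpers are hypothetical; with the real notebook you would index into TEXT.vocab.itos and TEXT.vocab.stoi instead.

```python
# Hypothetical toy vocabulary; the real one comes from TEXT.vocab.
itos = ["<unk>", "<pad>", "the", "movie", "was", "great"]   # index -> string
stoi = {tok: i for i, tok in enumerate(itos)}               # string -> index


def encode(tokens):
    """Map tokens to ids, falling back to <unk> for unseen tokens."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]


def decode(ids):
    """Map ids back to a readable string."""
    return " ".join(itos[i] for i in ids)


ids = encode("the movie was great".split())
print(ids)
print(decode(ids))
```

Decoding one column of the input matrix this way (rather than one row) should give readable text, since each column is a contiguous stretch of the corpus.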