Using an Embedding

The lesson 4 notebook has the following code:

user_in = Input(shape=(1,), dtype='int64', name='user_in')
u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)
movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
m = Embedding(n_movies, n_factors, input_length=1, W_regularizer=l2(1e-4))(movie_in)

x = merge([u, m], mode='dot')
x = Flatten()(x)
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse')
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1,
          validation_data=([val.userId, val.movieId], val.rating))

  • If an embedding is supposed to be just a lookup from an integer to a vector, how can the embeddings be the only components of something that can be called a model? I.e., where are the weights and how are they used? Are the vectors the “weights” being trained, since this is solving a matrix factorization problem?

  • What is the purpose of the merge and flatten?

  • The Keras Embedding function takes an input, defined as:
    Input() is used to instantiate a Keras tensor.
    A Keras tensor is a tensor object from the underlying backend
    (Theano or TensorFlow), which we augment with certain
    attributes that allow us to build a Keras model
    just by knowing the inputs and outputs of the model.
    What does that mean? Why not just use arrays?

The model is identical to what I show in the collaborative filtering spreadsheet. Specifically:

  • The merge() has mode='dot', which means it takes the dot product of the user vector and the movie vector
  • The embedding vectors are initially random
  • The optimizer (Adam here, a variant of SGD) updates the elements of the embedding vectors so that the dot product moves closer to the correct rating.

Try running Solver on the spreadsheet to see this in action.
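For anyone without the spreadsheet handy, here is a minimal sketch of the same idea in plain Python. It assumes made-up toy ratings, uses plain per-sample SGD rather than Adam, and leaves out the L2 regularizer; the names (`ratings`, `predict`, `train`) are hypothetical and not from the notebook.

```python
import random

random.seed(0)

n_users, n_movies, n_factors = 4, 5, 3

# Toy (user, movie, rating) triples; hypothetical data, not from the notebook.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 3, 2.0), (2, 4, 5.0), (3, 1, 4.0), (3, 4, 3.0)]

# One small random vector per user and per movie.
# These vectors are the only trainable weights in the model.
user_emb = [[random.uniform(-0.1, 0.1) for _ in range(n_factors)]
            for _ in range(n_users)]
movie_emb = [[random.uniform(-0.1, 0.1) for _ in range(n_factors)]
             for _ in range(n_movies)]


def predict(u, m):
    # The embedding lookup is just row indexing; the merge step is a dot product.
    return sum(a * b for a, b in zip(user_emb[u], movie_emb[m]))


def mse():
    return sum((predict(u, m) - r) ** 2 for u, m, r in ratings) / len(ratings)


def train(epochs=300, lr=0.01):
    for _ in range(epochs):
        for u, m, r in ratings:
            err = predict(u, m) - r
            for k in range(n_factors):
                # Gradients of (prediction - rating)^2 w.r.t. each vector element,
                # both computed from the values before this update.
                grad_u = 2 * err * movie_emb[m][k]
                grad_m = 2 * err * user_emb[u][k]
                user_emb[u][k] -= lr * grad_u
                movie_emb[m][k] -= lr * grad_m


loss_before = mse()
train()
loss_after = mse()
print("MSE before:", loss_before, "after:", loss_after)
```

The point of the sketch is that there is no separate weight matrix: the rows being looked up are what gets trained.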

Keras tensors are just arrays, but they are arrays of a type that Keras (actually Theano) knows how to create expressions out of, and (more importantly) knows how to create the derivative of those expressions, and (most importantly!) knows how to compile to run on the GPU. Normal numpy arrays can’t do these things.
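To make "expressions" and "derivatives of expressions" a bit more concrete, here is a toy illustration in plain Python. This is not Theano's or Keras's API; the classes `Var`, `Const`, `Add`, and `Mul` are hypothetical and only show the idea of building an expression first, deriving a second expression from it, and plugging in numbers last.

```python
class Const:
    def __init__(self, value):
        self.value = value

    def eval(self, env):
        return self.value

    def grad(self, wrt):
        return Const(0.0)


class Var:
    """A named placeholder: it has no value until one is supplied in env."""

    def __init__(self, name):
        self.name = name

    def eval(self, env):
        return env[self.name]

    def grad(self, wrt):
        return Const(1.0 if self is wrt else 0.0)


class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) + self.b.eval(env)

    def grad(self, wrt):
        # Sum rule
        return Add(self.a.grad(wrt), self.b.grad(wrt))


class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) * self.b.eval(env)

    def grad(self, wrt):
        # Product rule
        return Add(Mul(self.a.grad(wrt), self.b), Mul(self.a, self.b.grad(wrt)))


# Build the expression (u * m - r)^2 without any numbers yet.
u, m, r = Var("u"), Var("m"), Var("r")
err = Add(Mul(u, m), Mul(Const(-1.0), r))
loss = Mul(err, err)

# The derivative is itself an expression, derived from the first one.
dloss_du = loss.grad(u)

env = {"u": 2.0, "m": 3.0, "r": 5.0}
print(loss.eval(env), dloss_du.eval(env))
```

A plain array only holds numbers, so there is nothing to differentiate; a symbolic placeholder records how the result was built, which is what lets the backend derive gradients and compile the whole thing.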

I’m a bit confused by some of the code in Lesson5.ipynb.

The text is first ‘truncated’ at the 5,000th word (the index for word number 5,000 and all the words after it are set to the same value; this happens after the words are sorted, so this generally makes sense).

Then, we use pad_sequences to front-pad the vector with zeros, now making any corpus 50 words long. So it looks like we really shrunk our text down, making the prior step irrelevant? Or is this only for reviews that were less than 50 words long?

After that, we pass the text into the Embedding() layer, which has an input dimension of 5,000 and an output dimension of 32. So, maybe the 2nd step was irrelevant and we are back with the result of our first step?

I’m not sure what is going on with these 3 things and why each one is so different. Thanks!

Step one decreases the number of unique words to 5,000. Any rare words are replaced with a sentinel value.

Step two truncates (or pads) each sentence to make the sentences the same size. The length of a sentence (step 2) is entirely orthogonal to the size of the vocabulary (step 1). For example, we have sentences that are each 50 words long, using a vocabulary of 5,000.

Step three is what we saw in the embedding spreadsheet - take a look at the word vectors example there. We define the number of columns of the embedding matrix here. This is unrelated to the length of the sentence (step 2), but it is related to the first step - step one defines the number of rows of the embedding matrix, step 3 defines the number of columns.
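The three steps can be sketched on toy data in plain Python. The sizes here are deliberately tiny stand-ins (10 for the 5,000-word vocabulary, 6 for the sequence length, 4 for the 32 columns), the reviews are made up, and `pad_or_truncate` is a hypothetical helper that assumes the front-padding, front-truncating behaviour of Keras's pad_sequences defaults.

```python
import random

random.seed(0)

vocab_size = 10   # stands in for 5,000
seq_len = 6       # stands in for the padded length
n_factors = 4     # stands in for 32

# Hypothetical reviews, already encoded as word indexes sorted by frequency
# (a low index means a frequent word).
reviews = [[3, 14, 2, 27, 5], [8, 1, 40, 2, 9, 6, 11, 4], [7, 2]]

# Step 1: cap the vocabulary. Every rare word gets the same sentinel index.
sentinel = vocab_size - 1
capped = [[min(w, sentinel) for w in review] for review in reviews]


# Step 2: make every review exactly seq_len items long.
def pad_or_truncate(seq, maxlen, value=0):
    seq = seq[-maxlen:]                         # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq  # front-pad short ones with zeros


padded = [pad_or_truncate(review, seq_len) for review in capped]

# Step 3: embedding lookup. Rows = vocabulary size, columns = n_factors.
embedding = [[random.random() for _ in range(n_factors)]
             for _ in range(vocab_size)]
embedded = [[embedding[w] for w in review] for review in padded]

print(padded)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

Step 1 changes which values can appear, step 2 changes how many values there are per review, and step 3 turns each value into a row of the matrix, so the result is (number of reviews) x (sequence length) x (number of columns).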

Let me know if I’ve helped a little, or just confused things further!

I didn’t realize this was about sentences. I didn’t catch the word “sentence” in the video; I’m not sure it was spoken! What is the benefit to having standardized sentence lengths? Why do we care about sentences, and not just the whole corpus?

(This section is somewhat difficult for me to follow; I think the use of the word ‘truncate’ has thrown me a bit. It is spoken in the context of replacing rare words with sentinel values, and also in the context of sending in only 50 words per minibatch.)

The plot thickens a bit here:
“We have as input, 25,000 sequences of 500 integers. So we take each integer and replace them with a lookup into a 500 column matrix.”

Did you mean to say a lookup into a 32 column matrix?

By ‘sentence’ I mean ‘sequence of consecutive characters’. I can see how that’s confusing. It would be good to find a better word!

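The output of the lookup has size 32 (the number of columns of the embedding matrix). The input has size 5,000 (the number of rows, one per word in the vocabulary).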

Try stepping through the code a line at a time and look at what is the input and output of each line - I’m hopeful you’ll find that explains things. These lessons are really designed to be something that you have alongside you as you go through the code. Here’s a process you may find helpful: