Lesson 4 discussion

@ccrome Good question! To account for a new user or movie, you will need to add an additional column to your input. The question is: how to initialize this new column? This is known as the “cold-start problem” (if you want to Google for more info on it), and in practice, people typically use meta-data to make a best initial prediction for a user or movie that they don’t have any ratings for yet.

1 Like

Thanks for the reply Rachel, but I’m not sure I understand. I understand making a best initial guess of where a new user might fall in your model. You might guess that the new user is closest to some already existing user.

What I’m unclear on is: what’s the actual mechanism for adding new user IDs and movie IDs to the system?

At about 32:22 of the lesson 5 video, somebody asked the same question, but I’m not sure Jeremy answered it. What was suggested was to use something like model.predict([userID, movieID]). While that works fine for user IDs and movie IDs that already exist, I think the question is about new user IDs and new movie IDs.

Let’s take a concrete example from the lesson 4 notebook. Say my model is built and trained with 100 users (n_users), and I build a corresponding embedding:

u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)

Then, a new user logs in and rates a couple movies. My n_users is now 101.

Now what? Do you reach into the model and replace u with a new embedding, something like:

u2 = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)  # n_users is 101 now
u2.weights[0:100] = u.weights

and somehow try to stick u2 into the model? Clearly that’s not going to work, right? Because the next layer up is expecting the u embedding to have 100 weights.

Same question goes for new movies obviously. When a new movie comes out, do you need to re-train the whole model, or are there ways to maybe reserve space for userIDs and movieIDs that don’t exist yet?

Thanks,
-Caleb

2 Likes

That’s exactly the approach, and indeed it will work since the dimensionality of the output of the embedding is unchanged. Only the input changes.
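A minimal sketch of that weight-copying in Keras 1 (u_layer is a hypothetical handle on the trained 100-user Embedding layer, n_factors is the same as in the original model, and the downstream layers would be rebuilt on top of u2 with their old weights copied over in the same way):

    import numpy as np
    from keras.layers import Input, Embedding
    from keras.regularizers import l2

    old_weights = u_layer.get_weights()[0]            # shape: (100, n_factors)
    n_users = old_weights.shape[0] + 1                # 101 after the new user appears

    user_in = Input(shape=(1,), dtype='int64', name='user_in')
    u_layer2 = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))
    u2 = u_layer2(user_in)                            # builds the layer so its weights exist

    # copy the learned rows; the new user's row keeps its random init until they have ratings
    new_weights = u_layer2.get_weights()[0]
    new_weights[:old_weights.shape[0]] = old_weights
    u_layer2.set_weights([new_weights])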

3 Likes

Oh, of course! The output dimensionality is the important bit for the succeeding layers. Thanks for the clarification.

-Caleb

Jeremy told us about saddle points in Lesson 4, but I was not able to understand them intuitively. I searched the net for the definition, and it said a saddle point of a non-convex function is a point where the gradient is 0 yet it is not a local minimum. For example, y = x1^2 - x2^2 has gradient 0 at (0,0), yet (0,0) is not a minimum of this function.

Now, momentum as explained by Jeremy is combining a decaying average of previous gradients with the current gradient. That brings me to two questions:

  1. What if we initialize our learning from the saddle point itself? The gradient would be 0 in one direction, and momentum couldn’t help either, since the starting point is the saddle point, so we would never move in the direction of zero gradient. Is this the correct interpretation? Is this possible?

  2. Also, I am visualizing a saddle point as a bottle cut in half lengthwise (so that the height of the bottle stays the same) lying on a horizontal surface. In such a case the gradient in one direction is 0, while in the other direction a virtual ball just oscillates. Am I correct in thinking about it this way?

Sorry if the question appears naive…

@mlwhiz

  1. Even if the gradient is 0 in one direction, you will move in the other directions and then re-evaluate the gradient (at which point it will most likely not be 0 in that same direction anymore). You are assuming that the gradient for each dimension is independent of the values of the other dimensions, which is not true.

  2. The situation that arises very often in DL (and that momentum helps combat) is a canyon where the canyon floor slopes downwards. Say there’s a stream running along the bottom. You would like to follow the stream directly (which would take you to lower and lower altitudes), but typical gradient descent has you scale up and down the canyon sides a bit as you follow the stream at the bottom. Momentum overcomes this, because the up-and-down movement along the canyon sides cancels out.
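A minimal numpy sketch of the standard momentum update with made-up oscillating gradients (not code from the course, just to illustrate the cancelling):

    import numpy as np

    def momentum_step(w, v, grad, lr=0.1, mu=0.9):
        v = mu * v - lr * grad      # decaying average of past (scaled) gradients
        return w + v, v

    # dim 0: gradient flips sign across the canyon; dim 1: small but consistent slope downstream
    grads = [np.array([1.0, -0.1]), np.array([-1.0, -0.1])] * 10

    w, v = np.zeros(2), np.zeros(2)
    for g in grads:
        w, v = momentum_step(w, v, g)

    print(w)   # dim 0 stops making progress (the flips largely cancel); dim 1 keeps growing downstream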

In the statefarm.ipynb notebook provided, there is a step where the convolutional-layer features are pre-computed.

  1. Here predict_generator is being run on 'batches', which is set up with shuffle=True (the default), since it was created as:

    batches = get_batches(path+'train', batch_size=batch_size)
    conv_feat = conv_model.predict_generator(batches, batches.nb_sample)
    conv_val_feat = conv_model.predict_generator(val_batches, val_batches.nb_sample)
    conv_test_feat = conv_model.predict_generator(test_batches, test_batches.nb_sample)

  2. The trn_labels are obtained using get_classes(), which internally uses get_batches() with shuffle=False:

    (val_classes, trn_classes, val_labels, trn_labels,
     val_filenames, filenames, test_filenames) = get_classes(path)

  3. With the above, when we do

    bn_model.fit(conv_feat, trn_labels, batch_size=batch_size, nb_epoch=2,
        validation_data=(conv_val_feat, val_labels))

    aren’t we passing conv_feat generated from shuffled input with labels coming from non-shuffled input? That looks wrong to me.

Is the above observation correct, or am I missing something?
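If so, I’d guess the fix is to create the training batches with shuffle=False before precomputing the features, so the order matches what get_classes() returns. An untested sketch, using the same helpers:

    batches = get_batches(path+'train', batch_size=batch_size, shuffle=False)
    conv_feat = conv_model.predict_generator(batches, batches.nb_sample)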

Thanks,
Ajith

Still hanging on, and I just listened to the lesson 4 video. The only remaining question I have about collaborative filtering is the number of latent factors: why 5? why 50? But I guess that is a hyper-parameter to be found.
P.S. I spent some time this week setting up my own process for using spot instances (of p2.xlarge), as using normal AWS instances otherwise gets quite expensive. I could post my step-by-step instructions here if there’s interest.

(re-posted because I wanted to add it to the Lesson 4 thread)

Question:

How can a model obtain a different validation accuracy when tested on the same validation set?

Edit:

The StackOverflow answer and Keras issue give insight into this question.


Elaboration:

The statefarm-sample notebook shows that this can happen if one shuffles the validation set. When I didn’t shuffle the validation set, the model obtained the same validation accuracy every time. This confuses me.

You’ll see below that the generator is called rnd_batches and the number of items in the generator is rnd_batches.nb_sample. When using model.evaluate_generator, we used rnd_batches.nb_sample as val_samples, “[the] total number of samples to generate from generator before returning.” So it seems we’re evaluating the model on the entire validation set, which to me says that shuffling the validation set shouldn’t change the result: all we’re doing with the model is forward propagation (i.e. we’re not changing the model), and a forward propagation is independent of previous forward propagations as long as we haven’t updated the weights, which we haven’t.
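As a sanity check on that last claim, a minimal sketch (with a made-up batch matching the model’s input_shape=(3,224,224)) showing that repeated forward passes with fixed weights are deterministic:

    import numpy as np

    x = np.random.rand(8, 3, 224, 224).astype('float32')   # hypothetical batch of 8 images
    p1 = model.predict(x)
    p2 = model.predict(x)
    assert np.allclose(p1, p2)   # same weights + same inputs -> same predictions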

Scenario:

We have a trained linear model.
We test that model on a validation set.
It receives a validation accuracy of 0.70.
We shuffle the validation set.
We test the model on the same validation set.
It receives a validation accuracy of 0.69.

Here’s the model definition:

model = Sequential([
        BatchNormalization(axis=1, input_shape=(3,224,224)),
        Flatten(),
        Dense(10, activation='softmax')
    ])
model.compile(Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])

Here’s the testing of the model:

rnd_batches = get_batches(path+'valid', batch_size=batch_size*2, shuffle=True)
val_res = [model.evaluate_generator(rnd_batches, rnd_batches.nb_sample) for i in range(10)]
np.round(val_res, 2)

array([[ 1.  ,  0.7 ],
       [ 0.99,  0.71],
       [ 1.01,  0.69],
       [ 0.97,  0.7 ],
       [ 1.02,  0.69],
       [ 1.02,  0.68],
       [ 0.99,  0.7 ],
       [ 1.01,  0.69],
       [ 1.  ,  0.7 ],
       [ 1.  ,  0.7 ]])

Without shuffling the validation set:

#shuffle=False

array([[ 1. ,  0.7],
       [ 1. ,  0.7],
       [ 1. ,  0.7],
       [ 1. ,  0.7],
       [ 1. ,  0.7],
       [ 1. ,  0.7],
       [ 1. ,  0.7],
       [ 1. ,  0.7],
       [ 1. ,  0.7],
       [ 1. ,  0.7]])

I’ve asked on StackOverflow.

I’ve also asked on keras’s github.

3 Likes

Another question:

Can someone say more about finding a sample size that is big enough?

Where “big enough” means that data augmentations that improve the performance of models on the sample also improve the performance of models on the whole dataset.

Elaboration:

In this segment of the Lesson 4 video, Jeremy says:

One obvious question would be, ‘How do you decide how big of a sample to use?’

And what I did was I tried a few different sizes of samples for my validation set, and I then said, ‘okay, evaluate the model on the validation set, but for a whole bunch of randomly sampled batches. So do it ten times.’ So then I looked and I saw how the accuracy changed. Right, and so with the validation set set at 1000 images, my accuracy changes from like .47 or .48 to .51. So it’s not changing too much. It’s small enough that I think, ‘okay, I can make useful insights using a sample of this size.’

Here are the validation accuracies Jeremy is talking about:

array([[ 4.4 ,  0.49],
       [ 4.57,  0.49],
       [ 4.48,  0.48],
       [ 4.28,  0.51],
       [ 4.66,  0.48],
       [ 4.5 ,  0.49],
       [ 4.46,  0.49],
       [ 4.51,  0.47],
       [ 4.45,  0.51],
       [ 4.47,  0.49]])

The training set had 1568 images. The validation set had 1002 images.
Here’s the notebook.

However, I observed the following:

Smaller sample sizes had a lower variance in the validation accuracies.

This suggests to me that the variance in validation accuracies doesn’t tell us whether a sample size is big enough.
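(To be concrete, by “variance” I just mean the spread of the accuracy column across the repeated evaluations, e.g.:)

    import numpy as np

    res = np.array(val_res)      # the [loss, accuracy] pairs from the repeated evaluate_generator calls
    print(res[:, 1].std())       # spread of the accuracy column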

Here are the validation accuracies on a tiny sample:

array([[ 8.6 ,  0.05],
       [ 8.6 ,  0.05],
       [ 8.6 ,  0.05],
       [ 8.6 ,  0.05],
       [ 8.6 ,  0.05],
       [ 8.6 ,  0.05],
       [ 8.6 ,  0.05],
       [ 8.6 ,  0.05],
       [ 8.6 ,  0.05],
       [ 8.6 ,  0.05]])

The training set had 100 images. The validation set had 66 images.
This sample is likely too small to be useful for experiments, yet its val_acc variance is low.

Here are the validation accuracies on a tiny sample with 1002 images in the validation set:

array([[ 8.27,  0.11],
       [ 8.32,  0.11],
       [ 8.21,  0.12],
       [ 8.17,  0.11],
       [ 8.3 ,  0.11],
       [ 8.28,  0.11],
       [ 8.14,  0.12],
       [ 8.19,  0.11],
       [ 8.27,  0.11],
       [ 8.21,  0.11]])

The training set had 100 images. The validation set had 1002 images, the same as Jeremy’s validation set.

Relating this to the first question:

I know I’m missing something. I think the validation accuracy shouldn’t change when the model is the same and the validation set is the same. However, the observations above say this isn’t true, and so I’m confused.

In the statefarm-sample.ipynb, why is “shuffle=False” for the validation batch?

batches = get_batches(path+'train', batch_size=batch_size)
val_batches = get_batches(path+'valid', batch_size=batch_size*2, shuffle=False)

After spending literally 3 days full-time with Jeremy’s code / Lesson 4, I ended up with a stellar result at spot #1350 out of 1440 total submissions.

Epoch 12/12
60000/60000 [==============================] - 10s - loss: 2.2212 - acc: 0.3846 - val_loss: 1.5359 - val_acc: 0.5186

51% validation accuracy. The validation set has different drivers from the training set. How come this ends up in the trash dump of results?

I’m clearly doing something wrong. This is the last part of the code I was working on: http://pastebin.com/UfuxgcZ1

PS: If anyone could paste a piece of clean, A-to-Z .py code that somehow works for the statefarm case, I’d appreciate it.
The Lesson 4 notebook code caused countless out-of-memory and syntax errors that took a lot of time to correct and resolve.

After the training is completed, we’ll have the final values of the estimated latent factors for movies and users. Hence, for a particular movie and user, we can just multiply the vectors and get the predicted rating.
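A minimal sketch of that prediction, assuming the trained factor matrices and bias vectors have been pulled out of the model as numpy arrays (the names here are hypothetical):

    import numpy as np

    # user_factors: (n_users, n_factors), movie_factors: (n_movies, n_factors)
    def predict_rating(user_id, movie_id, user_factors, movie_factors, user_bias, movie_bias):
        return (np.dot(user_factors[user_id], movie_factors[movie_id])
                + user_bias[user_id] + movie_bias[movie_id])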

The value of “0.7959” in your Lecture 4 for MovieLens is the MSE, not the RMSE; its RMSE is about 0.89, which is approximately the state of the art, as you yourself report.
So it’s not way better than the state of the art, as Jeremy says…
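(The conversion is just the square root of the reported MSE:)

    import numpy as np
    np.sqrt(0.7959)   # ≈ 0.892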

2 Likes

Getting the following error when trying to fit.

model.fit([trn.userId, trn.itemId], trn.rating, batch_size=64, nb_epoch=1,
validation_data=([val.userId, val.itemId], val.rating))

Exception Traceback (most recent call last)
<ipython-input-...> in <module>()
1 model.fit([trn.userId, trn.itemId], trn.rating, batch_size=64, nb_epoch=1,
----> 2 validation_data=([val.userId, val.itemId], val.rating))

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\keras\engine\training.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch)
1135 check_batch_axis=False,
1136 batch_size=batch_size)
-> 1137 self._make_test_function()
1138 val_f = self.test_function
1139 if self.uses_learning_phase and not isinstance(K.learning_phase(), int):

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\keras\engine\training.pyc in _make_test_function(self)
780 [self.total_loss] + self.metrics_tensors,
781 updates=self.state_updates,
–> 782 **self._function_kwargs)
783
784 def _make_predict_function(self):

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\keras\backend\theano_backend.pyc in function(inputs, outputs, updates, **kwargs)
967 msg = ‘Invalid argument “%s” passed to K.function’ % key
968 raise ValueError(msg)
–> 969 return Function(inputs, outputs, updates=updates, **kwargs)
970
971

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\keras\backend\theano_backend.pyc in init(self, inputs, outputs, updates, **kwargs)
953 allow_input_downcast=True,
954 on_unused_input=‘ignore’,
–> 955 **kwargs)
956
957 def call(self, inputs):

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\compile\function.pyc in function(inputs, outputs, mode, updates, givens, no_default_updates, accept_inplace, name, rebuild_strict, allow_input_downcast, profile, on_unused_input)
318 on_unused_input=on_unused_input,
319 profile=profile,
–> 320 output_keys=output_keys)
321 # We need to add the flag check_aliased inputs if we have any mutable or
322 # borrowed used defined inputs

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\compile\pfunc.pyc in pfunc(params, outputs, mode, updates, givens, no_default_updates, accept_inplace, name, rebuild_strict, allow_input_downcast, profile, on_unused_input, output_keys)
477 accept_inplace=accept_inplace, name=name,
478 profile=profile, on_unused_input=on_unused_input,
–> 479 output_keys=output_keys)
480
481

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\compile\function_module.pyc in orig_function(inputs, outputs, mode, accept_inplace, name, profile, on_unused_input, output_keys)
1775 on_unused_input=on_unused_input,
1776 output_keys=output_keys).create(
-> 1777 defaults)
1778
1779 t2 = time.time()

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\compile\function_module.pyc in create(self, input_storage, trustme, storage_map)
1639 theano.config.traceback.limit = 0
1640 _fn, _i, _o = self.linker.make_thunk(
-> 1641 input_storage=input_storage_lists, storage_map=storage_map)
1642 finally:
1643 theano.config.traceback.limit = limit_orig

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\link.pyc in make_thunk(self, input_storage, output_storage, storage_map)
688 return self.make_all(input_storage=input_storage,
689 output_storage=output_storage,
–> 690 storage_map=storage_map)[:3]
691
692 def make_all(self, input_storage, output_storage):

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\vm.pyc in make_all(self, profiler, input_storage, output_storage, storage_map)
1001 storage_map,
1002 compute_map,
-> 1003 no_recycling))
1004 if not hasattr(thunks[-1], ‘lazy’):
1005 # We don’t want all ops maker to think about lazy Ops.

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\op.pyc in make_thunk(self, node, storage_map, compute_map, no_recycling)
968 try:
969 return self.make_c_thunk(node, storage_map, compute_map,
–> 970 no_recycling)
971 except (NotImplementedError, utils.MethodNotDefined):
972 logger.debug(‘Falling back on perform’)

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\op.pyc in make_c_thunk(self, node, storage_map, compute_map, no_recycling)
877 logger.debug(‘Trying CLinker.make_thunk’)
878 outputs = cl.make_thunk(input_storage=node_input_storage,
–> 879 output_storage=node_output_storage)
880 fill_storage, node_input_filters, node_output_filters = outputs
881

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\cc.pyc in make_thunk(self, input_storage, output_storage, storage_map, keep_lock)
1198 cthunk, in_storage, out_storage, error_storage = self.compile(
1199 input_storage, output_storage, storage_map,
-> 1200 keep_lock=keep_lock)
1201
1202 res = _CThunk(cthunk, init_tasks, tasks, error_storage)

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\cc.pyc in compile(self, input_storage, output_storage, storage_map, keep_lock)
1141 output_storage,
1142 storage_map,
-> 1143 keep_lock=keep_lock)
1144 return (thunk,
1145 [link.Container(input, storage) for input, storage in

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\cc.pyc in cthunk_factory(self, error_storage, in_storage, out_storage, storage_map, keep_lock)
1593 else:
1594 module = get_module_cache().module_from_key(
-> 1595 key=key, lnk=self, keep_lock=keep_lock)
1596
1597 vars = self.inputs + self.outputs + self.orphans

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\cmodule.pyc in module_from_key(self, key, lnk, keep_lock)
1140 try:
1141 location = dlimport_workdir(self.dirname)
-> 1142 module = lnk.compile_cmodule(location)
1143 name = module.file
1144 assert name.startswith(location)

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\cc.pyc in compile_cmodule(self, location)
1504 lib_dirs=self.lib_dirs(),
1505 libs=libs,
-> 1506 preargs=preargs)
1507 except Exception as e:
1508 e.args += (str(self.fgraph),)

C:\Users\Sai Kiran\Anaconda3\envs\python2\lib\site-packages\theano\gof\cmodule.pyc in compile_str(module_name, src_code, location, include_dirs, lib_dirs, libs, preargs, py_module, hide_symbols)
2202 # difficult to read.
2203 raise Exception(‘Compilation failed (return status=%s): %s’ %
-> 2204 (status, compile_stderr.replace(’\n’, '. ')))
2205 elif config.cmodule.compilation_warning and compile_stderr:
2206 # Print errors just below the command line.

Exception: ('The following error happened while compiling the node', Elemwise{sqr,no_inplace}(embedding_9_W), '\n', 'Compilation failed (return status=1): C:\Users\Sai Kiran\Anaconda3\envs\python2\libs/python27.lib: error adding symbols: File in wrong format\r. collect2.exe: error: ld returned 1 exit status\r. ', '[Elemwise{sqr,no_inplace}(embedding_9_W)]')

So cool to do everything in Excel. Really proves it’s very simple.

1 Like

I wondered this too. Here’s my answer (with the caveat that it might be wrong since I’m in the middle of taking the course as well):

It’s important to shuffle your training batches to ensure that you’re not using the exact same mini-batches in every epoch. However, since your validation set is only used to score the model after each epoch of training, it’s not necessary to shuffle it. Because the validation step occurs after you’ve run through the training set, shuffling or not shuffling should result in the same mean validation accuracy and loss numbers.

That makes sense @mifeng.

I also believe validation batches aren’t shuffled so that the data can be matched to the filenames; this gives you the ability to see where your predictions went right and wrong visually.
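For example, a rough sketch of that kind of inspection, assuming val_batches was created with shuffle=False so the order stays stable:

    preds = model.predict_generator(val_batches, val_batches.nb_sample)

    # with shuffle=False the predictions line up with the generator's file order
    list(zip(val_batches.filenames, preds.argmax(axis=1)))[:5]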

thanks

1 Like

To me, this idea of using meta-data to make a best initial prediction for a user seems analogous to doing pseudo-labelling with image data.

In the case of image data, the initial raw data is the image pixels, while the only initial raw data available for a new user or a new movie is the meta-data. To take this a step further, I feel this need for meta-data makes the case for creating visualizations of the movies grouped by their latent factor scores, in order to understand what each latent factor is actually representing. Once a latent factor is understood, it could then be used to generate the meta-data for a new movie or user.
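A rough sketch of that kind of visualization, assuming movie_emb is the trained movie Embedding layer and movie_names maps a movie index to its title (both names are hypothetical):

    import numpy as np
    from sklearn.decomposition import PCA

    movie_factors = movie_emb.get_weights()[0]             # (n_movies, n_factors)
    comps = PCA(n_components=3).fit_transform(movie_factors)

    # look at the movies at the extremes of the first component to guess what it encodes
    order = np.argsort(comps[:, 0])
    print([movie_names[i] for i in order[:10]])            # lowest scores on component 1
    print([movie_names[i] for i in order[-10:]])           # highest scores on component 1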

So when we affix a neural net onto these latent factors, are we not losing this ability to understand why the particular movie is rated so high or so low by a particular user?

When using predict for the neural network in this lesson, we are passing in a user ID and a movie ID to see what the user would probably rate the movie, correct?

Here is what I used:

nn.predict([np.array([3]), np.array([1009])])
array([[ 4.3557]], dtype=float32)