Lesson 6 In-Class Discussion

I think this isn’t wikified as of now…

Oops! Fixed now. Thanks :slight_smile:

I have a few questions here

  1. In the RNN notebook we build an RNN from scratch using PyTorch. In the __init__ function we have

self.e = nn.Embedding(vocab_size, n_fac)

But in the forward function we call self.e(c1). How do these fit together? c1 has a certain sequence length, which is not equal to vocab_size. Could someone explain how this works?

  2. Are we using md to feed in the various batches of c1, c2, c3?

md = ColumnarModelData.from_arrays('.', [-1], np.stack([x1,x2,x3], axis=1), y, bs=512)

  3. Towards the end of the class there was a question about sequence length and the initial sequence being a bunch of zeros. I didn’t follow the question, and hence the answer. It would be great if this could be explained as well.

  4. Are we setting only the first hidden state layer’s weights to the identity matrix with the code below?


How does this help contain exploding or vanishing gradients if the weights of the other layers are different?

  5. Lastly, everything I have learned about RNNs so far involves LSTMs. I did not hear Jeremy mention them in class. Are we going to see them in this part or in the next part?
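
On the first question above, one way to picture it: an embedding of shape (vocab_size, n_fac) behaves like a lookup table with vocab_size rows, and c1 holds integer character ids that each select one row. So the input never needs to have length vocab_size; its values just need to lie in [0, vocab_size). The sketch below mimics this in plain Python. TinyEmbedding is a hypothetical stand-in written for illustration, not the PyTorch class, and the sizes are example values.

```python
import random

class TinyEmbedding:
    """Hypothetical stand-in for a lookup-table embedding (illustration only)."""

    def __init__(self, vocab_size, n_fac, seed=0):
        rng = random.Random(seed)
        # One row of n_fac numbers per vocabulary entry.
        self.weight = [[rng.gauss(0.0, 1.0) for _ in range(n_fac)]
                       for _ in range(vocab_size)]

    def __call__(self, ids):
        # Each integer id selects one row; the output has len(ids) rows.
        return [self.weight[i] for i in ids]

vocab_size, n_fac = 85, 42    # example sizes, assumed for this sketch
e = TinyEmbedding(vocab_size, n_fac)

c1 = [3, 17, 17, 84]          # a batch of 4 character ids, each in [0, vocab_size)
out = e(c1)
print(len(out), len(out[0]))  # 4 42
```

The same id always returns the same row, which is why repeated characters share one learned vector.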

For some languages (such as Chinese), a character is a word.

From these forum threads, a group of enterprising students under Jeremy’s guidance could easily create a book: “Deep learning in fastai”. :face_with_monocle:


Video timelines for Lesson 6

  • 00:00:10 Review of articles and works
    "Optimization for Deep Learning Highlights in 2017" by Sebastian Ruder,
    “Implementation of AdamW/SGDW paper in Fastai”,
    “Improving the way we work with learning rate”,
    “The Cyclical Learning Rate technique”

  • 00:02:10 Review of last week “Deep Dive into Collaborative Filtering” with MovieLens, analyzing our model, ‘movie bias’, ‘@property’, ‘self.models.model’, ‘learn.models’, ‘CollabFilterModel’, ‘get_layer_groups(self)’, ‘lesson5-movielens.ipynb’

  • 00:12:10 Jeremy: “I try to use Numpy for everything, except when I need to run it on GPU, or derivatives”,
    Question: “Bring the model from GPU to CPU into production ?”, move the model to CPU with ‘m.cpu()’, ‘load_model(m, p)’, back to GPU with ‘m.cuda()’, ‘zip()’ function in Python

  • 00:16:10 Sort the movies, John Travolta Scientology worst movie of all time “Battlefield Earth”, ‘key=itemgetter()’, ‘key=lambda’

  • 00:18:30 Embedding interpretation, using ‘PCA’ from ‘sklearn.decomposition’ for Linear Algebra

  • 00:24:15 Looking at the “Rossmann Retail / Store” Kaggle competition with the ‘Entity Embeddings of Categorical Variables’ paper.

  • 00:41:02 “Rossmann” Data Cleaning / Feature Engineering, using a Test set properly, Create Features (check the Machine Learning “ML1” course for details), ‘apply_cats’ instead of ‘train_cats’, ‘pred_test = m.predict(True)’, result on Kaggle Public Leaderboard vs Private Leaderboard with a poor Validation Set. Example: Statoil/Iceberg challenge/competition.

  • 00:47:10 A mistake made by the Rossmann 3rd-place winner, more on the Rossmann model.

  • 00:53:20 “How to write something that is different than Fastai library”


  • 00:59:55 More into SGD with ‘lesson6-sgd.ipynb’ notebook, a Linear Regression problem with continuous outputs. ‘a*x+b’ & mean squared error (MSE) loss function with ‘y_hat’

  • 01:02:55 Gradient Descent implemented in PyTorch, ‘loss.backward()’, ‘.grad.data.zero_()’ in ‘optim.SGD’ class

  • 01:07:05 Gradient Descent with Numpy

  • 01:09:15 RNNs with ‘lesson6-rnn.ipynb’ notebook with Nietzsche, Swiftkey post on smartphone keyboard powered by Neural Networks

  • 01:12:05 a Basic NN with single hidden layer (rectangle, arrow, circle, triangle), by Jeremy,
    Image CNN with single dense hidden layer.

  • 01:23:25 Three char model, question on ‘in1, in2, in3’ dimensions

  • 01:36:05 Test model with ‘get_next(inp)’,
    Let’s create our first RNN, why use the same weight matrices ?

  • 01:48:45 RNN with PyTorch, question: “What does the hidden state represent?”

  • 01:57:55 Multi-output model

  • 02:05:55 Question on ‘sequence length vs batch size’

  • 02:09:15 The Identity Matrix (init!), a paper from Geoffrey Hinton “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units”


I really didn’t get the EmbeddingDotBias object from this lesson and the “Movie bias” part of the lesson5-movielens notebook.
How is a bias embedding matrix able to tell us which is the best or the worst movie of all time? How can we infer that bias = best/worst movie in our case? (By the way, we did a lookup on the top 3000 movies with movie_bias = to_np(m.ib(V(topMovieIdx))), so how are we supposed to find the worst movies?)

It’s even more confusing because @jeremy takes a different approach for “Embedding interpretation”: he plots the scores along the reduced dimensions of the embedding and then guesses what the relationship is between the top and bottom items.

Can anyone shed some light on this? Thanks

The way I find easiest to understand this is with the spreadsheet where I showed how the dot product and bias fit together. Does anyone else have other ways to think about this which might provide more intuition for @Ekami?
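
For readers without the spreadsheet to hand, here is a rough sketch of how the dot product and the bias terms combine into one predicted rating. All numbers and names are made up for illustration; this is not the EmbeddingDotBias code from the notebook.

```python
def predict_rating(user_factors, movie_factors, user_bias, movie_bias):
    """Dot product of latent factors plus the two bias terms."""
    interaction = sum(u * m for u, m in zip(user_factors, movie_factors))
    return interaction + user_bias + movie_bias

# Made-up learned parameters for three movies (illustration only).
movie_factors = {"A": [0.9, 0.1], "B": [0.9, 0.1], "C": [0.2, 0.8]}
movie_bias = {"A": 0.6, "B": -0.7, "C": 0.1}

user_factors, user_bias = [1.0, 0.5], 0.2

for name in movie_factors:
    r = predict_rating(user_factors, movie_factors[name], user_bias, movie_bias[name])
    print(name, round(r, 2))

# Sorting by bias ranks movies by the part of the rating that does not
# depend on any particular user's tastes.
ranked = sorted(movie_bias, key=movie_bias.get)
print("lowest bias:", ranked[0], "highest bias:", ranked[-1])
```

Movies A and B have identical factors here, so every user’s interaction term is the same for both and the whole difference in predicted rating comes from the bias. Sorting the looked-up biases in ascending rather than descending order gives the lowest-rated end of the same list.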


Regarding dim=-1 in fastai/courses/dl1/lesson6-rnn.ipynb

Has anybody else run into this error?:
TypeError: log_softmax() got an unexpected keyword argument 'dim'

It happens in the last line in the rnn models where the softmax is called:

    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = F.relu(self.l_in(self.e(c)))
            h = F.tanh(self.l_hidden(h+inp))
        return F.log_softmax(self.l_out(h), dim=-1)  # <===

Looks like it should be changed to:

        return F.log_softmax(self.l_out(h))  # <===

Looking at this thread:

it seems that this parameter has been changed in PyTorch and should be removed from CharLoopModel, CharLoopConcatModel and CharRnn. Possibly the most up-to-date notebook is not on GitHub (?). For now I have just removed it in my local copy. @jeremy do you have a more up-to-date version from the lesson that you could check in?


PyTorch 0.3.0’s softmax function accepts the dim argument.

You probably need to do a git pull + conda env update. There was an update that switches the PyTorch channel from “soumith” to “pytorch”, which results in grabbing PyTorch 0.3.0 rather than 0.2.0.


Strange, I had updated both fastai and my environment. I saw this post:

so I ran:

git pull
source activate fastai
conda env update

from These 4 lines will solve 80% of your problems

and saw that 0.3.0 was installed:
pytorch-0.3.0- 100% |################################|

If it is supposed to accept the parameter, that is helpful. I had thought it had been removed. I’ll go through the steps to update again to see if I can get the right behavior. Thanks.
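
Since conda reported installing 0.3.0 while the error suggests an older version, one thing worth checking is which version the notebook kernel actually imports. The sketch below does that with the standard library only; the helper names are hypothetical, and the 0.3 threshold is taken from the reply above rather than from the PyTorch changelog.

```python
import importlib

def installed_version(package_name):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

def version_tuple(version):
    """Turn '0.3.0' or '0.3.0.post4' into (0, 3, 0) by keeping leading numeric parts."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

version = installed_version("torch")
if version is None:
    print("torch is not importable in this environment")
elif version_tuple(version) < (0, 3):
    print("torch", version, "is older than 0.3.0; log_softmax(dim=...) may fail")
else:
    print("torch", version, "should accept the dim argument")
```

Running this in a notebook cell (rather than in a terminal) tests the environment the kernel is really using, which may differ from the one that was updated.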


@Ekami, in case this is still unresolved for you: after the first step @jeremy suggested (understanding the spreadsheet), here are some more ideas:

The point is that the “latent factors” of your embedding capture interactions between user and movie: how much dialogue the movie has interacts with how much this user loves or hates long dialogue. Bias terms do not interact; they are user-specific or movie-specific. So by looking at the movie biases that your SGD has learnt, you are seeing the specific goodness or badness of a movie… more or less.

I say more or less because, the way I see it, rather than “the best/worst movies of all time” it may be closer to “the best/worst movies in their class”: the movies that share similar latent factors with others but somehow got much better or much worse reviews than those similar movies. It’s a more subtle way of assessing the goodness of a movie, beyond how many stars it was rated (otherwise you would just take the movies with 1 star as worst and the movies with 5 stars as best, with no machine learning at all :grinning:)

And about reducing dimensionality for visualization… I don’t think it’s conceptually different from taking just size-two or size-three embeddings and looking at 2D or 3D plots to gain intuition about where movies are. Not really that confusing :wink:


Thanks a lot for taking the time to explain it to me @miguel_perez :slight_smile: . It’s clearer now, but I still feel like I’m missing some pieces, such as “what is a latent factor” and a few other details. I’ll look them up myself and come back to your explanation. That really helps, thanks a lot :slight_smile:


A factor is something that causes another thing. For example, sunlight causes energy.

A latent factor is also a factor, but it’s hard to measure directly. You know it is present and causing something, but it’s hard to measure it.

In the movies example, “dialogue rich” and “action-comedy” are two possible latent factors for movies. There is no scale to measure them directly, other than a real number between 0 and 1 showing how much ‘dialogue richness’ or ‘action-comedyness’ there is in the movie.

For users, say the 2 latent factors are ‘love for dialogue’ and ‘aversion to action-comedy type fights’. When you multiply the user factors by the corresponding movie factors and sum them up, you get a score which can be equated to the rating that user would give a particular movie.

In collaborative filtering, we go from known ratings to inferring latent factors. Then, using those inferred latent factors, we predict the unknown ratings (for recommending movies to users).
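
As a rough illustration of that last step, the sketch below infers two latent factors per user and per movie from a handful of made-up ratings using plain SGD, then uses them to score a pair that was never rated. The data, the function names, and the hyperparameters are all assumptions for this example; bias terms are left out, and this is not how the fastai library implements it.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_factors(ratings, n_users, n_movies, n_factors=2,
                  lr=0.02, epochs=500, seed=0):
    """Fit user and movie latent factors to known (user, movie, rating) triples."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    movies = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_movies)]
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for u, m, rating in ratings:
            err = dot(users[u], movies[m]) - rating
            squared_error += err * err
            for k in range(n_factors):
                # Gradient of 0.5 * err**2 with respect to each factor.
                u_k, m_k = users[u][k], movies[m][k]
                users[u][k] -= lr * err * m_k
                movies[m][k] -= lr * err * u_k
        history.append(squared_error / len(ratings))
    return users, movies, history

# Made-up ratings: users 0 and 1 prefer movies 0 and 1, users 2 and 3 prefer movies 2 and 3.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 2), (3, 2, 4), (3, 3, 5),
]

users, movies, history = train_factors(ratings, n_users=4, n_movies=4)
print("mean squared error: first epoch %.3f, last epoch %.3f" % (history[0], history[-1]))
print("predicted rating for user 0, movie 3 (never rated): %.2f" % dot(users[0], movies[3]))
```

Nobody tells the model what the two factors mean; any interpretation such as “dialogue rich” comes from inspecting the learned numbers afterwards.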

Hope this helps.


Thanks a lot for this very clear information :slight_smile:


I am reviewing RNNs and I have a very basic question. I didn’t quite understand why we are not including the final sequence when creating c_in_dat:

Thanks !

IIRC it’s because the final sequence doesn’t have a label.
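
A small sketch of that idea, assuming each input is a window of cs consecutive character ids and its label is the id that follows. make_windows and the toy idx list are hypothetical and written for illustration; this is not the notebook’s exact code.

```python
def make_windows(idx, cs):
    """Build (inputs, labels): each input is cs consecutive ids, the label is the next id."""
    # The last usable start is len(idx) - cs - 1, because the label sits at start + cs.
    starts = range(len(idx) - cs)
    c_in = [[idx[s + j] for j in range(cs)] for s in starts]
    c_out = [idx[s + cs] for s in starts]
    return c_in, c_out

idx = [10, 11, 12, 13, 14]    # stand-in for the character ids of a text
c_in, c_out = make_windows(idx, cs=3)
print(c_in)    # [[10, 11, 12], [11, 12, 13]]
print(c_out)   # [13, 14]
```

The window [12, 13, 14] is left out because its label would be idx[5], which does not exist.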

I posted this question on the Wiki: Lesson 6 topic, but wasn’t able to figure it out. Could anybody point me in the right direction?

Thank you!!

In simplest terms, my understanding is this:
An autoencoder reconstructs the input image as accurately as it can.
A variational autoencoder (VAE) produces variations of the input image as output: for a given input image there can be multiple output images that are close to the input but differ from it in certain ways. In a VAE, the latent space is represented as a probability distribution.

So when we run to_np(m.ib(V(topMovieIdx).cpu())), does this make a prediction for any set of movies?