Lesson 6 In-Class Discussion

Typing [-1] is a shortcut :stuck_out_tongue:

Regarding the suggestion that we do inference on the CPU since we don’t want to deal with splitting the data into batches: what if inference on the CPU is far too slow to be useful for our industrial purposes?

Muting Jeremy’s notebook might help keep unusual sounds out of the video lectures :thinking:.

Is that the article Jeremy was talking about?


Understood :slight_smile: I’ve been dealing with very overfitted models for one of the Kaggle competitions, so I’m probably being hypersensitive.

I meant to ask if we have a return-sequences flag to keep the triangle outside the box… but in hindsight, it’s not a big deal to pull the last element out of the list, and it doesn’t need any more code. Thanks!

Starting to really like PyTorch. It’s much easier to go up and down the layers of abstraction. I can’t believe Jeremy inspected different layers in the network by passing values directly to them to check their shapes, and built the entire RNN with just linear layers and a for loop; it was so clear and concise. That sort of thing is much harder in static frameworks.

Just posted the lesson video to the top post.


You can either use more CPUs, or switch to using a GPU for inference.
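For what it’s worth, here is a minimal plain-PyTorch sketch of running the same inference on CPU or GPU; the tiny linear model and random batch are just stand-ins for the real trained model and data, not the notebook’s actual code:

import torch
import torch.nn as nn

m = nn.Linear(10, 2)           # stand-in for the trained model
x = torch.randn(64, 10)        # stand-in for a preprocessed input batch

m.eval()                       # inference mode: no dropout / batchnorm updates
if torch.cuda.is_available():
    m, x = m.cuda(), x.cuda()  # move model and data to the GPU when one is available
preds = m(x)                   # forward pass only; no gradients needed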

Yes that’s it! Want to pop it in the wiki post for us?

Yeah this lesson was way easier to teach in PyTorch… :slight_smile:

@yinterian Can you please share the slides that were shown in the class - related to the entity embedding paper and also the simple diagrams at the end?

I think this isn’t wikified as of now…

Oops! Fixed now. Thanks :slight_smile:

I have a few questions here:

  1. In the RNN notebook, we build an RNN model from scratch in PyTorch. In its init function we have

self.e = nn.Embedding(vocab_size, n_fac)

But in the forward function we use self.e(c1). How does this tally? c1 has a certain sequence length and is not equal to vocab_size. Could someone explain how this fits together? (See the first sketch after this list.)

  2. Are we using md to feed in the various batches of c1, c2, c3?

md = ColumnarModelData.from_arrays('.', [-1], np.stack([x1,x2,x3], axis=1), y, bs=512)

  3. Towards the end of the class there was a question about sequence length and the initial sequence being a bunch of zeros. I didn’t get the question, and hence the answer. It would be great if this could be explained as well.

  4. Are we setting only the first hidden-state layer’s weights to the identity matrix with the code below?

m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))

How does this help contain exploding or vanishing gradients if the weights of the other layers are different from this? (See the second sketch after this list.)

  5. Lastly, everything I have learned about RNNs so far involves LSTMs. I did not hear Jeremy mention them in class. Are we going to see them in this part or in the next part?
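Regarding question 1, here is a minimal sketch (the sizes are illustrative, roughly matching the notebook): nn.Embedding(vocab_size, n_fac) is just a lookup table with vocab_size rows and n_fac columns, and c1 holds integer character indices, so its length only sets the batch dimension and never has to match vocab_size.

import torch
import torch.nn as nn

vocab_size, n_fac, bs = 85, 42, 512       # illustrative sizes, roughly as in the notebook
e = nn.Embedding(vocab_size, n_fac)       # a (vocab_size x n_fac) lookup table

c1 = torch.randint(0, vocab_size, (bs,))  # a batch of character *indices*, not one-hot vectors
emb = e(c1)                               # each index selects one row of the table

print(emb.shape)                          # torch.Size([512, 42]) -> batch_size x n_fac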
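Regarding question 4, here is a minimal sketch of that init, assuming a single-layer nn.RNN with a ReLU nonlinearity (n_hidden and the input size below are just illustrative); it is the trick from the “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units” paper mentioned in the timeline below.

import torch
import torch.nn as nn

n_hidden = 256
rnn = nn.RNN(input_size=42, hidden_size=n_hidden, nonlinearity='relu')

# Only the hidden-to-hidden matrix is set to the identity: it is the one weight
# that gets multiplied into the hidden state at every time step, so with ReLU an
# identity start neither shrinks nor amplifies the state as it is carried forward.
# The input-to-hidden and output weights are applied once per step, keep their
# usual random init, and are learned as normal.
rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))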

For some languages (such as Chinese), a character is a word.

From these forum threads, a group of enterprising students under Jeremy’s management could easily create a book: “Deep Learning in fastai”. :face_with_monocle:


Video timelines for Lesson 6

  • 00:00:10 Review of articles and works
    "Optimization for Deep Learning Highlights in 2017" by Sebastian Ruder,
    “Implementation of AdamW/SGDW paper in Fastai”,
    “Improving the way we work with learning rate”,
    “The Cyclical Learning Rate technique”

  • 00:02:10 Review of last week “Deep Dive into Collaborative Filtering” with MovieLens, analyzing our model, ‘movie bias’, ‘@property’, ‘self.models.model’, ‘learn.models’, ‘CollabFilterModel’, ‘get_layer_groups(self)’, ‘lesson5-movielens.ipynb’

  • 00:12:10 Jeremy: “I try to use Numpy for everything, except when I need to run it on GPU, or derivatives”,
    Question: “Bringing the model from GPU to CPU for production?”, move the model to CPU with ‘m.cpu()’, ‘load_model(m, p)’, back to GPU with ‘m.cuda()’, the ‘zip()’ function in Python

  • 00:16:10 Sorting the movies; John Travolta’s Scientology movie “Battlefield Earth”, worst movie of all time; ‘key=itemgetter()’, ‘key=lambda’

  • 00:18:30 Embedding interpretation, using ‘PCA’ from ‘sklearn.decomposition’ for Linear Algebra

  • 00:24:15 Looking at the “Rossmann Retail / Store” Kaggle competition with the ‘Entity Embeddings of Categorical Variables’ paper.

  • 00:41:02 “Rossmann” Data Cleaning / Feature Engineering, using a Test set properly, Create Features (check the Machine Learning “ML1” course for details), ‘apply_cats’ instead of ‘train_cats’, ‘pred_test = m.predict(True)’, result on Kaggle Public Leaderboard vs Private Leaderboard with a poor Validation Set. Example: Statoil/Iceberg challenge/competition.

  • 00:47:10 A mistake made by the Rossmann 3rd-place winner, more on the Rossmann model.

  • 00:53:20 “How to write something that is different from the Fastai library”

  • PAUSE

  • 00:59:55 More into SGD with ‘lesson6-sgd.ipynb’ notebook, a Linear Regression problem with continuous outputs. ‘a*x+b’ & mean squared error (MSE) loss function with ‘y_hat’

  • 01:02:55 Gradient Descent implemented in PyTorch, ‘loss.backward()’, ‘.grad.data.zero_()’ in ‘optim.sgd’ class

  • 01:07:05 Gradient Descent with Numpy

  • 01:09:15 RNNs with ‘lesson6-rnn.ipynb’ notebook with Nietzsche, Swiftkey post on smartphone keyboard powered by Neural Networks

  • 01:12:05 A basic NN with a single hidden layer (rectangle, arrow, circle, triangle), drawn by Jeremy,
    an image CNN with a single dense hidden layer.

  • 01:23:25 Three char model, question on ‘in1, in2, in3’ dimensions

  • 01:36:05 Test model with ‘get_next(inp)’,
    Let’s create our first RNN, why use the same weight matrices?

  • 01:48:45 RNN with PyTorch, question: “What does the hidden state represent?”

  • 01:57:55 Multi-output model

  • 02:05:55 Question on ‘sequence length vs batch size’

  • 02:09:15 The identity matrix init, and the Geoffrey Hinton paper “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units”


I really didn’t get the EmbeddingDotBias object from this lesson and the “Movie bias” part of the lesson5-movielens notebook.
How is a bias embedding matrix able to tell us what the best or worst movie of all time is? How can we infer bias = best/worst movie in our case? (By the way, we did a lookup on the top 3000 movies with movie_bias = to_np(m.ib(V(topMovieIdx))), so how are we supposed to find the worst movies?)

It’s even more confusing as @jeremy takes a different approach for “Embedding interpretation”, where he plots the scores of the dimensionality-reduced embedding and then guesses what the relationship between the top and bottom items is.

Can anyone shed some light on this? Thanks

The way I find easiest to understand this is with the spreadsheet where I showed how the dot product and bias fit together. Does anyone else have other ways to think about this which might provide more intuition for @Ekami?
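For another angle, here is a minimal plain-PyTorch sketch of the idea in the spreadsheet (it is not the actual fastai EmbeddingDotBias class): the prediction is the user-movie dot product plus a per-user and a per-movie bias, so the movie bias is one learned number per movie that says how well it is rated regardless of who rates it, and sorting that vector puts the “worst” movies at one end and the “best” at the other.

import torch
import torch.nn as nn

class DotBias(nn.Module):
    # sketch of a dot-product-plus-bias collaborative filtering model
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()
        self.u  = nn.Embedding(n_users,  n_factors)  # user factors
        self.m  = nn.Embedding(n_movies, n_factors)  # movie factors
        self.ub = nn.Embedding(n_users,  1)          # per-user bias
        self.mb = nn.Embedding(n_movies, 1)          # per-movie bias

    def forward(self, users, movies):
        dot = (self.u(users) * self.m(movies)).sum(1)
        return dot + self.ub(users).squeeze(1) + self.mb(movies).squeeze(1)

# After training, the movie-bias column alone ranks the movies:
model = DotBias(n_users=100, n_movies=3000)
bias = model.mb.weight.data.squeeze(1)   # one learned number per movie
worst_to_best = bias.argsort()           # low bias ~ disliked no matter who rates it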
