Typing [-1] is a shortcut.
Regarding the fact that we will want to do inference on CPU, since we don't want to deal with splitting the data into batches: what if inference on CPU is far too slow to be useful for our industrial purposes?
Muting Jeremy's notebook might help make video lectures without unusual sounds.
Is that the article Jeremy was talking about?
Understood. I've been dealing with very overfitted models for one of the Kaggle competitions, so I'm probably being hypersensitive.
I meant to ask if we have a return-sequences flag to keep the triangle outside the box… but in hindsight, it's not a big deal pulling out the last element from the list, and it doesn't need any more code. Thanks!
Starting to really like PyTorch. It's much easier to go up and down the layers of abstraction. Can't believe Jeremy inspected different layers in the network by passing values directly to them to check their shapes, and did the entire RNN with just linear layers and a for loop; that was so clear and concise. That sort of thing is much harder in static frameworks.
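For reference, here is a minimal sketch of that idea: a character model built from plain linear layers and a for loop. The class name, sizes, and activation choices below are illustrative, not Jeremy's exact notebook code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharLoopModel(nn.Module):
    # A character "RNN" built from plain linear layers plus a for loop.
    def __init__(self, vocab_size, n_fac, n_hidden):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)       # char index -> dense vector
        self.l_in = nn.Linear(n_fac, n_hidden)         # input -> hidden
        self.l_hidden = nn.Linear(n_hidden, n_hidden)  # hidden -> hidden (reused each step)
        self.l_out = nn.Linear(n_hidden, vocab_size)   # hidden -> output distribution

    def forward(self, *cs):
        bs = cs[0].size(0)
        h = torch.zeros(bs, self.l_hidden.out_features)  # initial hidden state
        for c in cs:  # loop over the characters in the sequence
            inp = F.relu(self.l_in(self.e(c)))
            h = torch.tanh(self.l_hidden(h + inp))       # same weights at every step
        return F.log_softmax(self.l_out(h), dim=-1)

m = CharLoopModel(vocab_size=85, n_fac=42, n_hidden=256)
c1 = torch.randint(0, 85, (512,))  # a batch of 512 character indices
out = m(c1, c1, c1)
print(out.shape)  # torch.Size([512, 85])
```

Because the same `l_hidden` weights are applied at every step of the loop, this is already a recurrent network in everything but name.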
Just posted the lesson video to the top post.
You can either use more CPUs, or switch to using GPU for inference.
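A quick sketch of what switching devices looks like in PyTorch; the `nn.Linear` here is just a stand-in for a trained model.

```python
import torch
import torch.nn as nn

m = nn.Linear(10, 2)  # placeholder for a trained model
# Move the model (and its weights) to CPU for inference:
m = m.cpu()
m.eval()
with torch.no_grad():          # no gradients needed at inference time
    x = torch.randn(4, 10)     # inputs must live on the same device as the model
    preds = m(x)
print(preds.shape)  # torch.Size([4, 2])

# Move back to GPU when one is available:
if torch.cuda.is_available():
    m = m.cuda()
```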
Yes, that's it! Want to pop it in the wiki post for us?
Yeah, this lesson was way easier to teach in PyTorch…
@yinterian Can you please share the slides that were shown in the class - related to the entity embedding paper and also the simple diagrams at the end?
I think this isn't wikified as of now…
Oops! Fixed now. Thanks
I have a few questions here:
- In the RNN model, we build an RNN from scratch using PyTorch. There, in the __init__ function, we have
self.e = nn.Embedding(vocab_size, n_fac)
But in the forward function we use self.e(c1). How does this tally? c1 is of a certain sequence length, not equal to vocab_size. Could someone explain how this fits together?
- Are we using md to feed in the various batches of c1, c2, c3?
md = ColumnarModelData.from_arrays('.', [-1], np.stack([x1,x2,x3], axis=1), y, bs=512)
- Towards the end of the class there was a question about sequence length and the initial sequence being a bunch of zeros. I didn't get the question, and hence the answer. It would be great if this could be explained as well.
- Are we setting only the first hidden-state layer's weights to the identity matrix by using the code below?
m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))
How does this help contain exploding or vanishing gradients if the weights of the other layers are different from this?
- Lastly, everything I have known about RNNs involves LSTMs. I didn't hear Jeremy mention them in class. Are we going to see that in this part or in the next part?
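On the first question: vocab_size is only the number of rows in the embedding's lookup table, not the shape of the input. The indices you pass in can have any batch or sequence shape; each index simply selects one row of the table. A small sketch (the sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, n_fac = 85, 42
e = nn.Embedding(vocab_size, n_fac)  # an (85 x 42) lookup table

# A batch of 512 character indices, each in [0, vocab_size):
c1 = torch.randint(0, vocab_size, (512,))
out = e(c1)
print(out.shape)  # torch.Size([512, 42]): one 42-dim vector per index
```

So self.e(c1) works for any batch of indices; the only constraint is that every index is smaller than vocab_size.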
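On the identity-matrix question: the line only touches weight_hh_l0, the hidden-to-hidden weights of layer 0, and for a one-layer RNN that is the only recurrent matrix. Since that matrix is the one multiplied into the hidden state at every time step, initializing it to the identity keeps the repeated multiplication from shrinking or amplifying the hidden state early in training (the idea from the Hinton et al. paper mentioned in the lesson). A sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

n_hidden = 256
rnn = nn.RNN(42, n_hidden)  # input size 42 is illustrative

# Overwrite the layer-0 hidden-to-hidden weights with the identity matrix,
# as in "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units":
rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))

# The hidden state is now initially carried forward unchanged at each step,
# so gradients flowing back through time neither vanish nor explode at init.
print(torch.equal(rnn.weight_hh_l0.data, torch.eye(n_hidden)))  # True
```

The other weights (input-to-hidden, output) are applied once per step rather than repeatedly composed, so they matter much less for the exploding/vanishing behaviour.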
For some languages (such as Chinese), a character is a word.
From these forum threads, a group of enterprising students under Jeremy's management could easily create a book: "Deep Learning in fastai".
Video timelines for Lesson 6
-
00:00:10 Review of articles and works
"Optimization for Deep Learning Highlights in 2017" by Sebastian Ruder,
"Implementation of AdamW/SGDW paper in Fastai",
"Improving the way we work with learning rate",
"The Cyclical Learning Rate technique"
-
00:02:10 Review of last week "Deep Dive into Collaborative Filtering" with MovieLens, analyzing our model, "movie bias", "@property", "self.models.model", "learn.models", "CollabFilterModel", "get_layer_groups(self)", "lesson5-movielens.ipynb"
-
00:12:10 Jeremy: "I try to use Numpy for everything, except when I need to run it on GPU, or derivatives",
Question: "Bring the model from GPU to CPU into production?", move the model to CPU with "m.cpu()", "load_model(m, p)", back to GPU with "m.cuda()", "zip()" function in Python
-
00:16:10 Sort the movies; John Travolta's Scientology film "Battlefield Earth", worst movie of all time; "key=itemgetter()", "key=lambda"
-
00:18:30 Embedding interpretation, using "PCA" from "sklearn.decomposition" for Linear Algebra
-
00:24:15 Looking at the "Rossmann Retail / Store" Kaggle competition with the "Entity Embeddings of Categorical Variables" paper.
-
00:41:02 "Rossmann" Data Cleaning / Feature Engineering, using a Test set properly, Create Features (check the Machine Learning "ML1" course for details), "apply_cats" instead of "train_cats", "pred_test = m.predict(True)", result on Kaggle Public Leaderboard vs Private Leaderboard with a poor Validation Set. Example: Statoil/Iceberg challenge/competition.
-
00:47:10 A mistake made by the Rossmann 3rd-place winner; more on the Rossmann model.
-
00:53:20 "How to write something that is different than the Fastai library"
-
PAUSE
-
00:59:55 More on SGD with the "lesson6-sgd.ipynb" notebook, a Linear Regression problem with continuous outputs; "a*x+b" & mean squared error (MSE) loss function with "y_hat"
-
01:02:55 Gradient Descent implemented in PyTorch, "loss.backward()", ".grad.data.zero_()" in the "optim.sgd" class
-
01:07:05 Gradient Descent with Numpy
-
01:09:15 RNNs with the "lesson6-rnn.ipynb" notebook on Nietzsche, SwiftKey post on a smartphone keyboard powered by Neural Networks
-
01:12:05 A basic NN with a single hidden layer (rectangle, arrow, circle, triangle), by Jeremy,
Image CNN with a single dense hidden layer.
-
01:23:25 Three char model, question on "in1, in2, in3" dimensions
-
01:36:05 Test model with "get_next(inp)",
Let's create our first RNN, why use the same weight matrices?
-
01:48:45 RNN with PyTorch, question: "What does the hidden state represent?"
-
01:57:55 Multi-output model
-
02:05:55 Question on "sequence length vs batch size"
-
02:09:15 The Identity Matrix (init!), a paper from Geoffrey Hinton, "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units"
I really didn't get the EmbeddingDotBias object from this lesson and the "Movie bias" part of the lesson5-movielens notebook.
How is a bias embedding matrix able to tell us the best or the worst movie of all time? How can we infer bias = best/worst movie in our case? (BTW, we did a lookup on the top 3000 movies with movie_bias = to_np(m.ib(V(topMovieIdx))), so how are we supposed to find the worst movies?)
It's even more confusing as @jeremy takes a different approach for "Embedding interpretation", where he plots the scores of the reduced-dimension embedding and then guesses what the relationship between the top and bottom items is.
Can anyone shed some light on this? Thanks!
The way I find easiest to understand this is with the spreadsheet where I showed how the dot product and bias fit together. Does anyone else have other ways to think about this which might provide more intuition for @Ekami?
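To add a toy numeric illustration of how the dot product and bias fit together (all the numbers below are made up):

```python
import numpy as np

# Toy collaborative filtering:
#   predicted rating = user_emb . movie_emb + user_bias + movie_bias
user_emb  = np.array([0.5, -0.2])   # one user's embedding (their tastes)
movie_emb = np.array([0.3,  0.8])   # one movie's embedding (its traits)
user_bias, movie_bias = 0.1, -1.2   # note the large negative movie bias

pred = user_emb @ movie_emb + user_bias + movie_bias  # -1.11

# The dot product captures the match between this user's tastes and this
# movie's traits. The movie bias shifts predictions for this movie for
# *every* user, regardless of taste, so a very negative bias means the
# model thinks the movie is rated poorly across the board. Sorting by
# bias therefore surfaces the "worst" (and "best") movies:
biases = {'Battlefield Earth': -1.2, 'Shawshank': 1.1, 'Average Film': 0.0}
worst_first = sorted(biases.items(), key=lambda kv: kv[1])
print(worst_first[0][0])  # Battlefield Earth
```

That is why sorting the looked-up bias vector in ascending order gives the worst movies, and in descending order the best, independently of any particular user's embedding.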