Lesson 12 wiki

Lesson Resources

Papers mentioned in the class

Grammar as a Foreign Language
Neural Machine Translation by Jointly Learning to Align and Translate

Other items of interest

Jupyter notebook widgets

These were used in the hackathon-winning demo we saw, to build something interactive that runs inside a notebook.

Jupyter notebook widgets github repo
Jupyter notebook widgets tutorial
More widget tutorials

PyTorch tutorials

Table of Contents

  1. Warm-up: numpy
  2. PyTorch: Tensors
  3. PyTorch: Variables and autograd
  4. PyTorch: Defining new autograd functions
  5. TensorFlow: Static Graphs
  6. PyTorch: nn
  7. PyTorch: optim
  8. PyTorch: Custom nn Modules
  9. PyTorch: Control Flow and Weight Sharing

Table of Contents

  1. Tensor multiplication
  2. Linear Regression
  3. Logistic Regression
  4. Neural Network
  5. Modern Neural Network
  6. Convolutional Neural Network

Massive Exploration of Neural Machine Translation Architectures

This is a very recent research paper on the subject (released last week, 21 Mar 2017).


Videos and notebooks have been added now. Sorry for the delay.

Transcript for the lesson 12 video added: https://drive.google.com/open?id=0BxXRvbqKucuNS29sRTExbmJRMTA

Please let me know if there is anything that should change.



I have a question regarding mean-shift clustering.

Mean-shift clustering does not require the number of clusters, which
appears to be a significant advantage over k-means. However, it does
require the Gaussian kernel width (or bandwidth) parameter, which
indirectly determines the number of clusters. The question is: is that
really an advantage? First, you do have to provide a parameter, just
like for k-means. Second, this kernel width parameter seems actually
less intuitive than the number of clusters, and thus harder to set. So
is this really an advantage over k-means?
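
To make the dependence on bandwidth concrete, here is a minimal sketch of textbook Gaussian mean-shift in one dimension. It is not the notebook code from the lesson; the names `mean_shift_1d`, `count_clusters` and `merge_tol`, and the toy data, are assumptions made up for illustration. It only shows that the same data can yield a different number of clusters depending on the bandwidth.

```python
import math

def mean_shift_1d(points, bandwidth, n_iter=100, tol=1e-6):
    """Move each point toward the Gaussian-weighted mean of the data until it settles."""
    shifted = list(points)
    for _ in range(n_iter):
        new = []
        for x in shifted:
            # Gaussian kernel weight of every original data point, seen from x
            weights = [math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points]
            new.append(sum(w * p for w, p in zip(weights, points)) / sum(weights))
        moved = max(abs(a - b) for a, b in zip(new, shifted))
        shifted = new
        if moved < tol:
            break
    return shifted

def count_clusters(modes, merge_tol):
    """Count distinct modes, treating modes closer than merge_tol as one (hypothetical helper)."""
    centers = []
    for m in sorted(modes):
        if not centers or m - centers[-1] > merge_tol:
            centers.append(m)
    return len(centers)

# Toy data: three tight groups around 0.2, 5.2 and 10.2
data = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4]
for bw in (0.5, 10.0):
    modes = mean_shift_1d(data, bw)
    print(bw, count_clusters(modes, merge_tol=bw / 2))
# prints "0.5 3" and then "10.0 1"
```

With a narrow kernel the three groups stay separate; with a wide one they collapse into a single mode, so the bandwidth is effectively standing in for the number of clusters.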

To follow up on this, as Jeremy pointed out, you can choose bandwidth
automatically by deciding to cover 1/3 of the data in the dataset (or
by some other means). But then, we have to set the coverage parameter
(1/3), which is also somewhat arbitrary, and may again be harder to
choose than the number of clusters.
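
One possible reading of "cover 1/3 of the data" is sketched below: for each point, take the radius that encloses roughly that fraction of the dataset, then average those radii. This is an assumption about what the heuristic means, not necessarily what was shown in class; the function name `bandwidth_for_coverage` and the sample points are hypothetical. (scikit-learn's `estimate_bandwidth` exposes a similar `quantile` parameter.)

```python
import math

def bandwidth_for_coverage(points, coverage=1/3):
    """Average radius needed around each point to enclose about `coverage` of the data."""
    n = len(points)
    # Number of points to enclose, counting the point itself
    k = max(1, math.ceil(coverage * n))
    radii = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        radii.append(dists[k - 1])  # dists[0] is always 0.0 (the point itself)
    return sum(radii) / n

# Two small groups of 2-D points, far apart from each other
pts = [(0, 0), (0, 1), (0, 2), (10, 0), (10, 1), (10, 2)]
print(bandwidth_for_coverage(pts, coverage=0.5))
print(bandwidth_for_coverage(pts, coverage=1.0))
```

The coverage fraction still has to be picked by hand, but it is scale-free: the resulting bandwidth adapts to the units and spread of the data.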

Just to be clear, mean-shift clustering does have a big advantage in
that it does not assume spherical/elliptical clusters. So it seems
like a superior method. I just don’t know if it has an advantage with
regard to selecting the number of clusters. One way or another, you end
up having to choose a relatively arbitrary parameter.

Comments welcome.

I think 1/3 is a reasonable default for the coverage parameter - my guess is that most problems will work well with that choice. Whereas there’s no obvious default for ‘k’ in k-means (although there are various algorithms you can use to choose it).

I think that’ll depend on the application area. For example, in medicine, cluster analysis is applied to gene expression data to identify disease types. In that case, the number of clusters is often approximately known from clinical practice (by observing distinct disease progression patterns).

One way to test this question would be to find out whether mean-shift clustering with the default coverage of 1/3 produces the clinically expected number of clusters.


Thanks Lin for your work, the transcript files are extremely helpful for looking up content quickly. That is a big time saver when I want to review a specific topic. I am just wondering, do you have a list of all the transcript URLs we have so far? I think it might be a lot more convenient to have them all listed in one place. Thanks!


Glad you can use them. They all live in one Google Drive directory:

Let me know if you have trouble accessing it. If you are grep’ing, you might want to grep -i because sometimes my capitalization is creative.


Thank you! Again, great work Lin!