Lesson 12 official topic

Of course, there’s also a category-theoretic way of understanding this: differentiation is a functor that takes you from the category of functions on \mathbb{R} to a “weird” version of this category in which functions are “composed” by multiplying them together. (Exercise for the reader: What are the identity arrows in this latter category?) The chain rule is simply the statement that this is indeed a functor.
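
For concreteness, the functoriality claim is just the chain rule written so that the right-hand side is the "weird" composition, i.e. a product of derivatives (with the inner one evaluated at f(x)):

(g \circ f)'(x) = g'(f(x)) \cdot f'(x)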

This isn’t particularly helpful for understanding the chain rule; but it comes in handy if you want to understand something called “automatic differentiation” (which is the context in which I came across this idea).
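
As a toy illustration (my own minimal sketch, not from the lesson): forward-mode automatic differentiation with dual numbers, where composing functions really does just multiply their derivatives, exactly as the chain rule says.

# Minimal forward-mode autodiff with dual numbers (a sketch, not a library).
# A Dual carries a value and a derivative; every operation applies the chain rule.
class Dual:
    def __init__(self, val, grad=0.0):
        self.val, self.grad = val, grad
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.grad + other.grad)
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)
    __radd__, __rmul__ = __add__, __mul__

def f(x): return 3 * x + 2   # slope 3
def g(x): return x * x       # slope 2*x

x = Dual(5.0, 1.0)           # seed dx/dx = 1
y = g(f(x))
print(y.val, y.grad)         # 289.0 102.0, and 102 = 2*(3*5+2) * 3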

1 Like

I found these two videos from Karpathy that explain backprop really well.

12 Likes

Here is the website Explain Paper, which has been making the rounds on social media. It explains a research paper by answering questions about it using GPT-3.

It’s worth checking out.

1 Like

I’ve reviewed the meanshift notebook and added an alternative torch implementation and some plots that should help in understanding how the data, batching, and the various steps contribute to the final result.

Here you can find the complete notebook: fastaisf/02_meanshift-modified.ipynb at master · artste/fastaisf · GitHub

Spoiler alert: it contains some einsum stuff :rofl:

6 Likes

But under the calculus of infinitesimals, it is kosher, because it is a fraction :smiley:

6 Likes

Well, that takes us into nonstandard analysis, which requires an entirely separate treatment (which I’m not qualified to provide)!

1 Like

By the way, I thought of a more intuitive way of explaining why composing two linear functions multiplies their slopes: You have to think of a linear function as an affine transformation of the real line. E.g., the function f(x) = mx + c scales the real line by a factor of m, and then shifts it by c.

Now, what happens if we compose f(x) = mx + c and g(x) = nx + d? Well, first we scale by m, then shift by c, then scale by n, then shift by d. To make this easier to imagine, think about what happens to a unit interval. It should be obvious that after applying f, we go from an interval of size 1 to an interval of size m. Then, after applying g, we go from an interval of size m to an interval of size nm! Hence, the slopes (i.e. the scaling factors) are multiplied under composition.
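
A quick numerical sanity check of this picture (a throwaway sketch; the values of m, c, n, d are just ones I picked):

# Composing two affine maps: the slopes multiply.
m, c = 3.0, 1.0              # f(x) = m*x + c
n, d = 5.0, 2.0              # g(x) = n*x + d

f = lambda x: m * x + c
g = lambda x: n * x + d

# Push a unit interval [0, 1] through the composite g(f(x)):
lo, hi = g(f(0.0)), g(f(1.0))
print(hi - lo)               # 15.0 == n * m, the slope of the composite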

1 Like

A bit random… but I thought I’d share anyway…

Playing with @HuggingFace Spaces/Gradio. I had wanted to wire this up to Stable Diffusion but it’s a bit too taxing for the “free tier” CPU instances…

Give my interactive 3D depth viewer (three.js) a try by just dragging and dropping an image onto the Spaces app and dragging around with a mouse (or your finger). There are some other Spaces that attempt to do this… But I think mine works a lot better… Correcting for the “camera view pyramid” etc…

https://huggingface.co/spaces/johnrobinsn/MidasDepthEstimation…

Uses the MiDaS depth estimation model to estimate a depth map from pixels.

https://arxiv.org/abs/1907.01341

My twitter post if you’d like to give me a like :slight_smile:

Thanks!
John

4 Likes

What about non-linear functions? How should we think about them in this regard?

The idea is that when we are talking about derivatives, we can think of any differentiable function as being approximately linear around each point. So, for the purpose of computing the derivative of g(f(x)) at x=x_0, we can behave as if f(x) is simply the tangent of f at x_0, and g(u) is simply the tangent of g at f(x_0).
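
Here is a small check of that with torch (my own throwaway example, using f(x) = sin(x) and g(u) = u**2): the derivative of the composite at a point matches the product of the two tangent slopes.

import torch

x0 = torch.tensor(0.7, requires_grad=True)

def f(x): return torch.sin(x)   # inner function
def g(u): return u ** 2         # outer function

# Derivative of the composite via autograd
g(f(x0)).backward()
print(x0.grad)                                # d/dx g(f(x)) at x0

# Product of the tangent slopes: g'(f(x0)) * f'(x0) = 2*sin(x0)*cos(x0)
with torch.no_grad():
    print(2 * torch.sin(x0) * torch.cos(x0))  # same value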

2 Likes

This makes sense. Thanks.

1 Like

I found this huge list of Stable Diffusion resources: SD RESOURCE GOLDMINE.

3 Likes

Totally agree, The spelled-out intro to neural nets is a masterpiece!
It really helps you to connect the dots and have a good grasp of what is happening under the hood.

A good resource to watch while looking forward to the next lesson on backprop :fire:

7 Likes

Jax/Flax vs PyTorch

Is Jax/Flax just Google’s answer to Torch? What are the pros/cons vs Torch? I’ve just been looking at random colabs etc. and seeing some in Jax. Are there any particular advantages (i.e. things you can do in Jax that you can’t do in torch)? Does anyone have experience with both torch and jax?

Thanks Much
John

Jax is perhaps slightly lower-level than PyTorch, in that it provides what is basically numpy with autograd magic and XLA compilation for FAST execution on GPUs/TPUs. I haven’t used it much, but here are my impressions, roughly from pros to cons:

  • Very fast when you use all the JIT compilation magic, and great for making use of TPUs
  • Low-level, which can be fun for learning and makes you feel less reliant on a bunch of high-level libraries and APIs
  • vmap() means you can write a function that works for a single example and it’ll turn it into one that works with batches of data (see the sketch after this list)
  • Some interesting libraries emerging like equinox which I enjoyed dabbling with
  • Jax sort of forces a specific kind of coding on you, which can feel weird at first but does seem somehow… elegant? Does take a bit of getting used to though.
  • You will be forced to think about random numbers and state, which is good but also extra work
  • Definitely harder to debug (although they’re working on that)
  • Smaller ecosystem, so you’ll have to do more things yourself vs PyTorch land where you can usually find a library or some code that does what you want.
  • Ditto for documentation, small but growing.
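
Here’s roughly what the vmap() point looks like in practice (a minimal sketch with made-up data):

import jax
import jax.numpy as jnp

# A function written for a *single* example: squared distance between two points.
def sq_dist(a, b):
    return jnp.sum((a - b) ** 2)

a = jnp.ones((4, 3))            # batch of 4 points
b = jnp.zeros((4, 3))

# vmap turns the per-example function into a batched one (maps over axis 0).
print(jax.vmap(sq_dist)(a, b))  # [3. 3. 3. 3.]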

At the moment the usual joke is that people using Jax are mostly Google folks who don’t want to use TensorFlow, but I think it’s cool that it exists and suspect it’ll keep growing into a more and more useful part of the overall ecosystem.

15 Likes

Thanks for the great overview!

Homework 1: Is the below the right approach, or can squaring & summing be done in one operation?

a = (x - X)
b = torch.einsum('ik,ik->ik', [a, a])
c = torch.einsum('ij->i', [b])
c[:8].sqrt()

2 Likes

I came up with…

a = x-X
b = torch.einsum('ij,ij->i', a, a).sqrt()

By dropping the j dimension after the arrow, you can do the sum (over dim 1).
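
If it helps to see them side by side, here’s a tiny check (with made-up data, since x and X aren’t defined in the snippets above) that the single einsum matches squaring and summing step by step:

import torch

a = torch.randn(5, 2)                        # stand-in for x - X
b1 = torch.einsum('ij,ij->i', a, a).sqrt()   # square and sum in one einsum
b2 = (a * a).sum(dim=1).sqrt()               # same thing, done separately
print(torch.allclose(b1, b2))                # True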

6 Likes

Why do you add *70-35 after the torch.rand() call below? Is it there for more randomness?

centroids = torch.rand(n_clusters, 2)*70-35

That’s how you create a uniform random variable between -35 and 35. No particular reason I picked those params - just wanted something that was generally not too big and not too small for talking through.
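
In other words (a quick illustration with made-up sizes):

import torch

r = torch.rand(1000, 2)       # uniform in [0, 1)
centroids = r * 70 - 35       # now uniform in [-35, 35)
print(centroids.min().item(), centroids.max().item())  # roughly -35 and 35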