Lecture 9B question: what are we predicting in (Sohl-Dickstein et al. 2015)?

I am posting a question about Lesson 9B (the math of diffusion, on YouTube) here, as the forum link in the video description is broken.

In diffusion lectures 9 and 9B, the models are explained in terms of predicting noise. As I understand it, that explanation is consistent with DDPM and subsequent models. However, (Sohl-Dickstein et al. 2015) appear to predict the mean and covariance of normal distributions rather than the noise:

During learning only the mean and covariance for a Gaussian diffusion kernel, or the bit flip probability for a binomial kernel, need be estimated. As shown in Table App.1, $\mathbf{f}_\mu\left(\mathbf{x}^{(t)}, t\right)$ and $\mathbf{f}_\Sigma\left(\mathbf{x}^{(t)}, t\right)$ are functions defining the mean and covariance of the reverse Markov transitions

Consequently, I have the following questions:

  1. Are (Sohl-Dickstein et al. 2015) indeed predicting the mean and covariance, or is that just a matter of mathematical notation, with the neural networks in practice being trained to predict noise?

If the former is the case and we are training neural networks to predict the parameters of normal distributions, then:

  1. What is the basic structure of our neural networks? Do they take a noisy image $\mathbf{x}^{(t)}$ and a timestep $t$ as input and return a mean and a variance value for each pixel?

  2. Given networks $\mathbf{f}_\mu$ and $\mathbf{f}_\Sigma$, how do we then generate images? Is it just a matter of sampling each pixel’s value from its corresponding normal distribution, where each pixel’s normal distribution is defined by parameters predicted by a neural net?

Answers to any questions above would be very welcome!

Although (Sohl-Dickstein et al. 2015) provide a reference implementation, I found it difficult to answer these questions from it, as the code is written in a fairly obscure framework (Blocks).

DDPM and Sohl-Dickstein et al are different. The DDPM paper explains why switching to just predicting noise works well. Since DDPM, Karras et al and others have used a mix of noise and image prediction.
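
To make the link between the two views concrete, here is a minimal NumPy sketch of a single reverse step in the DDPM style. It assumes the linear beta schedule and the fixed per-pixel variance $\beta_t$ used in the DDPM paper (Sohl-Dickstein et al. learn the covariance instead), and `predict_noise` is a hypothetical stand-in for a trained network, not anything from either paper's code. The point is that a noise prediction determines the mean of the Gaussian reverse transition, so generation is still "sample each pixel from a normal distribution" in both formulations.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed, as in DDPM); returns betas, alphas, cumulative alphas."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def mean_from_noise(x_t, eps_pred, t, betas, alphas, alpha_bars):
    """Mean of the reverse transition implied by a noise prediction (DDPM parameterization)."""
    scaled_noise = betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred
    return (x_t - scaled_noise) / np.sqrt(alphas[t])

def reverse_step(mean, variance, rng):
    """Draw x_{t-1} ~ N(mean, variance), independently for each pixel."""
    return mean + np.sqrt(variance) * rng.standard_normal(mean.shape)

def sample(predict_noise, shape, num_steps=1000, seed=0):
    """Ancestral sampling loop; predict_noise(x, t) is a hypothetical trained model."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(num_steps)):
        eps_pred = predict_noise(x, t)
        mean = mean_from_noise(x, eps_pred, t, betas, alphas, alpha_bars)
        # DDPM uses a fixed variance beta_t and adds no noise on the final step.
        x = reverse_step(mean, betas[t], rng) if t > 0 else mean
    return x
```

A network that outputs the mean directly (and, in Sohl-Dickstein et al., the covariance as well) would replace `mean_from_noise` and the fixed `betas[t]`; the sampling step itself would stay the same.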

Karras is probably the best paper to read to understand the options and differences between them.
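
As a rough illustration of what a "mix of noise and image prediction" can look like, the sketch below computes the preconditioning coefficients from Karras et al. (2022), assuming that is the paper meant here. There the denoiser is written as $D(\mathbf{x}; \sigma) = c_\text{skip}(\sigma)\,\mathbf{x} + c_\text{out}(\sigma)\,F(c_\text{in}(\sigma)\,\mathbf{x})$, where $F$ is the raw network. The function name `edm_coefficients` is made up for this example, and `sigma_data=0.5` is the default value used in that paper.

```python
import math

def edm_coefficients(sigma, sigma_data=0.5):
    """Skip, output, and input scalings for the preconditioned denoiser at noise level sigma."""
    total = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    return c_skip, c_out, c_in

for sigma in (0.01, 0.5, 80.0):
    c_skip, c_out, c_in = edm_coefficients(sigma)
    print(f"sigma={sigma:6.2f}  c_skip={c_skip:.4f}  c_out={c_out:.4f}  c_in={c_in:.4f}")
```

At low noise levels `c_skip` is close to 1, so the network only has to predict a small correction to its input (effectively the noise). At high noise levels `c_skip` is close to 0, so the network output is used almost directly as the predicted image.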

I’ve not attempted to implement Sohl-Dickstein et al directly myself because it’s extremely slow and more complex than DDPM and the approaches that followed.


Thank you.