# Lecture 9B question: what are we predicting in (Sohl-Dickstein et al. 2015)?

I am posting my question about Lesson 9B ("the math of diffusion") on YouTube here, as the forum link in the video description is broken.

In diffusion lectures 9 and 9B, the models are explained in terms of predicting noise. To my understanding, this framing is consistent with DDPM and subsequent models; however, (Sohl-Dickstein et al. 2015) appear to predict the mean and covariance of normal distributions rather than the noise:

> During learning only the mean and covariance for a Gaussian diffusion kernel, or the bit flip probability for a binomial kernel, need be estimated. As shown in Table App.1, $\mathbf{f}_\mu\left(\mathbf{x}^{(t)}, t\right)$ and $\mathbf{f}_\Sigma\left(\mathbf{x}^{(t)}, t\right)$ are functions defining the mean and covariance of the reverse Markov transitions

Consequently, I have the following questions:

1. Are (Sohl-Dickstein et al. 2015) indeed predicting the mean and covariance, or is that just a matter of mathematical notation, and in practice the neural networks are trained to predict noise?

If the former is the case and we are training neural networks to predict the parameters of normal distributions, then:

1. What is the basic structure of the neural networks? Do they take a noisy image $\mathbf{x}^{(t)}$ and a timestep $t$ as input and return a matrix of mean and variance values for each pixel?

2. Given networks $\mathbf{f}_\mu$ and $\mathbf{f}_\Sigma$, how do we then generate images? Is it just a matter of sampling each pixel's value from its corresponding normal distribution, where each pixel's distribution is defined by the parameters predicted by the neural net?
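To make question 2 concrete, here is a rough sketch of the sampling loop I have in mind. The placeholder `f_mu` / `f_sigma` functions are entirely my own invention (in the paper they would be trained networks); I only want to check that the *shape* of the computation is right:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for f_mu and f_sigma: in the paper these would be neural
# networks taking the noisy image x_t and the timestep t, and returning a
# per-pixel mean and (diagonal) variance. The forms below are placeholders,
# not anything from the paper.
def f_mu(x_t, t):
    return 0.99 * x_t                 # placeholder: shrink toward zero

def f_sigma(x_t, t):
    return 0.01 * np.ones_like(x_t)   # placeholder: constant variance

T = 1000
x = rng.standard_normal((28, 28))     # start from pure noise x_T
for t in range(T, 0, -1):
    mu = f_mu(x, t)
    var = f_sigma(x, t)
    # each pixel is sampled independently from its own Gaussian
    x = mu + np.sqrt(var) * rng.standard_normal(x.shape)

print(x.shape)  # (28, 28)
```

Is this per-pixel ancestral sampling what the paper's reverse process amounts to?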

Answers to any questions above would be very welcome!

Although (Sohl-Dickstein et al. 2015) provide a reference implementation, I found it difficult to extract the answers to the questions above from it, as the code is written in a fairly obscure framework (Blocks).

DDPM and Sohl-Dickstein et al. are different. The DDPM paper explains why switching to predicting just the noise works well. Since DDPM, Karras et al. and others have used a mix of noise and image prediction.
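For what it's worth, the two parameterizations carry the same information: in DDPM the reverse-process mean is an affine function of $\mathbf{x}^{(t)}$ and the noise, so a noise-predicting network implicitly defines an $\mathbf{f}_\mu$. A minimal numerical sketch, assuming the linear beta schedule from the DDPM paper (the helper name `mu_from_eps` is mine):

```python
import numpy as np

# Linear beta schedule as in the DDPM paper (Ho et al. 2020).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def mu_from_eps(x_t, t, eps):
    """Reverse-process mean recovered from a (predicted) noise term:
    mu = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)."""
    return (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])

# Sanity check: with the *true* forward-process noise, this matches the
# closed-form posterior mean of q(x_{t-1} | x_t, x_0).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
t = 500
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

posterior_mu = (
    np.sqrt(alpha_bars[t - 1]) * betas[t] * x0
    + np.sqrt(alphas[t]) * (1.0 - alpha_bars[t - 1]) * x_t
) / (1.0 - alpha_bars[t])

print(np.allclose(mu_from_eps(x_t, t, eps), posterior_mu))  # True
```

So the difference is mostly in what the network outputs directly; DDPM's observation is that the noise parameterization trains better in practice.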

Karras et al. is probably the best paper to read to understand the options and the differences between them.

I’ve not attempted to implement Sohl-Dickstein et al. directly myself, because it’s extremely slow and more complex than DDPM and the approaches that followed it.


Thank you.