Lesson 10 official topic

Here is my understanding, maybe it helps (if I am wrong, someone please correct me):
[screenshot: detail0]



For more details one can look at the step function via
??scheduler.step
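
Outside a notebook, the same source can be pulled up with Python's inspect module (a small sketch; the class is assumed to be diffusers' LMSDiscreteScheduler discussed below):

import inspect
from diffusers import LMSDiscreteScheduler

# Prints the source of step(), same as ??scheduler.step in IPython/Jupyter.
print(inspect.getsource(LMSDiscreteScheduler.step))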

2 Likes

Ah, I finally get scheduler.step! I was getting confused because for some reason, I was thinking that the unet also removes noise. But no, it’s the scheduler that’s removing the noise through the step method.

I’m still a bit confused over latents = latents * scheduler.init_noise_sigma and inp = scheduler.scale_model_input though.

Judging from what I’m reading in the screenshots, I think the first line scales the noise to a certain distribution, and the second line further scales the noise to match the K-LMS algorithm.

So is the first line preparing a foundation of sorts from which noise can be further scaled to match whatever algorithm we wish?

But that brings up another question: if my understanding above is correct, the scheduler object we have was instantiated from a class that is for the K-LMS algorithm. So why does the noise have to be scaled twice?

scheduler = LMSDiscreteScheduler(...)
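
For concreteness, this is roughly how such a scheduler gets set up (a sketch with assumed Stable Diffusion v1 defaults, not the exact values from the notebook):

from diffusers import LMSDiscreteScheduler

scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012,
                                 beta_schedule="scaled_linear", num_train_timesteps=1000)
scheduler.set_timesteps(50)         # 50 inference steps
print(scheduler.init_noise_sigma)   # largest noise level of the schedule (about 14.6 with these settings)
print(scheduler.sigmas[:3])         # sigmas run from noisiest to cleanest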

No, the U-Net's task (as far as I understand) is to predict the noise residuals.
Today and tomorrow I am very busy at work.
I will try to look at it in more detail on Friday or Saturday and write a more detailed explanation.
But what we can see directly: at the beginning we can almost neglect the +1 in the square root, so we rescale to unit variance.

Yeah, the U-Net predicts the noise in the image, which is then subtracted by the scheduler.

Sure! Explain when you have the time. :slightly_smiling_face:

Yes, neglecting the +1 means we are approximately rescaling to an N(0, 1) distribution. :thinking:
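
A quick numeric sanity check of that point (the sigma value below is just an assumed, illustrative one, roughly the largest sigma of the default SD schedule):

sigma = 14.6                      # illustrative: roughly the largest sigma of the default schedule
print((sigma**2 + 1) ** 0.5)      # about 14.63 -- at high noise levels the +1 barely matters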

for i, t in enumerate(tqdm(sched.timesteps)):
  # For CFG
  inp = torch.cat([lats] * 2)
  inp = sched.scale_model_input(inp, t)

  # Predict noise residual.
  with torch.no_grad(): pred = unet(inp, t, encoder_hidden_states=txt_embs).sample

  # Perform guidance.
  pred_uncond, pred_txt = pred.chunk(2)
  pred = pred_uncond + g_scale * (pred_txt - pred_uncond)

  # Compute the "previous" noisy sample.
  #  Not quite sure what's happening here.
  lats = sched.step(pred, t, lats).prev_sample

Let’s explain line by line.

lats = lats * sched.init_noise_sigma

this does the initial scaling to N(0, \sigma_T^2), where \sigma_T is the noise level (standard deviation) for the last timestep of our schedule (we start with T, then T-1, …, then finally 1).
Note that \sigma_1 > \sigma_0 etc. We gradually tune down the noise in our schedule.
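
A quick statistics sketch of that first line (with \sigma_T hard-coded to an assumed, illustrative value):

import torch

lats = torch.randn(1, 4, 64, 64)       # unit-variance Gaussian latents
sigma_T = 14.6                         # illustrative stand-in for sched.init_noise_sigma
print(lats.std().item())               # roughly 1.0
print((lats * sigma_T).std().item())   # roughly 14.6 -- the latents now match the noisiest point of the schedule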
Then we loop over the scheduler timesteps.
Note: As mentioned above we start from the last timestep (with largest noise) and then go down one timestep for each iteration.

inp = torch.cat([lats] * 2)

simply copies the latent vector into two samples. Why? Because we also have two prompts in our text embedding: the unconditional empty-string embedding and our prompt.
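
A minimal shape-only sketch of this duplication (random tensors stand in for the real CLIP embeddings and latents):

import torch

uncond_emb = torch.randn(1, 77, 768)             # stand-in for the empty-string CLIP embedding
text_emb   = torch.randn(1, 77, 768)             # stand-in for the prompt embedding
txt_embs   = torch.cat([uncond_emb, text_emb])   # [2, 77, 768]

lats = torch.randn(1, 4, 64, 64)                 # one latent image
inp  = torch.cat([lats] * 2)                     # [2, 4, 64, 64] -- one copy per embedding
print(txt_embs.shape, inp.shape)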
Next step:

inp = sched.scale_model_input(inp, t)

according to the documentation of the scheduler, this step is only needed for algorithmic purposes. In the docstring we find:

    def scale_model_input(
        self, sample: torch.FloatTensor, timestep: Union[float, torch.FloatTensor]
    ) -> torch.FloatTensor:
        """
        Scales the denoising model input by `(sigma**2 + 1) ** 0.5` to match the K-LMS algorithm.
        Args:
            sample (`torch.FloatTensor`): input sample
            timestep (`float` or `torch.FloatTensor`): the current timestep in the diffusion chain
        Returns:
            `torch.FloatTensor`: scaled input sample
        """
        if isinstance(timestep, torch.Tensor):
            timestep = timestep.to(self.timesteps.device)
        step_index = (self.timesteps == timestep).nonzero().item()
        sigma = self.sigmas[step_index]
        sample = sample / ((sigma**2 + 1) ** 0.5)
        self.is_scale_input_called = True
        return sample

So this just grabs \sigma_t and rescales the variance of the input to make the algorithm work.
Next step is the unet, which just predicts the noise residuals. Then we perform guidance.
The larger we set the guidance constant g_{scale}, the more we bias the prediction towards the prompt.
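
To see what the guidance line does numerically, here is a toy example (the numbers are made up):

import torch

pred_uncond = torch.tensor([0.10])     # made-up "unconditional" noise prediction
pred_txt    = torch.tensor([0.30])     # made-up prompt-conditioned noise prediction

for g_scale in (1.0, 7.5, 15.0):
    pred = pred_uncond + g_scale * (pred_txt - pred_uncond)
    print(g_scale, round(pred.item(), 2))   # 0.3, 1.6, 3.1 -- larger g_scale pushes further toward the prompt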
Then we use the step function of the scheduler:
This is the code:

    def step(
        self,
        model_output: torch.FloatTensor,
        timestep: Union[float, torch.FloatTensor],
        sample: torch.FloatTensor,
        order: int = 4,
        return_dict: bool = True,
    ) -> Union[LMSDiscreteSchedulerOutput, Tuple]:
        """
        Predict the sample at the previous timestep by reversing the SDE. Core function to propagate the diffusion
        process from the learned model outputs (most often the predicted noise).
        Args:
            model_output (`torch.FloatTensor`): direct output from learned diffusion model.
            timestep (`float`): current timestep in the diffusion chain.
            sample (`torch.FloatTensor`):
                current instance of sample being created by diffusion process.
            order: coefficient for multi-step inference.
            return_dict (`bool`): option for returning tuple rather than LMSDiscreteSchedulerOutput class
        Returns:
            [`~schedulers.scheduling_utils.LMSDiscreteSchedulerOutput`] or `tuple`:
            [`~schedulers.scheduling_utils.LMSDiscreteSchedulerOutput`] if `return_dict` is True, otherwise a `tuple`.
            When returning a tuple, the first element is the sample tensor.
        """
        if not self.is_scale_input_called:
            warnings.warn(
                "The `scale_model_input` function should be called before `step` to ensure correct denoising. "
                "See `StableDiffusionPipeline` for a usage example."
            )

        if isinstance(timestep, torch.Tensor):
            timestep = timestep.to(self.timesteps.device)
        step_index = (self.timesteps == timestep).nonzero().item()
        sigma = self.sigmas[step_index]

        # 1. compute predicted original sample (x_0) from sigma-scaled predicted noise
        if self.config.prediction_type == "epsilon":
            pred_original_sample = sample - sigma * model_output
        elif self.config.prediction_type == "v_prediction":
            # * c_out + input * c_skip
            pred_original_sample = model_output * (-sigma / (sigma**2 + 1) ** 0.5) + (sample / (sigma**2 + 1))
        elif self.config.prediction_type == "sample":
            pred_original_sample = model_output
        else:
            raise ValueError(
                f"prediction_type given as {self.config.prediction_type} must be one of `epsilon`, or `v_prediction`"
            )

        # 2. Convert to an ODE derivative
        derivative = (sample - pred_original_sample) / sigma
        self.derivatives.append(derivative)
        if len(self.derivatives) > order:
            self.derivatives.pop(0)

        # 3. Compute linear multistep coefficients
        order = min(step_index + 1, order)
        lms_coeffs = [self.get_lms_coefficient(order, step_index, curr_order) for curr_order in range(order)]

        # 4. Compute previous sample based on the derivatives path
        prev_sample = sample + sum(
            coeff * derivative for coeff, derivative in zip(lms_coeffs, reversed(self.derivatives))
        )

        if not return_dict:
            return (prev_sample,)

        return LMSDiscreteSchedulerOutput(prev_sample=prev_sample, pred_original_sample=pred_original_sample)

I think what we do here is simply take our latents, our predicted noise for the current timestep, and the current amount of noise.
Then we return what we believe to be the latent at timestep t-1.
The returned sample should have noise level \sigma_{t-1}.
From the code above it's also clear how we use our predicted noise:
derivative = (sample - pred_original_sample) / \sigma
We take our sample, subtract the predicted clean sample and rescale by \sigma; for epsilon prediction this is exactly the predicted noise. (Note that if we predicted the noise perfectly, then pred_original_sample would be the fully denoised picture.)
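
Here is a tiny self-contained check of steps 1 and 2 of step() for prediction_type="epsilon" (made-up tensors); it shows that the derivative is exactly the U-Net's predicted noise:

import torch

sigma        = torch.tensor(2.0)
sample       = torch.randn(4)          # stands in for the current noisy latent
model_output = torch.randn(4)          # stands in for the U-Net's predicted noise

pred_original_sample = sample - sigma * model_output    # step 1: estimate of the clean latent x_0
derivative = (sample - pred_original_sample) / sigma    # step 2
print(torch.allclose(derivative, model_output))         # True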
Then we use this to take a step in the right direction (Jeremy explained this in more detail in lecture 9, I think, and made an analogy to the usual learning process in DL).
Then we reiterate this process until we reach the last timestep, where the returned sample should have noise level \sigma = 0.
If we decode this, it gives us a non-noisy, ordinary picture.
I hope this made things more clear.
Essentially what we try to do is the following:
Assume someone took a picture and added more and more noise to it at every timestep. How can we learn to reverse this process? As guidance for what the picture should contain (and not contain) we have our prompt (and potentially a negative prompt instead of uncond_text), as well as the noisy (and, towards the end, less and less noisy) picture. We predict the noise in the picture and use this information to denoise the picture gradually at each timestep.

2 Likes

I appreciate the detailed explanation! This helped clear things up.

One thing that’s not yet quite making sense to me though is why we calculate the latent at the previous timestep.
lats = sched.step(pred, t, lats).prev_sample
The latent has already been denoised with the following line.
pred = pred_uncond + g_scale * (pred_txt - pred_uncond)
So can’t we simply let lats = pred for the next loop?

Note that the PREVIOUS STEP in time is the NEXT STEP in our loop ;-).
We progress backwards in time.
Also, if you set lats = pred, that would not make sense: pred is a linear combination of the U-Net outputs, and the U-Net predicts noise; we want to get the original image.
We use the predictions to denoise our latents gradually! They are not a prediction of the latents at the previous timestep.
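
To see the backwards-in-time ordering directly, here is a small sketch with assumed default LMS parameters:

from diffusers import LMSDiscreteScheduler

sched = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012,
                             beta_schedule="scaled_linear", num_train_timesteps=1000)
sched.set_timesteps(5)
print(sched.timesteps)   # roughly [999.00, 749.25, 499.50, 249.75, 0.00] -- strictly decreasing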

3 Likes

Ahh, I think I get it now! The timestep goes down in each subsequent iteration of the loop. Subtle detail heh.

Also, if you set lats = pred, that would not make sense: pred is a linear combination of the U-Net outputs, and the U-Net predicts noise; we want to get the original image.

Yeah, that makes sense. I was jumbling up the process in my mind.
pred = pred_uncond + g_scale * (pred_txt - pred_uncond)
pred_uncond is the unconditional noise prediction, while pred_txt is the prompt-conditioned noise prediction. I was thinking one of them was the latent for some reason.

Thank you for your help!

1 Like

After doing this lesson, I managed to implement my own custom stable diffusion class using the Diffusers library, and that was pretty satisfying heh.

I also managed to implement callbacks and negative prompts too!

Negative Prompts

Prompt = ‘An antique 18th century painting of a gorilla eating a plate of chips.’

Negative Prompt = ‘plate’
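
The usual way to implement this (a hedged sketch with assumed model names, not necessarily exactly what the class in the blog post does) is to encode the negative prompt in place of the empty string used for the unconditional half of classifier-free guidance:

import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Assumed: the CLIP text encoder used by Stable Diffusion v1.x.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def embed(text):
    toks = tokenizer([text], padding="max_length",
                     max_length=tokenizer.model_max_length,
                     truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(toks.input_ids)[0]   # [1, 77, 768]

# The negative prompt replaces "" as the "unconditional" embedding; guidance then pushes away from it.
txt_embs = torch.cat([embed("plate"),
                      embed("An antique 18th century painting of a gorilla eating a plate of chips.")])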

Callbacks

Prompt = ‘A toaster in the style of Jony Ive; modern; realistic; different; apple; form over function’

However, I didn’t quite manage to get image-to-image working though.

I’ve written up how I implemented my own class in the blog post below.

4 Likes

After learning about how progressive distillation works, I can’t help but wonder if I’ve thought of a way to improve it:

Rather than having a single student approximate all the steps of the teacher’s denoising at once, why not have each round of progressive distillation involve twice the number of student UNets? This way, each student could specialize in a single pair of stages of denoising, which would be quite different tasks depending on how early or late in the process those stages are. I hypothesize that this would improve the quality of inferences, which in turn would make it possible to reduce the number of diffusion stages in the final distilled system.

Downsides of this would include 1) much more space/RAM required for the finished distilled model, which would actually be several models stapled together, and 2) no flexibility in terms of the number of denoising stages you want to use when the model is deployed. So there are big tradeoffs, but you could get the performance boost of distillation with higher-quality results (or better performance with identical-quality results, or somewhere in between those two).

Am I making sense?

1 Like

I’m having trouble getting the stable diffusion notebook to run locally. I have pip3 installed diffusers and transformers. However, I get the following errors (which I can’t find references to in Google search):

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\diffusers\pipelines\__init__.py:47
     45 from ..utils.dummy_torch_and_transformers_objects import *  # noqa F403
     46 else:
---> 47 from .alt_diffusion import AltDiffusionImg2ImgPipeline, AltDiffusionPipeline
     48 from .audioldm import AudioLDMPipeline
     49 from .controlnet import (
     50     StableDiffusionControlNetImg2ImgPipeline,
     51     StableDiffusionControlNetInpaintPipeline,
     52     StableDiffusionControlNetPipeline,
     53 )

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\diffusers\pipelines\alt_diffusion\__init__.py:31
     27 nsfw_content_detected: Optional[List[bool]]
     30 if is_transformers_available() and is_torch_available():
---> 31 from .modeling_roberta_series import RobertaSeriesModelWithTransformation
     32 from .pipeline_alt_diffusion import AltDiffusionPipeline
     33 from .pipeline_alt_diffusion_img2img import AltDiffusionImg2ImgPipeline

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\diffusers\pipelines\alt_diffusion\modeling_roberta_series.py:6
      4 import torch
      5 from torch import nn
----> 6 from transformers import RobertaPreTrainedModel, XLMRobertaConfig, XLMRobertaModel
      7 from transformers.utils import ModelOutput
     10 @dataclass
     11 class TransformationModelOutput(ModelOutput):

ImportError: cannot import name 'RobertaPreTrainedModel' from 'transformers' (C:\Users\<user>\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\__init__.py)

Any ideas?

I found generators a bit tricky to understand properly, so I’ve put together a list of 10 practice questions on generators in Python. You can find both the questions and solutions here: fastai-p2/generators_practice_questions.txt at main · karthikven/fastai-p2 · GitHub.
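
As a quick warm-up before the questions, here is a toy generator (not taken from the linked file):

def countdown(n):
    # Yields n, n-1, ..., 1 lazily, one value per next() call.
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(next(gen))   # 3
print(list(gen))   # [2, 1] -- the generator resumes where it left off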