Lesson 9 official topic

Thanks a lot! You are right about the solution, which is also suggested here.

1 Like

Hello, there.

I was wondering, how is the CLIP text embedding factored into the autoencoder’s loss? Is it simply concatenated to the pixels? If so, are the features of the embedding just weighted the same as the image pixels?

I assume dimensions are quite different, so getting the image right, in the end, would be much more important than getting the embedding right. That is, variations in a small (unimportant) subregion of the image would be more penalized than variations in the embedding. I suspect the details will be discussed in further lessons but I am curious!

3 Likes

You can also think of it this way: optimization doesn't happen only during training; it can also happen during inference. Both are optimization processes; the question is what it is that you are optimizing and tweaking. During training, you are typically tweaking the weights of the network. During inference of some generative systems, you are tweaking a set of parameters, and those parameters may in fact be latent compressed representations of the image's pixels, or the pixels themselves (depending on the system: latent diffusion models vs. standard diffusion, etc.).

In connection with that, it is very interesting to explore what Yann LeCun often talks about in relation to self-supervised systems. In many cases there is a lot of uncertainty associated with the potential result at inference time. For example, when you try to predict the next frame of a video, there are many plausible continuations at each stage, so LeCun often speaks about optimization processes that take place during inference to find the right latent variable, the one that leads to the best result. This then takes you to contrastive vs. regularized learning and what he calls energy-based models, which in some ways simplify probabilistic modelling (you get rid of the normalization constant, etc.). The point is that we are moving more and more towards thinking of inference as an optimization process in itself.
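To make that concrete, here is a toy sketch of inference-time optimization (the decoder, the loss, and the sizes are all made up; the only point is that the weights stay frozen and the gradient steps are taken on the latent itself):

import torch

# Stand-in for a frozen, pretrained decoder that maps a latent vector to an image.
decoder = torch.nn.Linear(64, 3 * 32 * 32).eval()
for p in decoder.parameters():
    p.requires_grad_(False)

def badness(image_flat):
    # Stand-in for "how far is this image from what we want" (e.g. a CLIP-based loss).
    return image_flat.pow(2).mean()

# At inference we do NOT touch the weights; we optimize the latent itself.
z = torch.randn(1, 64, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

for step in range(100):
    opt.zero_grad()
    loss = badness(decoder(z))
    loss.backward()
    opt.step()

print(loss.item())  # the latent has been nudged towards a "better" image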

2 Likes

Yes, that makes sense. I was thinking a bit smaller scale (mostly from the perspective of software), but I can somewhat see how this also would apply to reinforcement learning and how it kinda all builds a huge unified framework of thinking :slight_smile:

1 Like

Yes, I’d love to see this addressed in one of the lectures. According to one of the authors (Tom Goldstein, the lab’s PI), Cold Diffusion violates all of the assumptions that go into our theoretical understanding. He put out a really interesting Twitter thread here: https://twitter.com/tomgoldsteincs/status/1562503814422630406

2 Likes

“It is intuitively possible that, in high dimensions, the word vectors form a smaller dimensional subspace within the full embedding space, and the positional vectors form a different smaller dimensional subspace approximately orthogonal to the one spanned by word vectors. Thus despite vector addition, the two subspaces can be manipulated essentially independently of each other by some single learned transformation. Thus, concatenation doesn’t add much, but greatly increases cost in terms of parameters to learn.”

This still seems rather odd to me. If a linear combination of activations forms a good enough representation, then shouldn't the concatenation operations in DenseNet or fastai's AdaptiveConcatPool2d layer be superfluous?
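For what it's worth, here is a rough sketch of the parameter-count side of that argument (the sizes are arbitrary; the point is only where the extra weights appear when you concatenate instead of add):

import torch
import torch.nn as nn

vocab, seq_len, d = 1000, 128, 512

tok_emb = nn.Embedding(vocab, d)
pos_emb = nn.Embedding(seq_len, d)

tokens = torch.randint(0, vocab, (1, seq_len))
positions = torch.arange(seq_len).unsqueeze(0)

# Option 1: add the two embeddings. The next layer sees a d-dim vector,
# so a projection back to d costs d*d weights.
proj_add = nn.Linear(d, d)
out_add = proj_add(tok_emb(tokens) + pos_emb(positions))

# Option 2: concatenate them. The next layer now sees a 2d-dim vector,
# so the same projection costs 2*d*d weights.
proj_cat = nn.Linear(2 * d, d)
out_cat = proj_cat(torch.cat([tok_emb(tokens), pos_emb(positions)], dim=-1))

print(sum(p.numel() for p in proj_add.parameters()))  # 262,656
print(sum(p.numel() for p in proj_cat.parameters()))  # 524,800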

Has anyone found that when doing inference on a mixed precision model in the stable_diffusion.ipynb notebook they need to autocast their inputs? Is there some way to set this as a property of the pipe itself?

For example

from torch import autocast

with autocast("cuda"):
    images = pipe(prompt).images

works just fine but

torch.manual_seed(1024)
pipe(prompt).images[0]

RuntimeError: expected scalar type Half but found Float

EDIT: Installing the latest versions as recommended resolved the issue. If you run into the problems above, just make sure you're running the most up-to-date versions of all the libraries. I had relied on what Gradient had installed by default.
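For reference, this is roughly the setup that works for me after updating (a sketch: the model id is the one used in the lesson, and I've left out the auth-token argument you may still need):

import torch
from diffusers import StableDiffusionPipeline

# Load the whole pipeline in half precision; with a recent diffusers release
# this runs without wrapping every call in autocast.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a photograph of an astronaut riding a horse"
torch.manual_seed(1024)
image = pipe(prompt).images[0]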

2 Likes

I started to play with the SD deep-dive notebook from John, and it's great for playing with the latent space.
Here's a summary:

You can grab the full notebook from here.
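For anyone who hasn't opened it yet, the latent-space part boils down to encoding images through the VAE and poking at the result, roughly like this (a sketch: the model id, the 0.18215 scaling, and the sizes are the standard SD v1 setup assumed here, and you may need your Hugging Face token):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# image: float tensor in [-1, 1], shape (1, 3, 512, 512); random here just for shapes
image = torch.rand(1, 3, 512, 512) * 2 - 1
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # SD's latent scaling
    recon = vae.decode(latents / 0.18215).sample                # back to pixel space

print(latents.shape)  # (1, 4, 64, 64): a 48x smaller representation to play with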

8 Likes

Thx for the suggestions, I’ll give it a try.

Did you try going to Hugging Face, finding that repo, and then requesting access to it?

There is a model card in the Using Stable Diffusion section. Click that and it should take you to the website, where you can accept the terms; you also have to create a token.

Is anyone having a problem while re-watching the live stream? For me, it is buffering every 30 seconds. At first I thought it was a connection problem, but other YouTube videos stream fine.

1 Like

This is the solution:

2 Likes

Interpreting Stable Diffusion Using Cross Attention

Results from the repo.

[Image: grad-cam_photograph]

  • Without ‘photograph’

[Image: out_infer35]

22 Likes

Wow, very interesting! Nice job getting it working :slight_smile:

1 Like

I was running out of memory on my 1070 Ti 8GB card for the 4x4 grid step, so I ended up adding the line below just before the images = concat(pipe(prompts,..<blah> line:

pipe.enable_attention_slicing()
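For anyone else on a small card, a minimal self-contained version looks roughly like this (the model id, dtype, and prompts are placeholders; the real grid code is in the lesson notebook):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# Compute attention in slices instead of one big batch: a little slower,
# but peak memory for the batched grid call drops enough to fit on 8GB.
pipe.enable_attention_slicing()

prompts = ["an oil painting of an otter"] * 4
images = pipe(prompts).images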

Thanks for the link to the huggingface optimizations page. :slight_smile:

9 Likes

I watched it within two hours of the video being made available and it was auto-set to 240p, but I just went to my settings, set it to 720p, and it worked fine.

Sometimes just reloading the page also works for me if it's buffering a lot and has been recently released (within an hour or two of the lecture).

I’m in Canada btw.

1 Like

I think training image and text embeddings together with a contrastive loss is such a neat idea! It inspired a question: after training the text and image encoders to be "synced up" as described in the lesson, why can't we build a more direct, end-to-end neural network? One that takes a text embedding as input, learns weights that transform that text embedding into an image embedding, and takes the loss (say MSE) between that predicted image embedding/final latent and the image tied to the input text prompt. In other words, what limitations or disadvantages would such an architecture have compared to the U-Net that iteratively subtracts noise? Would this theoretical architecture even work?
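Something like this, as a rough sketch (all sizes and layer choices are made up; it just pins down what I mean by a direct mapper trained with MSE):

import torch
import torch.nn as nn

text_dim = 768              # CLIP-ish text embedding size (assumed)
latent_shape = (4, 64, 64)  # SD-ish image latent shape (assumed)
latent_dim = 4 * 64 * 64

# The hypothetical "direct" model: text embedding in, image latent out, in one shot.
mapper = nn.Sequential(
    nn.Linear(text_dim, 2048),
    nn.ReLU(),
    nn.Linear(2048, latent_dim),
)

def training_step(text_emb, target_latent):
    pred = mapper(text_emb)
    return nn.functional.mse_loss(pred, target_latent.flatten(1))

# One fake batch, just to show the shapes.
loss = training_step(torch.randn(8, text_dim), torch.randn(8, *latent_shape))
loss.backward()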

1 Like

This is super cool, thanks for sharing. I wanted to run through the notebook on my Surface Studio laptop, which has 4GB of GPU RAM. With that line, I can actually run at least the initial part of the notebook!

2 Likes

I decided to try spinning up a new environment to make it easier to work across multiple machines. I have been doing some traveling where I don't have access to my main server, but I do have my Surface Studio laptop.

To set up the environment, I'm using a .devcontainer with VS Code. I am planning on blogging about it, but the core of what I'm using comes from here: sparrowml/sparrow-patterns: A CLI for updating code patterns in Python projects (github.com). If anybody has questions about this setup, let me know and I will answer them and incorporate them into my blog post!

5 Likes