Lesson 9 official topic

Thanks a lot! You are right about the solution, which is also suggested here.

1 Like

Hello, there.

I was wondering, how is the CLIP text embedding factored into the autoencoder’s loss? Is it simply concatenated to the pixels? If so, are the features of the embedding just weighted the same as the image pixels?

I assume dimensions are quite different, so getting the image right, in the end, would be much more important than getting the embedding right. That is, variations in a small (unimportant) subregion of the image would be more penalized than variations in the embedding. I suspect the details will be discussed in further lessons but I am curious!

3 Likes

You can also think of it this way: optimization doesn't happen only during training; it can also happen during inference. Both are optimization processes; the question is what it is that you are optimizing and tweaking. During training, you are typically tweaking the weights of the network. During inference of some generative systems, you are tweaking a set of parameters, and those parameters may in fact be latent compressed representations of the image's pixels, or the pixels themselves (depending on the system: latent diffusion models vs. standard diffusion, etc.).

In connection with that, it is very interesting to explore what Yann LeCun often talks about in relation to self-supervised systems. In many cases there is a lot of uncertainty associated with the potential result at inference time. For example, when you try to predict the next frame of a video, there are many plausible continuations at each stage, so LeCun often speaks about optimization processes that take place during inference to find the right latent variable, the one that leads to the best result. This then takes you to contrastive vs. regularized learning and what he calls energy-based models, which in some ways simplify probabilistic modelling (you get rid of the normalization constant, etc.). The point is that we are moving more and more towards thinking of inference as an optimization process in itself.
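To make that concrete, here is a toy sketch of inference-time optimization (the decoder, the loss, and the sizes are all made up; the only point is that the weights stay frozen and the gradient steps are taken on the latent itself):

import torch

# Stand-in for a frozen, pretrained decoder that maps a latent vector to an image.
decoder = torch.nn.Linear(64, 3 * 32 * 32).eval()
for p in decoder.parameters():
    p.requires_grad_(False)

def badness(image_flat):
    # Stand-in for "how far is this image from what we want" (e.g. a CLIP-based loss).
    return image_flat.pow(2).mean()

# At inference we do NOT touch the weights; we optimize the latent itself.
z = torch.randn(1, 64, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

for step in range(100):
    opt.zero_grad()
    loss = badness(decoder(z))
    loss.backward()
    opt.step()

print(loss.item())  # the latent has been nudged towards a "better" image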

2 Likes

Yes, that makes sense. I was thinking a bit smaller scale (mostly from the perspective of software), but I can somewhat see how this also would apply to reinforcement learning and how it kinda all builds a huge unified framework of thinking :slight_smile:

1 Like

Yes, I’d love to see this addressed in one of the lectures. According to one of the authors (Tom Goldstein, the lab’s PI), Cold Diffusion violates all of the assumptions that go into our theoretical understanding. He put out a really interesting Twitter thread here: https://twitter.com/tomgoldsteincs/status/1562503814422630406

2 Likes

“It is intuitively possible that, in high dimensions, the word vectors form a smaller dimensional subspace within the full embedding space, and the positional vectors form a different smaller dimensional subspace approximately orthogonal to the one spanned by word vectors. Thus despite vector addition, the two subspaces can be manipulated essentially independently of each other by some single learned transformation. Thus, concatenation doesn’t add much, but greatly increases cost in terms of parameters to learn.”

This still seems rather odd to me. If a linear combination of activations forms a good enough representation, then shouldn't the concatenation operations in DenseNet or fastai's AdaptiveConcatPool2d layer be superfluous?
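For what it's worth, here is a rough sketch of the parameter-count side of that argument (the sizes are arbitrary; the point is only where the extra weights appear when you concatenate instead of add):

import torch
import torch.nn as nn

vocab, seq_len, d = 1000, 128, 512

tok_emb = nn.Embedding(vocab, d)
pos_emb = nn.Embedding(seq_len, d)

tokens = torch.randint(0, vocab, (1, seq_len))
positions = torch.arange(seq_len).unsqueeze(0)

# Option 1: add the two embeddings. The next layer sees a d-dim vector,
# so a projection back to d costs d*d weights.
proj_add = nn.Linear(d, d)
out_add = proj_add(tok_emb(tokens) + pos_emb(positions))

# Option 2: concatenate them. The next layer now sees a 2d-dim vector,
# so the same projection costs 2*d*d weights.
proj_cat = nn.Linear(2 * d, d)
out_cat = proj_cat(torch.cat([tok_emb(tokens), pos_emb(positions)], dim=-1))

print(sum(p.numel() for p in proj_add.parameters()))  # 262,656
print(sum(p.numel() for p in proj_cat.parameters()))  # 524,800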

Has anyone found that when doing inference on a mixed precision model in the stable_diffusion.ipynb notebook they need to autocast their inputs? Is there some way to set this as a property of the pipe itself?

For example

from torch import autocast

with autocast("cuda"):
    images = pipe(prompt).images

works just fine but

torch.manual_seed(1024)
pipe(prompt).images[0]

RuntimeError: expected scalar type Half but found Float

EDIT: Installing the latest versions as recommended resolved the issue. If you run into the problems above, just make sure you're running the most up-to-date versions of all the libraries. I had relied on what Gradient had installed by default.
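For reference, this is roughly the setup that works for me after updating (a sketch: the model id is the one used in the lesson, and I've left out the auth-token argument you may still need):

import torch
from diffusers import StableDiffusionPipeline

# Load the whole pipeline in half precision; with a recent diffusers release
# this runs without wrapping every call in autocast.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a photograph of an astronaut riding a horse"
torch.manual_seed(1024)
image = pipe(prompt).images[0]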

2 Likes

I started to play with the SD deep-dive notebook from John, and it's great for playing with the latent space.
Here's a summary:

You can grab the full notebook from here.
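For anyone who hasn't opened it yet, the latent-space part boils down to encoding images through the VAE and poking at the result, roughly like this (a sketch: the model id, the 0.18215 scaling, and the sizes are the standard SD v1 setup assumed here, and you may need your Hugging Face token):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# image: float tensor in [-1, 1], shape (1, 3, 512, 512); random here just for shapes
image = torch.rand(1, 3, 512, 512) * 2 - 1
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # SD's latent scaling
    recon = vae.decode(latents / 0.18215).sample                # back to pixel space

print(latents.shape)  # (1, 4, 64, 64): a 48x smaller representation to play with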

8 Likes

Thx for the suggestions, I’ll give it a try.

Did you try going to Hugging Face, finding that repo, and then requesting access to it?

There is a model card in the Using Stable Diffusion section. Click that and it should take you to the website, where you can accept the terms; you also have to create a token.

Is anyone having a problem while re-watching the live stream? For me, it is buffering every 30 seconds. At first I thought it was a connection problem, but other YouTube videos stream fine.

1 Like

This is the solution:

2 Likes

Interpreting Stable Diffusion Using Cross Attention

Results from the repo.

[Image: grad-cam_photograph]

  • Without ‘photograph’

[Image: out_infer35]

22 Likes

Wow, very interesting! Nice job getting it working :slight_smile:

1 Like

I was running out of memory on my 1070 Ti 8GB card for the 4x4 grid step, so I ended up adding the line below just before the images = concat(pipe(prompts,..<blah> line:

pipe.enable_attention_slicing()
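For anyone else on a small card, a minimal self-contained version looks roughly like this (the model id, dtype, and prompts are placeholders; the real grid code is in the lesson notebook):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# Compute attention in slices instead of one big batch: a little slower,
# but peak memory for the batched grid call drops enough to fit on 8GB.
pipe.enable_attention_slicing()

prompts = ["an oil painting of an otter"] * 4
images = pipe(prompts).images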

Thanks for the link to the huggingface optimizations page. :slight_smile:

9 Likes

I watched it within two hours of the video being made available and it was auto-set to 240p, but I just went to my settings, set it to 720p, and it worked fine.

Sometimes just reloading the page also works for me if it's buffering a lot and has been recently released (within an hour or two of the lecture).

I’m in Canada btw.

1 Like

I think training image and text embeddings together with a contrastive loss is such a neat idea! It inspired a question: after training the text and image encoders to be "synced up" as described in the lesson, why can't we build a more direct, end-to-end neural network? One that takes a text embedding as input, learns weights that transform that text embedding into an image embedding, and takes the loss (say MSE) between that predicted image embedding/final latent and the image tied to the input text prompt. In other words, what limitations or disadvantages would such an architecture have compared to the U-Net that iteratively subtracts noise? Would this theoretical architecture even work?
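Something like this, as a rough sketch (all sizes and layer choices are made up; it just pins down what I mean by a direct mapper trained with MSE):

import torch
import torch.nn as nn

text_dim = 768              # CLIP-ish text embedding size (assumed)
latent_shape = (4, 64, 64)  # SD-ish image latent shape (assumed)
latent_dim = 4 * 64 * 64

# The hypothetical "direct" model: text embedding in, image latent out, in one shot.
mapper = nn.Sequential(
    nn.Linear(text_dim, 2048),
    nn.ReLU(),
    nn.Linear(2048, latent_dim),
)

def training_step(text_emb, target_latent):
    pred = mapper(text_emb)
    return nn.functional.mse_loss(pred, target_latent.flatten(1))

# One fake batch, just to show the shapes.
loss = training_step(torch.randn(8, text_dim), torch.randn(8, *latent_shape))
loss.backward()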

1 Like

This is super cool, thanks for sharing. I wanted to run through the notebook on my Surface Studio laptop, which has 4GB of GPU RAM. With that line, I can actually run at least the initial part of the notebook!

2 Likes

I decided to try spinning up a new environment to make it easier to work across multiple machines. I have been doing some traveling where I don't have access to my main server, but I do have my Surface Studio laptop.

To set up the environment, I'm using a .devcontainer with VS Code. I am planning on blogging about it, but the core of what I'm using comes from here: sparrowml/sparrow-patterns: A CLI for updating code patterns in Python projects (github.com). If anybody has questions about this setup, let me know and I will answer them and incorporate them into my blog post!

5 Likes