Attempting to generate images using nothing but CLIP

Based on what Jeremy said at the start of the whiteboarding in lecture 9, I thought it might be possible (and simpler) to generate images directly with CLIP, just by tweaking pixels via gradient descent, without any sort of noise-predictor model.
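To make the idea concrete, here's a minimal sketch of the optimization loop. This is not the notebook's code: the real version would use CLIP's image encoder and a text embedding of the prompt, whereas here the encoder is a stand-in random linear map so the example runs without downloading a model. The point is just the core trick: treat the pixels themselves as the parameters and ascend the cosine similarity between the image embedding and the target embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for CLIP's image encoder: a fixed random linear map.
# In the real notebook this would be CLIP's vision tower, and `target` would
# be CLIP's text embedding of the prompt.
H, W = 8, 8
D = 16  # embedding dimension
encoder = rng.normal(size=(D, H * W))
target = rng.normal(size=D)
target /= np.linalg.norm(target)

def embed(pixels):
    v = encoder @ pixels.ravel()
    return v / np.linalg.norm(v)

def similarity(pixels):
    # Cosine similarity between the image embedding and the target.
    return embed(pixels) @ target

# Gradient ascent directly on the pixels -- no diffusion model involved.
pixels = rng.normal(size=(H, W)) * 0.01
lr = 0.1
for step in range(500):
    # Analytic gradient of cosine similarity w.r.t. the pixels
    # (with autograd, e.g. PyTorch, you'd get this for free).
    v = encoder @ pixels.ravel()
    n = np.linalg.norm(v)
    grad_v = target / n - (v @ target) * v / n**3
    grad_pix = (encoder.T @ grad_v).reshape(H, W)
    pixels += lr * grad_pix

print(f"final similarity: {similarity(pixels):.3f}")
```

With a real CLIP encoder the loss landscape is far less friendly than this linear toy, which is part of why raw pixel optimization produces the weird, adversarial-looking images discussed below.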

It…sort of works! (But it's also nowhere near Stable Diffusion.) Check out the notebook here (Google Colab) and the resulting images here. Edit: The notebook has a lot of comments, so it might be pretty readable even if you're not too familiar with it.

I would love to hear anyone’s thoughts on this, including (1) whether you think it could be made to work better and (2) why Stable Diffusion’s additions to CLIP help avoid weird results like this.


How about something with a more semantic latent space, one that's easier to navigate with gradient descent and tends to produce a realistic image by construction? Something like BigGAN or StyleGAN, or even VQGAN? That's how you end up with BigSleep and VQGAN+CLIP, and it's effectively how the AI art scene started in 2021.

Daniel Russell later did some more experiments with direct pixel optimization and got some interesting results. Check it out: