Based on what Jeremy said at the start of the whiteboarding section of lecture 9, I thought it might be possible (and simpler) to generate images directly using CLIP: just tweak the pixels with gradient descent on the CLIP similarity score, without any noise-predictor model at all.
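To make the idea concrete, here's a minimal sketch of the optimization loop. Note the hedges: I'm using a frozen random linear layer as a stand-in for CLIP's image encoder, and a random vector as a stand-in for the text embedding of the prompt, just so the snippet runs self-contained — in the actual notebook these would be CLIP's image tower and CLIP's encoding of the prompt.

```python
import torch

torch.manual_seed(0)

D = 3 * 32 * 32  # a tiny 32x32 RGB "image", flattened
EMB = 64         # embedding dimension

# Stand-in for CLIP's (frozen) image encoder.
encoder = torch.nn.Linear(D, EMB)
for p in encoder.parameters():
    p.requires_grad_(False)

# Stand-in for the CLIP text embedding of the prompt.
text_emb = torch.nn.functional.normalize(torch.randn(EMB), dim=0)

# The image itself is the only thing we optimize.
pixels = torch.rand(D, requires_grad=True)
opt = torch.optim.Adam([pixels], lr=0.05)

def similarity(px):
    img_emb = torch.nn.functional.normalize(encoder(px), dim=0)
    return img_emb @ text_emb  # cosine similarity

start = similarity(pixels).item()
for _ in range(200):
    opt.zero_grad()
    (-similarity(pixels)).backward()  # gradient *ascent* on similarity
    opt.step()
    with torch.no_grad():
        pixels.clamp_(0.0, 1.0)      # keep pixels in a valid range

end = similarity(pixels).item()
print(f"similarity: {start:.3f} -> {end:.3f}")
```

The similarity climbs steadily, which is exactly the failure mode too: nothing constrains the pixels to look like a natural image, only to score well under the encoder.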
It…sort of works! (Though it's nowhere near Stable Diffusion.) Check out the notebook here (Google Colab) and the resulting images here (https://user-images.githubusercontent.com/4140821/198151444-5557cbd0-5b17-471f-b683-9ef6f136efca.png). — Edit: The notebook has a lot of comments, so it should be fairly readable even if you're not too familiar with the material.
I would love to hear anyone's thoughts on this, including (1) whether you think it could be made to work better, and (2) why the pieces Stable Diffusion adds on top of CLIP help it avoid weird results like these.