Jay Alammar recently released the Illustrated Stable Diffusion. He previously wrote The Illustrated Transformer, which many folks have found to be an excellent resource.
A PyTorch implementation of the text-to-3D model DreamFusion, powered by the Stable Diffusion text-to-2D model: https://github.com/ashawkey/stable-dreamfusion.
I stumbled across this project on GitHub.
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
Prompt: “check time”
Hmm, has anyone seen a new link for the Berkeley article “Understanding VQ-VAE”?
Here is an archive of it
Part 1:
https://archive.ph/2021.07.14-200624/https://ml.berkeley.edu/blog/posts/vq-vae/
Part 2:
https://archive.ph/2022.04.24-184903/https://ml.berkeley.edu/blog/posts/dalle2/
Great work @jamesrequa. Inspired by you, I also checked out the Colab notebook and managed to run it on my Paperspace. Do you have the prompts saved that were used to generate these images? Your images are much better than the outputs I could get. Wondering if you did something different compared to the Colab notebook.
@vettukal Glad to hear you were inspired and nice work getting it running!
Unfortunately I didn’t save the prompts for these images. I didn’t make any significant changes to the original Hugging Face DreamBooth SD notebook that I shared, except to:
- upload the 5 images of Jeremy as the training set
- modify the instance prompt for the newly created concept to be “a photo of sks person”. I could’ve changed “sks” to something else, but I didn’t think it really mattered
- choose not to do prior preservation, because I didn’t plan to use the model for anything else
As for the prompts, it definitely takes a lot of experimentation & curation of outputs to get good results; some are even referring to prompt engineering as an art form, haha. I highly recommend using Lexica as a guide for creating better prompts. I believe for these images I used something along the lines of “stunning portrait of sks person, by some artist, artstation”, replacing “some artist” with various artist styles that tend to get good results (once again, Lexica is a great way to find good artist names and keywords).
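The substitution workflow above can be sketched in a few lines of Python. This is just an illustrative helper, not part of any notebook; the artist names are hypothetical examples of the kind of keywords you might pull from Lexica:

```python
# Sketch: cycle artist styles into a prompt template, following the
# "stunning portrait of sks person, by some artist, artstation" pattern
# described above. The artist list is a made-up example.
template = "stunning portrait of sks person, by {artist}, artstation"
artists = ["greg rutkowski", "alphonse mucha", "james gurney"]

prompts = [template.format(artist=a) for a in artists]
for p in prompts:
    print(p)
```

Each resulting string can then be fed to the pipeline as-is, which makes it easy to generate a batch of candidates and curate the best outputs.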
Agreed, Lexica is a very good resource for prompt engineering. Of the 5 photos you used, were all of them headshots or were there some body shots also?
@vettukal I tried to have a mixed variety of images including some full body, chest up, and closeup. I think it also helps to have different angles, facial expressions, and background contexts. You definitely aren’t limited to 5 images; I just wanted to test that you could get great results even with so few images.
I just found this application of diffusion models to my field (computational chemistry):
There are a few different articles out there, but I got Stable Diffusion working locally on my Apple M1 Ultra using this one: MacBook M1: How to install and run Stable Diffusion | by Gonzalo Ruiz de Villa | gft-engineering | Sep, 2022 | Medium. Generates an image in around 18s.
Cool new paper / technique shows how text-to-image diffusion models can be used as zero-shot image-to-image editors, made possible by “inferring” the random seed of the input image!
Code implementation of “CycleDiffusion” here: GitHub - ChenWu98/cycle-diffusion
Thanks for sharing! I have an M2 MBA with 24GB RAM and I followed that guide but it takes 1m30s per image on my side, would be so awesome to get it down to 18s. Not sure what’s wrong with my setup but been using colab in the meantime.
I haven’t been able to get the notebooks working with the M1 - just getting noise (also issues with float64). Using cpu worked ok - but the speed dropped to about 2:40 per image. I switched over to my old box, a Linux machine with an 1080TI and the default dream.py images are produced in around 11 seconds - so I might see if I can get that working with the notebooks.
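For anyone hitting the same M1 issues, a possible culprit is that PyTorch’s MPS backend doesn’t support float64. A minimal sketch of device selection that keeps everything in float32 (not taken from any particular notebook, just a general pattern):

```python
# Sketch of device selection for Apple Silicon, assuming a recent PyTorch build.
# MPS has no float64 support, which may explain the float64 errors above,
# so tensors are created explicitly as float32.
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(2, 3, dtype=torch.float32, device=device)
print(x.device.type, x.dtype)
```

The same idea applies inside the notebooks: casting models and latents to float32 before moving them to the MPS device avoids the unsupported-dtype path.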
I could run all of the nb (except the last cell) on a Linux 1080 8G and avoid a CUDA error if I reduce the width by half:
height = 512
width = 256
Didn’t get the same image as in the original nb, but pretty decent “half” images.
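Halving the width helps because Stable Diffusion denoises in latent space, where each spatial dimension is downsampled 8x, so memory scales linearly with image area. A quick back-of-the-envelope check (the 4-channel latent shape is standard SD; the numbers are illustrative, not measured):

```python
# Rough sketch of why halving the width roughly halves latent memory:
# SD's VAE downsamples each spatial dimension 8x into a 4-channel latent.
def latent_shape(height, width, channels=4):
    return (channels, height // 8, width // 8)

full = latent_shape(512, 512)   # (4, 64, 64)
half = latent_shape(512, 256)   # (4, 64, 32)

def elems(s):
    return s[0] * s[1] * s[2]

print(full, half, elems(half) / elems(full))  # half the latent elements
```

Activation memory in the U-Net scales with these latent sizes too, which is why the 512x256 run fits on an 8G card when 512x512 doesn’t.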
Thanks - and I saw the same memory issues. The deep dive notebook also runs well on the 1080TI with some kernel restarts ;).
Sharing this video suggested by Tanishq on Twitter. Nice overview of SD, although he does leave out some details, such as the fact that noise is added to the VAE-encoded latents rather than to the original image.
Overall a nice video.
It’s possible to run all cells in one go on an Nvidia 1080 8G with some minor modifications to the nb, mostly consisting of splitting up some large cells into smaller ones.
Also, the smaller images look better in landscape: height=256 and width=512.
I was curious about training diffusion models for generating text, so I put together a minimal implementation here. This implementation can be used to train an unconditional generative model of text, and it also includes a small implementation of classifier guidance for conditional generation. Happy to answer questions about the implementation or diffusion models in general! It also includes a denoising/generative sampling loop visualization:
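For anyone new to the topic, the core object any such model learns to invert is the forward noising process. A minimal, self-contained sketch (the schedule value here is illustrative and not taken from that repo):

```python
# Minimal sketch of the forward (noising) step a diffusion model is trained
# to invert: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
# alpha_bar here is a single illustrative schedule value, not from the repo.
import math
import random

def noise_step(x0, alpha_bar, eps):
    return [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * e
            for x, e in zip(x0, eps)]

x0 = [1.0, -0.5, 0.25]                      # a toy "clean" embedding
eps = [random.gauss(0, 1) for _ in x0]      # standard Gaussian noise
xt = noise_step(x0, 0.9, eps)               # mostly signal, a little noise
```

At alpha_bar = 1 the sample is untouched, and at alpha_bar = 0 it is pure noise; the model learns to predict eps (or x_0) from x_t at every intermediate point.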