Lesson 11 official topic

If you want to drop DDIMScheduler into the stable diffusion stuff you need the right settings. I think the magic code is:

scheduler = diffusers.DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)

6 Likes

Where does one find the magic code? And why do those particular params work in this case?

3 Likes

I stole them from someone on Discord, but if you want to do it properly you can look at the scheduler config: https://huggingface.co/CompVis/stable-diffusion-v1-4/blob/main/scheduler/scheduler_config.json
The extra hitch is that I think DDIMScheduler has clip_sample=True as the default, which clips the sample to (-1, 1) and I think causes issues in the SD case.
I’m also now thinking that since I last tried this, the step of manually scaling the model inputs is gone and replaced by scheduler.scale_model_input, which for lms_discrete does some scaling (as it should) but for DDIM does nothing! So that might be a problem…

EDIT:

Just tried this and it worked without needing to change the model input scaling, so that is indeed the magic code and you can ignore my extra worry above :slight_smile:
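For anyone who wants to try it end to end, here's a minimal sketch of plugging that scheduler into the pipeline (assuming the CompVis/stable-diffusion-v1-4 weights from the lesson; the prompt is just an example):

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

# Build the scheduler with the beta settings from the repo's scheduler_config.json,
# plus clip_sample=False as discussed above.
scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012,
                          beta_schedule="scaled_linear",
                          clip_sample=False, set_alpha_to_one=False)

# Pass the scheduler in when loading the pipeline so it replaces the default one.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a photograph of an astronaut riding a horse",
             num_inference_steps=50).images[0]
```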

9 Likes

About accessing the ‘latex source’ for deciphering the formulas:

Sometimes there’s a faster way than downloading the source package and checking the tex file:

Using the HTML5 viewer:

  1. Go to your favorite arxiv document

  2. Replace the ‘x’ in ‘arxiv’ with a ‘5’, hit enter:

  3. Now you can go to any formula and right click → Show Math As → Annotation → TeX:

  4. This will give you the original ‘TeX’ source.
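For example, https://arxiv.org/abs/2006.11239 (the DDPM paper) becomes https://ar5iv.org/abs/2006.11239.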

PS: Only works for papers released before Sep 2022 (yet?)

12 Likes

Replace the ‘x’ in ‘arxiv’ with a ‘5’, hit enter

That’s amazing! It appears to be using LaTeXML to render to HTML. There are some glitches (images are not displayed, unfortunately, in the papers I tried); I hope they keep supporting this conversion!

5 Likes

For the semantic edit, is it possible to take a vector that represents, for example, a camera view (front, left, back, top) and use that to re-imagine the object from those views?

Are you looking for something like this?


Dreambooth

3 Likes

Yes, something like that, thank you.

1 Like

@descobar7 I believe the folks behind https://dreamfusion3d.github.io/ explored “from the back” / “from the side” prompts to deal with the so-called “Janus” problem, where you’d get two or more front-facing forms in the 3D output.


After all these years of wringing our hands at the black boxes of neural nets…we can just tell them what we want with plain, natural language (rendered in octane, 8k, and trending on artstation, that is!)

I’ve made some headway DiffEditing but I think I’ve got some bugs. I’m not quite sure of the correct formulation for the latents in the masked-out background. I’m definitely seeing the model being much better at masking than sensibly filling in the masked area – are other folks seeing this? I’m a little surprised that the model thinks it should fill in one area and then only really fills in a subset. In any case, sometimes it’s good enough – here’s a successful example:

One effect you’ll notice is a preference for centered-ness – some of the results have bowls within the original bowl, because the model wants to have a perfectly squared-up shot.

And here’s a … less successful example…

(Originals in upper left)

(edit: I later had more success – fixing bugs and better-exploring the hyperparameters got much improved quality – shared on twitter here: https://twitter.com/bigblueboo/status/1585761916718383110)
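For anyone wondering about the same formulation question, my reading of the paper’s step 3 as a rough sketch (names are my own; `orig_latents_t` would be the original image’s latents noised/encoded to the current timestep, `mask` the binary edit mask):

```python
# At each denoising step of the edit: keep the latents being denoised with the
# edit prompt inside the mask, and reset the background (outside the mask) to
# the original image's latents at the same noise level.
latents = mask * latents + (1 - mask) * orig_latents_t
```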

18 Likes

This is really great! I didn’t get as far as you with DiffEdit so far — I got stuck on figuring out the mask since I couldn’t get a good difference between the two states (original prompt vs edited prompt) and then got distracted by other experiments.

But I have a different approach (which wasn’t using the DiffEdit method) for which I’m keen to try out your set of “bowl of fruit” prompts … I didn’t think to try that with my approach though I did have success turning a horse into a zebra. So thank you for the idea :slight_smile:

1 Like

I tried your approach of doing different fruits with my non-DiffEdit image replacement and the results were interesting … I don’t get a full replacement of all the fruit, but I do get a majority for the type of fruit I specify … Now I’m keen to try and get DiffEdit working and try the same seed to see how that behaves :slight_smile:

4 Likes

I tried to implement the first part of DiffEdit, the masking step - here is a sample result -

Hope this might help people implement DiffEdit fully - here is the notebook.
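The rough idea, in case it helps (my own names and thresholds; as I understand the paper, the absolute difference of the two noise predictions is rescaled to [0, 1] and binarized with a threshold around 0.5):

```python
# Sketch of the masking step: compare the noise predicted for the two prompts.
diff = (noise_pred_query - noise_pred_ref).abs().mean(dim=1, keepdim=True)  # average over latent channels
diff = (diff - diff.min()) / (diff.max() - diff.min())                      # rescale to [0, 1]
mask = (diff > 0.5).float()                                                 # 1 = region to edit, 0 = background
```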

9 Likes

@charlie Sorry about so many replies to your post, but I’m not exaggerating when I say that I learnt so many things from just this one post :slight_smile:

  1. I learnt that I wasn’t testing my theories enough. You had such a great test set that I borrowed it. Hope you don’t mind …

  2. I learnt that I should have more faith in my own approaches … and test more to make sure that the faith is justified :slight_smile:

  3. Test, test … test … Your test data really drove this home for me. I can’t explain why (maybe because I love SF and so the Star Trek image resonated? Who knows …) but thank you for making me realize this important fact :slight_smile:

So based on the insights I gained from your post I used your test prompts to test my prompt editing method (which just to clarify is not using DiffEdit) and here’s what I got for the TNG prompt:

I’m pretty chuffed with the results and think that my approach might be a good fit for at least some use-cases. So thanks again for helping me realize it!

8 Likes

Your mask looks way better than the results I got :slight_smile: I’ll definitely be taking a look to see if I can borrow your masking method. Thank you!

1 Like

If you need a quick refresher on how matrices are multiplied together, Jeremy shared http://matrixmultiplication.xyz/ in a previous course.

3 Likes

I couldn’t watch the lesson live and am going through the “from the foundations” part now, playing around with matrix multiplication. On my machine, the test_close test fails when comparing the result of the naive, Python-only multiplication with the Numba/PyTorch versions. I tested this with various new randomly filled matrices, and it happens most, but not all of the times.

So I dug into it a little bit but am no wiser. The default value for eps that the test_close function uses is 0.00001 or 1e-5. (The test checks if two values are within eps of each other.)

The values that don’t match are - depending on the randomly generated weights - between around 2 and 50. The dtype of the Tensor is torch.float32. Even for the worst case, after checking out Wikipedia, the precision for a number between 32 and 64 should be 2^-18 so 3.81e-06.

So why is 1e-5 too strict, and why is the test only successful when setting eps to 1e-4? Is it because of the 784 multiplications and additions that happen for the dot product? Do those imprecisions add up, so you can just get unlucky when a lot of the rounding errors are in the same direction? Is this deterministic or can it depend on the hardware?

In general, how many digits of precision would one expect for fp16 operations?
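For what it’s worth, here’s a small sketch of what I mean, comparing the float32 product against a float64 reference (the shapes are just an example, not the MNIST ones from the notebook):

```python
import torch

torch.manual_seed(42)
a = torch.randn(5, 784)
b = torch.randn(784, 10)

# Same product computed in float64 (reference) and in float32.
ref = (a.double() @ b.double()).float()
res = a @ b

# With 784 multiply-adds per output element the rounding errors accumulate,
# so the worst-case absolute difference is often bigger than 1e-5 even though
# both results are "correct" to float32 precision.
print((ref - res).abs().max())
```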

1 Like

Working my way through the DiffEdit paper, I have a couple of questions. Thanks for the suggestions in the posts above, but if somebody could help clarify a couple of things that would be great. Firstly, the paper says to start with 50% noise and then denoise to create a mask corresponding to a specific prompt, then to change the prompt and repeat. My questions are: how many steps are people denoising over, and how are the timesteps being managed? I see the unet needs the timestep as an input, but if we want to control the noise and progressively reduce it over several steps, this implies to me that the default scheduler steps would need changing. I guess as an alternative I can seed the image with 50% noise and then “inject” it into the process at, say, step 10 using the DDIM scheduler. In the absence of any other suggestions I will try this and see how it works.

Your work on the mask helped me get a handle on creating a mask at my end. Thank you!

My notebook is here if you’d like to see how things are progressing …

2 Likes

I simply set up a break point in the timesteps loop so that the loop would iterate over each step until it got to a specific step and then break. Or, for the second stage where you want to start from some noise, skip over some steps and then start looping … Seems to work, but I didn’t use the DDIM scheduler - just the LMSD one we’ve been using so far …
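Here’s a rough sketch of what I mean (variable names are made up, and it assumes `scheduler` and the encoded image latents `orig_latents` from the deep-dive notebook; with the LMSDiscrete scheduler the noised latent is just x0 + sigma * noise):

```python
import torch

# Noise the original latents to the level of `start_step`, then only run the
# remaining steps of the usual denoising loop.
start_step = 10
sigma = scheduler.sigmas[start_step]
latents = orig_latents + sigma * torch.randn_like(orig_latents)

for i, t in enumerate(scheduler.timesteps):
    if i < start_step:
        continue  # skip the early (high-noise) steps
    # ... usual loop body: scale the inputs, get the unet noise prediction,
    # call scheduler.step(...), etc.
```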

My notebook is here if you’d like to take a look …

4 Likes

Really impressed by how far some of you are getting on DiffEdit.
I am still trying to do step 1, which is to create the mask.
I’m trying not to look at anyone’s code yet, to see how far I can get on my own.
I’m just using the same old LMSDiscreteScheduler we used in the lesson 9 deep-dive notebook, adding a little bit of noise and getting the two noise predictions for the two text prompts.
The input image looks like this:

and the difference of the noises (noise1 - noise2) looks like this:

I know I need to normalize this or something, binarize it, and make it black and white. Ha, sounds so simple but I’ve been stuck on that part lol.

5 Likes