Since Jeremy mentioned the perceptual loss in lecture 9, I thought I’d share this notebook where I reproduced the Johnson, Alahi, and Fei-Fei 2016 paper using fastai. I did this in 2020 and haven’t updated it since, so I can’t guarantee it will work out of the box, but I hope someone will find it helpful.
I did this as part of a more significant project where I reproduced a few papers using nbdev and fastai. Of course, I called it fastpapers. Now that the new version of nbdev uses Quarto, which supports so many new features (like references), I'm eager to update this library!
Here are some images from the style transfer notebook:
I was playing around with the various resources Jeremy listed and wanted to compare the deep dive notebook to the results from Lexica. Same prompt and guidance scale. Both made me laugh!
I decided to see if I could make a picture of my daughter's dog look like a unicorn. The initial dog is shown here. I used the img2img pipeline and the prompt "A unicorn", then looked at different strength values. A very low strength (say 0.2) left the dog very much unchanged, whereas a high value of 0.7 lost the dog altogether. I found a nice result with a value of 0.3. Interestingly, above this the head starts to get multiple horns! It was an interesting exercise, and I can already see the need to learn to craft the right prompts and to tune the parameters. To explain further, the values of strength in the rows are [0.2, 0.3, 0.4, 0.7], with three images generated per row.
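For anyone who wants to try the same sweep, here's a rough sketch (not my exact code) of how it might look with the diffusers img2img pipeline; the model id, image path, and seed are just placeholder assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("dog.jpg").convert("RGB").resize((512, 512))  # placeholder path

rows = []
for strength in [0.2, 0.3, 0.4, 0.7]:                    # one row per strength value
    generator = torch.Generator("cuda").manual_seed(42)  # fixed seed so rows are comparable
    out = pipe(
        prompt="A unicorn",
        image=init_image,             # older diffusers releases call this init_image
        strength=strength,            # ~0.2 barely changes the dog, ~0.7 loses it entirely
        guidance_scale=7.5,
        num_images_per_prompt=3,      # three images per strength value
        generator=generator,
    )
    rows.append(out.images)
```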
I used the lesson 9 notebook to generate a Banksy sketch of a robot. I needed a cool image for the first page of a PowerPoint deck I was creating for another course's project presentation.
I think she prefers the dog; our granddaughter, on the other hand, likes the unicorn, but then she is into My Little Pony 😀! We could have lots of fun with this.
I tried to generate images using the same prompt, but in different languages. (All translated from English using DeepL.) The first one was this prompt.
A picture of a town hall in a historical quarter of a city
It seems like the model aligns text written in a given language with the most common pictures from that region. Except for Greek, which collapsed. I guess the language isn't well represented in the dataset(?). Also, an interesting interpretation of an Estonian town hall. Too many castle pictures associated with this language?
The second one I tried is this.
A crowded street in a big city on a winter morning
Again, the Greek one is rather odd. Is it a kind of "averaged" tourist's selfie? I'm not sure Estonian landscapes really have mountains like these… Also, English, German, and French are captured pretty accurately. The most frequent language/image pairs in the dataset?
I wrote up some of my thoughts / learnings from lesson 9 in a blog (including a glossary of terms). It’s not fully updated with things from 9a or 9b yet, but I’ll probably write subsequent blogs alongside lesson 10 this week.
Great idea and thank you for sharing! I would also think that these images are the result of the under-representation of languages other than English in the datasets. The question here is, which dataset matters? The one used for CLIP training (not public) or the one for Stable Diffusion (LAION)? Both? I haven't wrapped my head around these models yet.
According to the model card, Stable Diffusion was trained on LAION-2B-en or subsets thereof, which consist primarily of English descriptions. So any other language should be represented pretty poorly; I'm surprised by the decent results in German and French.
The black image for Greek in the first prompt is probably not a model failure; I suppose it was blocked by the overly aggressive NSFW filter.
Yes, I agree, it's interesting how that works! I don't know exactly how the tokenizer was trained, but it can definitely extract at least some meaning from non-English languages as well. I also expected that one might encounter something like an "unknown token" error, but I guess that is not the case for such large models. Was it trained on the whole Unicode character set?
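One quick way to poke at this (just a sketch; the prompts are rough translations I picked for illustration): as far as I understand, the CLIP tokenizer shipped with Stable Diffusion is a byte-level BPE, so any Unicode string gets split into sub-word or byte pieces instead of producing an unknown-token error.

```python
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for text in ["a town hall", "ein Rathaus", "ένα δημαρχείο"]:
    pieces = tok.tokenize(text)
    print(f"{text!r}: {len(pieces)} tokens -> {pieces}")

# English typically comes out as a few whole-word tokens, while the Greek text
# is shredded into many byte-level fragments: no error, but a much less
# meaningful input for the text encoder.
```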
That’s true! For some languages, it seems the model tries to generate the most common, “standard” image of that language/country. (Like showing cathedrals along the river for Russian.) Or falls back to some “generic”, even if not related, picture.
Thank you for the feedback! I expected that for other languages it would produce even worse results. These complex models take some time to figure out! I still feel a bit lost, even though I am familiar with the individual parts. But I definitely don't know enough about the data.
That's a good point. I was also expecting images that aren't well aligned with the prompt for non-English languages. But it seems that somehow, for these two, the model indeed worked quite well! Especially compared to the others, they look very plausible.
Oh, you’re right, I’ve completely forgotten about the filter. I will try to disable it to get the results “as is.”
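In case it helps anyone else, here's a minimal sketch of one way to skip the checker in diffusers; exact behaviour may vary between releases.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline without a safety checker so flagged prompts return the
# actual image instead of a black square (diffusers prints a warning).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    safety_checker=None,
    torch_dtype=torch.float16,
).to("cuda")

# Or, on a pipeline that is already loaded:
# pipe.safety_checker = None
```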
Yeah, it is fun using interpolation between multiple prompts. I generated the CLIP embeddings for these 4 prompts
a = embed_text(['Paris in Spring, digital art'])
b = embed_text(['Paris in Summer, digital art'])
c = embed_text(['Paris in Fall, digital art'])
d = embed_text(['Paris in Winter, digital art'])
and then just did linear interpolation from a to b to c to d and grabbed some CLIP embeddings along the way. Then I passed each one into the model and made sure to use the same seed, 23532, before generating each image so they would be similar enough.
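In case it's useful, here's a rough sketch of that loop; `generate_from_embedding` is just a stand-in for the sampling function from the deep dive notebook, not a real API.

```python
import torch

n_steps = 10                                      # frames between each pair of prompts
frames = []
for start, end in [(a, b), (b, c), (c, d)]:       # the four CLIP embeddings above
    for t in torch.linspace(0, 1, n_steps):
        emb = torch.lerp(start, end, t)           # linear interpolation in embedding space
        torch.manual_seed(23532)                  # same seed for every frame
        frames.append(generate_from_embedding(emb))  # hypothetical helper
```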
UPDATE: I tried Jeremy's suggestion below. I'm not sure I understood it correctly, but I think the idea was to start from the previous image in the latent space (image-to-image) before going to each next step. So I used the ideas in the deep dive notebook on writing your own function for img2img. I played around with it for a while, tweaking the parameters and start steps, but things sort of get less detailed as the interpolation progresses. It still looks sort of neat, but I'm not sure it's what I was going for. Maybe there is a bug in the code.
UPDATE 2:
I'm not so sure you actually want to interpolate the latent space by starting with the previous image as the input. I could be wrong, but when I do this it leads to weird effects. I think you want to interpolate from a to b, and to make it smoother (more stable) just add more points in between and set a consistent seed/noise. For example, I went back to the interpolation as I first tried it, without using the image2image suggestion, and simply added more interpolation points. See here. Do you think it's better?
Hehe, yes, that's a nice simple idea. I think I had thought about that, then got distracted and just wanted to finish something simple. I'll give it a try tomorrow though.