Share your work here ✅ (Part 2 2022)

I modified my fast neural style transfer notebook to use miniai. The notebook is a bit of a mess, but I wanted to share it before I leave for the weekend in case anyone wants to play around with it. I will clean it up when I get back.

I added a VGGStyleLoss class to calculate the loss using hooks to get the vgg features.
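In case it helps anyone reading along, here is a minimal sketch of the idea (not my actual VGGStyleLoss; the layer indices are just placeholders): register forward hooks on a few VGG layers, grab the feature maps, and compare Gram matrices.

import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Sketch only: hook a few VGG layers and compare Gram matrices.
vgg = vgg16(weights='DEFAULT').features.eval().requires_grad_(False)
layer_ids = [3, 8, 15, 22]   # placeholder layer indices, not necessarily the ones I used
feats = {}

def make_hook(i):
    def hook(module, inp, out): feats[i] = out
    return hook

for i in layer_ids: vgg[i].register_forward_hook(make_hook(i))

def gram(x):
    b, c, h, w = x.shape
    x = x.view(b, c, h * w)
    return x @ x.transpose(1, 2) / (c * h * w)   # batched Gram matrix

def style_loss(generated, style_img):
    vgg(style_img)
    style_grams = {i: gram(f) for i, f in feats.items()}
    vgg(generated)
    gen_grams = {i: gram(f) for i, f in feats.items()}
    return sum(F.mse_loss(gen_grams[i], style_grams[i]) for i in layer_ids)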

I made a StyleOptCB that inherits from the MixedPrecision callback because the batch matrix multiplication (whether done with torch.bmm or einsum('bfs,bgs->bfg', ...)) does not work in half precision. I call self.autocast.__exit__() before calculating the loss to get around this.
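For anyone curious, the workaround looks roughly like this (a sketch, not my actual callback; it assumes miniai's MixedPrecision keeps the autocast context manager on self.autocast, that callbacks receive the learner, and that the preds/loss_func shapes shown here stand in for the real ones):

import torch

class StyleOptCB(MixedPrecision):
    def get_loss(self, learn):
        # Drop out of the fp16 autocast region first: the batched Gram-matrix
        # product below does not work in half precision.
        self.autocast.__exit__(None, None, None)
        b, c, h, w = learn.preds.shape                   # hypothetical: preds are feature maps
        f = learn.preds.float().view(b, c, h * w)
        grams = torch.einsum('bfs,bgs->bfg', f, f)       # same thing torch.bmm would compute
        learn.loss = learn.loss_func(grams, learn.batch[1])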

I like seeing the progress of the generated images, so I also added an ImageCheckpointCB and a ModelCheckpointCB to save sample images and model checkpoints at specified intervals.
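A callback like that can be quite small. Here is a minimal sketch of the ImageCheckpointCB idea (not my actual implementation; it assumes miniai-style callbacks that receive the learner and expose .iter and .preds, and uses learn.preds as a stand-in for whatever holds the current generated image):

import os
import torch
from torchvision.utils import save_image
from miniai.learner import Callback   # assumed import path

class ImageCheckpointCB(Callback):
    def __init__(self, every=500, path='samples'):
        self.every, self.path = every, path
        os.makedirs(path, exist_ok=True)
    def after_batch(self, learn):
        if learn.iter % self.every == 0:
            with torch.no_grad():
                save_image(learn.preds.float().clamp(0, 1),
                           f'{self.path}/iter_{learn.iter:06d}.png')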

Here is a random sample output:

Style Image:

You can download the dataset from Kaggle or swap it for another one.

11 Likes

Beautiful results!

I’ve been experimenting with the technique outlined in JohnO’s article, Mid-U Guidance: Fast Classifier Guidance for Latent Diffusion Models.

But instead of a “classifier” model, I’ve used an image regression model trained on a head pose dataset to predict head position, orientation, and scale, and I’m using that to guide Stable Diffusion inference.

In these sample images, I’m using my model to steer Stable Diffusion to center the image and to “pose” the generated image in a given direction, not with a text prompt, but by using my model to steer the orientation with numeric targets for pitch, yaw, x, y and scale.
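The core of the guidance step is conceptually simple. Here is a rough sketch of classifier-style gradient guidance in general, not JohnO’s Mid-U implementation (his trick hooks the U-Net’s mid-block features rather than decoding a predicted image) and not my exact code; pose_model, text_emb, targets and guidance_scale are placeholders. The idea: score the current denoising prediction with the regression model and nudge the latents in the direction that moves the predicted pose toward the numeric targets.

import torch
import torch.nn.functional as F

def pose_guidance_step(latents, t, unet, vae, scheduler, pose_model, targets,
                       text_emb, guidance_scale=50.0):
    latents = latents.detach().requires_grad_(True)
    noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    # Predict the fully denoised latents from the current noisy latents
    alpha_bar = scheduler.alphas_cumprod[t]
    pred_x0 = (latents - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()
    image = vae.decode(pred_x0 / vae.config.scaling_factor).sample
    pose = pose_model(image)                     # predicted [pitch, yaw, x, y, scale]
    loss = F.mse_loss(pose, targets)
    grad = torch.autograd.grad(loss, latents)[0]
    return latents.detach() - guidance_scale * grad   # push the pose toward the targets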

Still working on it (and having fun with it!)… but JohnO’s MIDU approach is ingenious and can be used to drive stable diffusion inference with just about any image model.

Prompt: “Photo of a woman”

Prompt: “Photo of a woman”

Prompt: “magazine with a woman on the cover”

Prompt: “Photo of a man”

Prompt: “Photo of a man”

9 Likes

More of my experiments using JohnO’s MidU technique along with my head-pose image regression model.

Guiding Stable Diffusion to generate heads with the specified position and pose (numeric targets). This time interpolating through the range of numeric pitch and yaw “angles” while keeping the generated head centered…
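The sweep itself is just interpolation over the numeric targets while x, y and scale stay fixed, something like this (ranges and step counts are placeholders):

import torch

pitches = torch.linspace(-30, 30, steps=7)    # degrees, placeholder range
yaws    = torch.linspace(-45, 45, steps=7)
for pitch in pitches:
    for yaw in yaws:
        targets = torch.tensor([pitch.item(), yaw.item(), 0.5, 0.5, 1.0])  # pitch, yaw, x, y, scale
        # ... run the guided sampling loop with these targets ...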

This is using the stock Stable Diffusion model… no fine-tuning on heads or anything else… just done at inference time…

Amazing work JohnO!

I’m not controlling for head roll, to give more natural-looking poses…

Next step maybe eye direction in addition to head pose… lots of fun!

text prompt: Photo of a young woman

7 Likes

I got sidetracked with some experiments, but I’ll upload the cleaned-up notebook tomorrow. Getting the notebook to work on Kaggle was also a bit annoying, specifically with PyTorch 1.13. I ended up downgrading to PyTorch 1.12 for the Kaggle notebook.

One of the things I tried was using the convnext_nano model from the timm library instead of the VGG models. I’m unsure if I like the results more, but the colors seem closer to the style image.

I have not had a chance to test many layer combinations for the convnext_nano model. The above sample is from using all the GELU activation layers and the loss scaling described by @matdmiller earlier.
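For anyone who wants to try the same layer selection, grabbing every GELU activation from the timm model is straightforward (a sketch; the hook bookkeeping is the same as in the VGG version):

import timm
import torch.nn as nn

model = timm.create_model('convnext_nano', pretrained=True).eval().requires_grad_(False)
gelu_layers = [m for m in model.modules() if isinstance(m, nn.GELU)]

feats = []
for m in gelu_layers:
    m.register_forward_hook(lambda module, inp, out: feats.append(out))
# Remember to clear feats between forward passes.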

Training time is about as long with the convnext_nano model as with the VGG19 model.

6 Likes

Ok, here is the link to the cleaned-up notebook.

Also, I tried one of the new convnextv2_nano models with the same style and content weights and all the GELU layers and got wildly different (and bad) results. I did not have a chance to investigate why, though.

3 Likes

Another approach is to use a generative AI model to derive questions from the transcript and then train a Q&A model on them. This avoids the cost of seeding the prompt for every question with context, and it also overcomes the limits of semantic search. There is a good blog post on this approach if you’re interested: Improving Search Ranking with Few-Shot Prompting of LLMs | Vespa Blog
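A minimal sketch of that idea (ask_llm is a hypothetical stand-in for whatever generative model or API you use, and the prompt/passage length are placeholders): split the transcript into passages, prompt the model for a few questions per passage, and keep the (question, passage) pairs as training data for the Q&A/ranking model.

def ask_llm(prompt):
    # Hypothetical stand-in for a call to whatever generative model you use.
    raise NotImplementedError

def make_qa_pairs(transcript, passage_len=200):
    words = transcript.split()
    passages = [' '.join(words[i:i + passage_len]) for i in range(0, len(words), passage_len)]
    pairs = []
    for passage in passages:
        questions = ask_llm(f"Write three questions answered by this passage:\n\n{passage}")
        for q in questions.splitlines():
            if q.strip():
                pairs.append((q.strip(), passage))
    return pairs   # (question, passage) pairs to train the Q&A / ranking model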

1 Like

It took me “a bit of time”, but eventually I made it!
The quarto blog is live at https://artste.github.io/blog/posts/real-images-island.

There is a little introduction to some of the techniques and a deep dive into the ideas I’ve tried.

16 Likes

Wow amazing! Do you have a tweet I can share?

4 Likes

Here we go: https://twitter.com/stgiomo/status/1624784151776722944?s=20&t=CoDVn0oIElC0hKtE_q_TRw

3 Likes

BTW: Quarto is awesome, but if you’re struggling to properly configure it for open-graph and Twitter cards, adding these lines to your _quarto.yml will solve the problem:

website:
  site-url: https://[your-site-url]
  open-graph: true
  twitter-card:
    creator: "@[your-twitter-account]"

More on this at: Cynthia Huang - Thumbnail Previews for Quarto Websites (for Dummies)

9 Likes

I’m still back at the beginning of lesson 18, so maybe this is covered later in the course, but I found DreamBooth interesting, so I implemented a minimal version of it. I’m using my own version of miniai that I’m building for the class (AIsaac). It’s super similar, but takes a couple of different paths.

A few notes:

  • Very, very basic/simplified minimal version of Hugging Face’s DreamBooth script
  • My own dataset, because I wanted validation images (a friend’s dog)
  • It doesn’t generate very good images at all, but it is generating images of the correct dog. Lots to improve on, but a good starting point!
  • Good example of more “advanced” training as a callback using accelerate below!
import torch
import fastcore.all as fc
from diffusers import DDPMScheduler, AutoencoderKL
# AccelerateCB and pretrained_model come from earlier in the notebook (AIsaac).

class DreamBoothTrainCB(AccelerateCB):
    def before_fit(self, trainer):
        '''Wraps model, opt, data in accelerate and loads the noise scheduler and VAE'''
        trainer.model,trainer.opt,trainer.dls.train,trainer.dls.valid = self.acc.prepare(
            trainer.model, trainer.opt, trainer.dls.train, trainer.dls.valid)

        trainer.noise_scheduler = DDPMScheduler.from_pretrained(pretrained_model, subfolder="scheduler")
        trainer.vae = AutoencoderKL.from_pretrained(pretrained_model, subfolder="vae", revision=None)
        trainer.vae.to(self.acc.device, dtype=torch.float16)

    def before_batch(self,trainer):
        trainer.batch = fc.L(trainer.batch)
        trainer.batch.append(trainer.vae.encode(trainer.batch[0].to(dtype=torch.float16)).latent_dist.sample() * trainer.vae.config.scaling_factor) # batch[2]: latents
        trainer.batch.append(torch.randn_like(trainer.batch[2]).float()) # batch[3]: noise
        trainer.batch.append(torch.randint(0, trainer.noise_scheduler.config.num_train_timesteps, (trainer.batch[2].shape[0],), device=trainer.batch[2].device).long()) # batch[4]: timesteps
        trainer.batch.append(trainer.noise_scheduler.add_noise(trainer.batch[2], trainer.batch[3], trainer.batch[4])) # batch[5]: noisy latents

    def predict(self,trainer): trainer.preds = trainer.model(trainer.batch)
    def get_loss(self,trainer): trainer.loss = trainer.loss_func(trainer.preds, trainer.batch[3])

    def backward(self,trainer):
        self.acc.backward(trainer.loss)
        if self.acc.sync_gradients: self.acc.clip_grad_norm_(trainer.model.parameters(), 1.)

    def zero_grad(self,trainer): trainer.opt.zero_grad(set_to_none=True)

Here’s the notebook so far with all the code: AIsaac - Dreambooth

7 Likes

We didn’t cover DreamBooth in detail so this is great to see! Looks like you are getting some promising results!

4 Likes

Yes, results are good so far. I plan to build it out a bit more and see how a more tuned version does, and then compare it to the new paper that just came out a couple of days ago, which claims to need only 1 image and 5 training steps!

Really cool results visual from the paper for fine-tuning on a single image.

6 Likes

Who is teaching the Alians???

So @willsa14 and I have been hacking away at learning how to deploy an NLP model. The good news is that we successfully built our first model, where we fine-tuned GPT-2 with a custom hip-hop lyrics dataset we scraped off a hip-hop website.

The goal was to create a text generator, which we were able to do, and in terms of basic hip-hop structure, it works! However, as a proud New Yorker who grew up during the birth of Hip Hop, I have to say the model is an illustration of the worst stereotypes of the genre.

With that said, this project made me truly appreciate how ML can have bias built into its model as a default. Discerning bias in ML is not just about the code; it’s about everyday people getting outside of their bubbles and trying to understand other perspectives.

And that my friends is a function of the golden_rule().

Thank you @jeremy for putting me on this journey!

Ohh… Here is the model!!!

5 Likes

Guess what! I increased the context length from 150 to 500 and my AI ghostwriter got a whole lot more respectable!

3 Likes

A few comments Jeremy made early on in lesson 9 made me think it might not be too hard to put together a little working demonstration of the core principle at work in diffusion models: updating an image based on gradients derived from how close (or not) the image is to some desired kind of image.

For this tiniest of prototype POCs, I trained a model to detect handwritten ‘8’ digits (using standard fastai), then switched over to raw PyTorch to generate random noise and get the gradients, with respect to the pixels, of the probability that our random noise was a number eight.
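The core loop is only a few lines. Here is a sketch of the idea rather than my exact code (classifier is the fastai-trained model’s underlying PyTorch module and eight_idx is the index of the ‘8’ class; both are assumed to exist already):

import torch

x = torch.rand(1, 1, 28, 28, requires_grad=True)   # random noise "image"
opt = torch.optim.Adam([x], lr=0.1)

for step in range(200):
    probs = torch.softmax(classifier(x), dim=1)
    loss = -probs[0, eight_idx]      # maximise p(eight) by minimising its negative
    opt.zero_grad()
    loss.backward()
    opt.step()
    x.data.clamp_(0, 1)              # keep pixels in a valid range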

Surprisingly, the iterative process was successful in that it updated the pixels such that the model thought the image was an ‘8’, but unsuccessful in that the changes it made were basically imperceptible. Which is to say, the process was a way of creating a sort of chimera image.

It didn’t work with the random noise such that you could magically watch the ‘8’ emerge from the noise (as I was hoping), but I suspect that has to do with how much we’re updating the pixels: we probably want to update the pixels a lot initially, and then reduce the amount as we continue. For now, I’ve written up the process and I’ll keep going with the part two lessons :slight_smile:

A fun experiment and many thanks to those who popped in to chat and unblock me in the discord ‘live-coding’ room on Saturday and of course to the Delft FastAI study group who properly unblocked me today.

(and the code here in case you’re interested in tinkering)

13 Likes

This is similar to what I had done earlier in this thread, where I trained an image regression model to recognize head poses and used it to guide diffusion to generate heads with a given pose by steering the diffusion process with that model.

Something that you might want to try…

I trained my model on “noisy” versions of the head pose images (random amounts of noise added; pretty much a form of augmentation/regularization). That helps the model recognize head poses that are just starting to emerge from the noise, so the diffusion process is steered more effectively and earlier in the denoising process, and is more gently guided into the right latent space…

You might want to try something similar… At the beginning of the diffusion process, nothing will look like an 8 (even something that is well on its way to being denoised into an 8).

If you’re doing this with Stable Diffusion rather than your own diffuser, you can use SD’s unet to predict what the fully denoised image will look like at each step and feed that (predicted denoised) image to your classifier… JohnO does that in his video coherence notebook here.
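With diffusers you can get that predicted fully-denoised image directly: most schedulers return it as pred_original_sample from step() (a sketch; the variable names are placeholders for whatever your denoising loop uses):

# Inside the denoising loop, after the unet call:
step_out = scheduler.step(noise_pred, t, latents)
pred_x0_latents = step_out.pred_original_sample                      # predicted clean latents
pred_x0_image = vae.decode(pred_x0_latents / vae.config.scaling_factor).sample
score = classifier(pred_x0_image)                                    # guide with the *predicted* clean image
latents = step_out.prev_sample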

This is also pretty much what ControlNet is doing with Stable Diffusion as well… i.e. using other models to provide guidance to the diffusion process…

8 Likes

Spurred on by discussions in our weekly fastai study group, I tried out perceptual loss as a way of updating the pixel values so as to generate the eight. As @ste said, it’s sort of ‘cheating’ since you’re using the target image to guide it, and of course our classifier is no longer involved, but I’ll tell you it was really nice to see the ‘8’ emerge out of the random noise FINALLY.
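For anyone who wants to try the same thing, the ‘cheating’ version really is just optimising the pixels to match the target image’s feature activations (a sketch, not my exact code; feat_extractor is whatever pretrained network you hook and target_eight is the reference image, both assumed to exist already):

import torch
import torch.nn.functional as F

x = torch.rand_like(target_eight).requires_grad_(True)
opt = torch.optim.Adam([x], lr=0.05)

with torch.no_grad():
    target_feats = feat_extractor(target_eight)

for step in range(300):
    loss = F.mse_loss(feat_extractor(x), target_feats)   # perceptual (feature-space) loss
    opt.zero_grad()
    loss.backward()
    opt.step()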

8 Likes

Kudos @strickvl - now it’s time to wipe out the noise :wink:

3 Likes