Share your work here ✅ (Part 2 2022)

@ilovescience : you were right too - I should also add a reference to the original VAE paper (Auto-Encoding Variational Bayes, which has the KL regularizer) to the chart.

2 Likes

@johnri99
Yeah, the market ‘crisis’ which started in early 2022 has been interesting.
When trained on pre-2022 data, the model doesn’t predict the market shutdown & craziness that happened in June, at least for the price (Greenness predictions were fine). And the accuracy of price predictions for the rest of 2022 is much lower (about 50% worse MAE) if no 2022 data is in the training set.

But when some post-crisis data is used for training (it helped to do some basic data augmentation on this subset to increase the proportion of post-crisis data), the accuracy goes back to ‘normal’.

In general, the model seems pretty good at adjusting to new (higher) price expectations. Inputs to the model include prices for the last 7 days (and 15/50/85 percentile prices over this time), so even if it’s shown higher-than-ever-seen prices, it seems to predict “relative to what today’s price is”.
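
For a rough idea of what those inputs look like, here’s a minimal sketch of the feature window (pandas/numpy; the names and exact feature set are illustrative, not my actual pipeline):

import numpy as np
import pandas as pd

def price_features(prices: pd.Series, day):
    # last 7 days of prices up to `day`; the window used for the percentiles is a guess
    window = prices.loc[:day].iloc[-7:]
    p15, p50, p85 = np.percentile(window, [15, 50, 85])  # percentile summaries
    return np.concatenate([window.to_numpy(), [p15, p50, p85]])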

It’s in this area in particular (that is, adjusting to never-before-seen scenarios) that the basic MLP seems to perform better than the other fancier architectures (NBEATS, TFT, etc)… though it’s very possible that I’m just using them wrong! (would love any advice on what others have successfully used for similar forecasting problems!)

1 Like

Hey @hibijibi
What an interesting idea! Would you mind if I make a community pipeline for this in the diffusers library?

Sure - let me know if I can help.

1 Like

Hey folks :wave:, not sure about you, but callbacks have always looked like some sort of wizardry to me :sweat:
I found the miniai implementation great to play around with, and I tried to demystify callbacks in this post.
Feedback welcome!
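
If you just want the gist before reading the post, the whole pattern boils down to something like this toy sketch (illustrative names, not miniai’s actual code): the learner calls a named method on every callback that defines it.

class PrintingCB:
    def before_fit(self, learn): print('starting training')
    def after_fit(self, learn):  print('done training')

class Learner:
    def __init__(self, cbs): self.cbs = cbs
    def callback(self, name):
        # call `name` on every callback that implements it
        for cb in self.cbs:
            method = getattr(cb, name, None)
            if method is not None: method(self)
    def fit(self):
        self.callback('before_fit')
        # ...the actual training loop would go here...
        self.callback('after_fit')

Learner([PrintingCB()]).fit()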

Also links to Twitter and Linkedin posts :wink:

11 Likes

I grabbed a few images to share that show the improvements in the results after implementing new features and running experiments on the Style Transfer notebook. The first 3 images show my progression, and the last 2 use different style images.

Some of the things I’ve implemented or tried so far:

  1. PyTorch forward hooks for feature extraction (see the sketch after this list)
  2. Replacing max pooling with average pooling in VGG16 (Paper)
  3. Weighting the style loss more heavily (multiplying by 100-500 so far seems to work best) (Paper)
  4. Dividing each layer’s component of the loss by the total number of layers used in the loss (Paper)
  5. Using the same layers for both style (1, 6, 11, 18, 25) and content loss (same plus 29), but weighting the losses by multiplying each layer’s style loss by 1 / (layer_idx+1) and content loss by 1 / (n_layers - layer_idx). Essentially this weights the beginning layers higher for style loss and the ending layers higher for content loss, which is the general recommendation in the paper, though they accomplish it by omitting layers.
# Weighted style and content losses: each layer's loss is divided by the number
# of layers, then scaled so early layers dominate the style loss and later
# layers dominate the content loss.
class StyleLossToTarget():
    def __init__(self, target_im, target_layers=(1, 6, 11, 18, 25)):
        fc.store_attr()
        with torch.no_grad(): self.target_grams = calc_grams(target_im, target_layers)
    def __call__(self, input_im):
        # layer idx weighted by 1/(idx+1): earlier layers count more
        return sum((f1-f2).pow(2).mean()/len(self.target_layers)*(1./(idx+1)) for idx, (f1, f2) in
               enumerate(zip(calc_grams(input_im, self.target_layers), self.target_grams)))

class ContentLossToTarget():
    def __init__(self, target_im, target_layers=(1, 6, 11, 18, 25, 29)):
        fc.store_attr()
        with torch.no_grad():
            self.target_features = calc_features(target_im, target_layers)
    def __call__(self, input_im):
        # layer idx weighted by 1/(n_layers-idx): later layers count more
        return sum((f1-f2).pow(2).mean()/len(self.target_layers)*(1./(len(self.target_layers)-idx)) for idx, (f1, f2) in
               enumerate(zip(calc_features(input_im, self.target_layers), self.target_features)))
  6. Increasing the steps to 1,200
  7. Sometimes starting with noise, sometimes the content image and sometimes a combination of the 2. Ex:
noise = torch.rand_like(content_im)/2.
model = TensorModel(noise+(content_im-noise.mean()))
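
The calc_grams and calc_features helpers above come from my notebook; for anyone reading along, here’s a rough sketch of what the hooks-based feature extraction can look like (assuming a recent torchvision; the avg-pool swap and ImageNet normalisation mentioned above are omitted, and this isn’t necessarily identical to my notebook code):

import torch
from torchvision.models import vgg16, VGG16_Weights

vgg = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters(): p.requires_grad_(False)

def calc_features(im, target_layers=(1, 6, 11, 18, 25)):
    feats = []
    hooks = [vgg[i].register_forward_hook(lambda m, inp, out: feats.append(out))
             for i in target_layers]
    try: vgg(im[None] if im.ndim == 3 else im)  # hooks fire in forward order
    finally:
        for h in hooks: h.remove()
    return feats

def calc_grams(im, target_layers=(1, 6, 11, 18, 25)):
    # Gram matrix of each feature map, normalised by its spatial size
    return [torch.einsum('bchw,bdhw->bcd', f, f)/(f.shape[-2]*f.shape[-1])
            for f in calc_features(im, target_layers)]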

Progression w/ Spider Web Style [3 images]
Farewell To Anger Style [image]
Starry Night Style [image]

11 Likes

Hi, this is awesome. I wonder if you are using the latest flexible learner (without the context manager) from the repo in your explanation? I’m also curious what the main differences between fastai and miniai are. For one thing, I think TypeDispatch is not used in miniai, and I don’t know what other differences might make you want to use one over the other.

1 Like

I’ve also been experimenting with Style Transfer from last week’s lesson. Using the ReLU layers immediately before the VGG max-pool layers again seems to work well here (excluding the last one, layer 29). That’s using layers 3, 8, 15 and 22, with each layer weighted differently for the content and style losses.

Content and style images:

[content and style images]

With the content image as the starting model:

[result image]

With random noise like the content image as the starting model:

[result image]

I had an idea to have the starting model be the style image rather than the content image. The content features are then added into the image whilst the output’s style stays close to the style image. I really like the result of this:

[result image]
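
Concretely, the only thing that changes is the starting point of the optimisation, something like this (a tiny sketch assuming the course notebook’s TensorModel and a style_im tensor):

model = TensorModel(style_im.clone())  # start from the style image instead of content/noise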

11 Likes

I saw an interesting demo that used GPT3 to answer questions about user-provided data. What made this interesting is that the data was from after GPT3’s training data cutoff and was also much larger than would fit into the prompt, which is limited to ~2,000 tokens.

The way they accomplished this was to break the input data into smaller chunks of several hundred words each and then ‘index’ each chunk by running it through the GPT3 embedding model. The questions are run through the same embedding model, and the cosine similarity between the question embedding and each chunk embedding is computed as a dot product (the embeddings are unit length, so the two are equivalent). The highest-scoring chunks of text are then added to the GPT3 completion prompt as ‘context’, and the model is asked to answer the original question.
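
A minimal sketch of that retrieval step (assuming the pre-1.0 openai Python package and numpy; the embedding model name and helper names here are illustrative, not necessarily what the demo or my code used):

import numpy as np
import openai

EMBED_MODEL = 'text-embedding-ada-002'   # assumed embedding model

def embed(text):
    resp = openai.Embedding.create(model=EMBED_MODEL, input=text)
    return np.array(resp['data'][0]['embedding'])

def top_chunks(question, chunks, chunk_embs, k=3):
    # embeddings are unit length, so a dot product is the cosine similarity
    q = embed(question)
    scores = np.array([q @ e for e in chunk_embs])
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

# chunks = [...]                           # the transcript chunks
# chunk_embs = [embed(c) for c in chunks]  # 'index' each chunk once, up front
# context = '\n'.join(f'* {c}' for c in top_chunks(question, chunks, chunk_embs))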

I took the Lesson 18 transcription, which is about 17,000 words, broke it into video chapter sections and then subdivided any sections longer than 750 words, ending up with 39 chunks of text. I then followed the same methodology described above to answer questions about the video. Overall it seemed to work pretty well! There are certainly a lot of ways I could improve this, including returning the video timestamp, but it was an interesting proof of concept. It reminded me of the project one of the other students did, indexing and querying all of the images on their computer using CLIP models. I’m interested to see how it would work if you also indexed the fast.ai or miniai codebase and/or docs.

Here is the OpenAI example notebook I used as a reference.

Below is 1 full example of the question, full prompt and answer followed by a few more questions and answers. Each bullet point is for a separate chunk of text.

FULL EXAMPLE W/ PROMPT
Question:

What is the advantage of batch transforms?

Prompt:

Answer the question as truthfully as possible using the provided video transcription context, and if the answer is not contained within the text below, say “I don’t know.”

VIDEO TRANSCRIPTION CONTEXT:

  • And so here you can see this little crop it’s added. Now something you’ll notice is that every single image in this batch, notice I grabbed the first 16, so I don’t want to show you 1024, has exactly the same augmentation. And that makes sense, right, because we’re applying a BatchTransform. Now why is this good and why is it bad? It’s good because this is running on the GPU, right? Which is great because nowadays very often it’s really hard to get enough CPU to feed your fast GPU fast enough. Particularly if you use something like Kaggle or Colab that are really underpowered for CPU, particularly Kaggle. So this way all of our transformations, all of our augmentation is happening on the GPU. On the downside, it means that there’s a little bit less variety. Every mini batch has the same augmentation. I don’t think the downside matters though, because it’s going to see lots of mini batches. So the fact that each mini batch is going to have a different augmentation is actually all I care about. So we can see that if we run this multiple times, you can see it’s got a different augmentation in each mini batch. Okay, so I decided actually I’m just going to use 1 padding. So I’m just going to do a very, very small amount of data augmentation. And I’m going to do 20 epochs using OneCycle learning rate. And so this takes quite a while to train, so we won’t watch it, but check this out. We get to 93.8 That’s pretty wild. Yeah that’s pretty wild. So I actually went on Twitter and I said to the entire world on Twitter, you know, which if you’re watching this in 2023, if Twitter doesn’t exist yet, ask somebody to tell you about what Twitter used to be. Hopefully it still does. Can anybody beat this in 20 epochs? You can use any model you like, any library you like, and nobody’s got anywhere close. So this is pretty amazing. And actually, you know, when I had a look at papers with code, there are, you know, well, I mean, you can see it’s right up there, right, with the kind of best models that are listed, certainly better than these ones. And the better models all use, you know, 250 or more epochs. So yeah, if anybody, I’m hoping that somebody watching this will find a way to beat this in 20 epochs, that would be really great. Because as you can see, we haven’t really done anything very amazingly weirdly clever. It’s all very, very basic. And actually we can go even a bit further than 93.8. Just before we do, I mentioned that since this is actually taking a while to train now, I can’t remember, it takes like 10 to 15 seconds per epoch. So you know, you’re waiting a few minutes, you may as well save it. So you can just call torch.save on a model, and then you can load that back later.
  • So something that can make things even better is something called test time augmentation. I guess I should write this out properly here. Test, text, test time augmentation. Now test time augmentation actually does our BatchTransform callback on validation as well. And then what we’re going to do is we’re actually, in this case, we’re going to do just a very, very, very simple test time augmentation, which is we’re going to add a BatchTransform callback that runs on validate and it’s not random, but it actually just does a horizontal flip. Non-random, so it always does a horizontal flip. And so check this out. What we’re going to do is we’re going to create a new callback called CapturePreds. And after each batch, it’s just going to append to a list the predictions, and it’s going to append to a different list the targets. And that way we can just call learn.fit, train equals False, and it will show us the accuracy. And this is just the same number that we saw before. But then what we can do is we can call the same thing, but this time with a different callback, which is with the horizontal flip callback. And that way it’s going to do exactly the same thing as before, but in every time it’s going to do a horizontal flip. And weirdly enough, that accuracy is slightly higher, which that’s not the interesting bit. The interesting bit is that we’ve now got two sets of predictions. We’ve got the sets of predictions with the non-flipped version. We’ve got the set of predictions with the flipped version. And what we could do is we could stack those together and take the mean. So we’re going to take the average of the flipped and unflipped predictions. And that gives us a better result still, 94.2% So why is it better? It’s because looking at the image from kind of like multiple different directions gives it more opportunities to try to understand what this is a picture of. And so in this case, I’m just giving it two different directions, which is the flipped and unflipped version, and then just taking their average. So yeah, this is like a really nice little trick. Sam’s pointed out it’s a bit like random forest, which is true. It’s a kind of bagging that we’re doing. We’re kind of getting multiple predictions and bringing them together. And so we can actually, so 94.2 I think is my best 20 epoch result. And notice I didn’t have to do any additional training. So it still counts as a 20 epoch result. You can do test time augmentation where you do, you know, a much wider range of different augmentations that you trained with, and then you can use them at test time as well. You know, more, more crops or rotations or warps or whatever.
  • Hi folks, thanks for joining me for Lesson 18. We’re going to start today in Microsoft Excel. You’ll see there’s an Excel folder actually in the course22p2 repo. And in there there’s a spreadsheet called graddesc as in gradient descent, which I guess we should zoom in a bit here. So there’s some instructions here, but this is basically describing what’s in each sheet. We’re going to be looking at the various SGD accelerated approaches we saw last time, but done in a spreadsheet. We’re going to do something very, very simple, which is to try to solve a linear regression. So the actual data was generated with y = ax + b, where a, which is the slope, was 2 and b, which is the intercept or constant, was 30. And so you can see we’ve got some random numbers here. And then over here, we’ve got the ax + b calculation. So then what I did is I copied and pasted as values, just one set of those random numbers into the next sheet called basic.

Q: What is the advantage of batch transforms?
A:

GPT Provided Answer:

The advantage of batch transforms is that they are running on the GPU, which is great because it is often hard to get enough CPU to feed the fast GPU fast enough.

More Q/A’s:
Q: How does mini ai store optimizer state?
A: Mini ai stores optimizer state as attributes on the parameter.

Q: What was the score after adding random erasing?
A: 94.6

Q: Please explain in detail how test time augmentation works?
A: Test time augmentation is a technique used to improve the accuracy of a model by making predictions from multiple different directions. It works by applying a BatchTransform callback to the validation set, which in this case is a horizontal flip. This gives the model more opportunities to understand what the image is a picture of. The predictions from the flipped and unflipped versions are then stacked together and the mean is taken, which gives a better result than either version alone.

Q: What function cancels out weight decay?
A: BatchNorm

Q: What is a shortcut connection?
A: A shortcut connection is a way to create a 56 layer network with the same training dynamics as a 20 layer network by adding two convs and a skip connection, where the output is equal to conv2 of conv1 of in plus in.

7 Likes

I wonder if you are using the latest flexible learner (without the context manager) from the repo in your explanation?

Ah damn! The code evolves too fast :smiley:. I am using the version with the custom context manager but, as you rightly point out, it is not the latest anymore.

class _CbCtxInner:
    def __init__(self, outer, nm): self.outer,self.nm = outer,nm
    def __enter__(self): self.outer.callback(f'before_{self.nm}')
    def __exit__ (self, exc_type, exc_val, traceback):
        # e.g. CancelBatchException when nm=='batch'
        chk_exc = globals()[f'Cancel{self.nm.title()}Exception']
        try:
            # run the after_* callbacks only if nothing was raised,
            # and swallow the matching Cancel*Exception if it was
            if not exc_type: self.outer.callback(f'after_{self.nm}')
            return exc_type==chk_exc
        except chk_exc: pass
        finally: self.outer.callback(f'cleanup_{self.nm}')

I’m also curious what the main differences between fastai and miniai are

That’s a great question. I think you are right that type dispatch is not used. miniai doesn’t have much focus on the data side: we are treating images only, and I guess Jeremy didn’t want to go through the pain of figuring out other data types.

3 Likes

I have been cooking this for a while, but using miniai and some cool callbacks, I finally put together an article on how to create a super simple next-frame prediction pipeline using DDPM!

In the article we create a dataset that is basically MNIST digits moving on a canvas using torchvision.transforms.functional.affine, and create a sequence of frames:

Then we train a model using the DDPM notebooks as a guideline, slightly modifying the noisify function so we can use the same training code.
For more details, take a look at the article.
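
To give a flavour of the kind of change involved, here is a rough, hypothetical sketch (the actual modification in the article may differ): noise only the target frame and concatenate the clean previous frames as conditioning channels, keeping the DDPM notebook’s noisify signature.

import torch

def noisify_frames(frames, alphabar):
    # frames: (batch, n_frames, h, w); frames are treated as channels,
    # the last frame is the prediction target, the earlier ones condition the model
    past, x0 = frames[:, :-1], frames[:, -1:]
    t = torch.randint(0, len(alphabar), (len(x0),), device=x0.device)
    ab = alphabar[t].reshape(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = ab.sqrt()*x0 + (1-ab).sqrt()*eps            # standard DDPM noising of the target
    return (torch.cat([past, xt], dim=1), t), eps    # concat clean past frames with noised target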

This is a toy example, but it shows that a super simple concatenation of frames is “good enough” to produce reasonable results!

code

7 Likes

I modified my fast neural style transfer notebook to use miniai. The notebook is a bit of a mess, but I wanted to share it before I left for the weekend in case anyone wanted to play around with it. I will clean it up when I get back.

I added a VGGStyleLoss class to calculate the loss, using hooks to get the VGG features.

I made a StyleOptCB that inherits from the MixedPrecision callback, because the bmm operation (whether done with bmm or einsum ‘bfs,bgs->bfg’) does not work in half precision. I call self.autocast.__exit__() before calculating the loss to get around this.
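
Roughly, the idea looks like this (a hypothetical sketch, assuming miniai’s MixedPrecision callback stores the autocast context as self.autocast and normally exits it in after_loss; my actual callback does more than this):

class StyleOptCB(MixedPrecision):
    def after_predict(self, learn):
        self.autocast.__exit__(None, None, None)  # leave fp16 before the style loss
    def after_loss(self, learn): pass             # context already exited above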

I like seeing the progress of the generated images, so I also added an ImageCheckpointCB and a ModelCheckpointCB to save sample images and model checkpoints at specified intervals.

Here is a random sample output:

Style Image:

You can download the dataset from Kaggle or swap it for another one.

11 Likes

Beautiful results!

I’ve been experimenting with the technique outlined in JohnO’s article, Mid-U Guidance: Fast Classifier Guidance for Latent Diffusion Models.

But instead of a “classifier” model, I’ve used an image regression model trained on a head-pose dataset to predict head position, orientation and scale, and I’m using that to guide stable diffusion inference.

In these sample images, I’m using my model to steer stable diffusion to center the image and to “pose” the generated image in a given direction, not with a text prompt but by using my model to steer the orientation with numeric targets for pitch, yaw, x, y and scale.

Still working on it (and having fun with it!)… but JohnO’s MIDU approach is ingenious and can be used to drive stable diffusion inference with just about any image model.
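
For anyone curious, the guidance signal is conceptually something like this (a very rough, hypothetical sketch with made-up names, not my actual code): a head-pose regressor applied to the U-Net mid-block features, scored against numeric pose targets, whose gradient then nudges the latents at each denoising step.

import torch
import torch.nn.functional as F

target_pose = torch.tensor([0.0, 0.2, 0.5, 0.5, 1.0])   # pitch, yaw, x, y, scale

def pose_guidance_loss(mid_block_features, pose_model):
    pred = pose_model(mid_block_features)                # -> (batch, 5) pose estimate
    return F.mse_loss(pred, target_pose.to(pred.device).expand_as(pred))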

Prompt: ‘Photo of a woman’

Prompt: “Photo of a woman”

Prompt: “magazine with a woman on the cover”

Prompt: ‘Photo of a man’

Prompt: ‘Photo of a man’

9 Likes

More of my experiments using JohnO’s MidU technique along with my head-pose image regression model.

Guiding Stable Diffusion to generate heads (numeric targets) with the specified position and pose. This time interpolating through the range of numeric pitch and yaw “angles” while keeping the generated head centered…

This is using the stock Stable Diffusion model… no fine-tuning on heads or anything else… just done at inference time…

Amazing work JohnO!

I’m not controlling for head roll, to give more natural-looking poses…

Next step maybe eye direction in addition to head pose… lots of fun!

text prompt: Photo of a young woman

7 Likes

I got sidetracked with some experiments, but I’ll upload the cleaned-up notebook tomorrow. Getting the notebook to work on Kaggle was also a bit annoying, specifically with PyTorch 1.13. I ended up downgrading to PyTorch 1.12 for the Kaggle notebook.

One of the things I tried was using the convnext_nano model from the timm library instead of the VGG models. I’m unsure if I like the results more, but the colors seem closer to the style image.

I have not had a chance to test many layer combinations for the convnext_nano model. The above sample is from using all the GELU activation layers and the loss scaling described by @matdmiller earlier.
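
For reference, grabbing the GELU activations from the timm model can be done with hooks along these lines (a sketch assuming the timm library; not necessarily how my notebook does it):

import timm, torch
import torch.nn as nn

net = timm.create_model('convnext_nano', pretrained=True).eval()
for p in net.parameters(): p.requires_grad_(False)
gelu_layers = [m for m in net.modules() if isinstance(m, nn.GELU)]

def convnext_features(im):
    feats = []
    hooks = [m.register_forward_hook(lambda mod, inp, out: feats.append(out))
             for m in gelu_layers]
    try: net(im[None] if im.ndim == 3 else im)
    finally:
        for h in hooks: h.remove()
    return feats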

Training time is about as long with the convnext_nano model as with the VGG19 model.

6 Likes

Ok, here is the link to the cleaned-up notebook.

Also, I tried one of the new convnextv2_nano models with the same style and content weights and all the GELU layers and got wildly different (and bad) results. I did not have a chance to investigate why, though.

3 Likes

Another approach is to use a generative AI model to derive questions from the transcript. Then you can train a Q&A model. This avoids the cost of seeding the prompt for every question with context. It also overcomes the limits of semantic search. There is a good blog post on this approach if you’re interested: Improving Search Ranking with Few-Shot Prompting of LLMs | Vespa Blog

1 Like

It took me “a bit of time” but eventually I made it!
The quarto blog is live at https://artste.github.io/blog/posts/real-images-island.

There is a little introduction to some of the techniques and a deep dive into the ideas I’ve tried.

16 Likes

Wow amazing! Do you have a tweet I can share?

4 Likes