If anybody else is interested, feel free to share questions or insights on here. I am planning on implementing this and will post here as I figure things out!
I saw that too and it looked really interesting, but I wasn’t sure how to even get started on that one, since it talks about using multiple text encoders and mixing their outputs to guide the diffusion model, or something like that?
But if you have an idea of how to get started, I’d be interested in participating …
I’m still going to try implementing a version of this, but I’m not sure what it will look like yet. The massive amount of hardware really doesn’t seem to be the important concept coming out of this paper, though.
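My rough mental model of the text-encoder mixing part is just: run the prompt through two (or more) pretrained encoders and hand the combined embeddings to the denoiser as cross-attention context. Here’s a minimal sketch of that idea, assuming Hugging Face CLIP and T5 encoders; the projection layers, the concatenation, and the model names are my own guesses, not the paper’s actual recipe:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

# Two off-the-shelf text encoders (these checkpoints are just what I'd try first).
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-base")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-base")

# Project both embedding spaces to one width so the denoiser's cross-attention
# can treat them as a single token sequence (768 is an arbitrary choice here).
clip_proj = torch.nn.Linear(clip_enc.config.hidden_size, 768)
t5_proj = torch.nn.Linear(t5_enc.config.hidden_size, 768)

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    clip_ids = clip_tok(prompt, padding="max_length", truncation=True,
                        return_tensors="pt").input_ids
    t5_ids = t5_tok(prompt, padding="max_length", max_length=77, truncation=True,
                    return_tensors="pt").input_ids

    clip_states = clip_enc(clip_ids).last_hidden_state  # (1, 77, hidden)
    t5_states = t5_enc(t5_ids).last_hidden_state        # (1, 77, hidden)

    # "Mixing" here is just concatenation along the sequence dimension; the
    # denoiser's cross-attention then attends over both encoders' tokens.
    return torch.cat([clip_proj(clip_states), t5_proj(t5_states)], dim=1)

context = encode_prompt("a watercolor painting of a fox in the snow")
print(context.shape)  # this would be passed to the UNet as cross-attention context
```

The untrained projection layers obviously only make sense once the whole thing is trained end to end; I’m just trying to show where the two encoders’ outputs would meet.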
Ah, I didn’t notice that part at all. I generally skim through papers to see if I can implement something, and if it doesn’t make a lot of sense to me how I could get started, then I stop …
Something with similarly good output, but possibly easier to implement, is UPainting. I considered trying to start a collaboration on that one, but again I didn’t know how to get started, because the addition there is an image-text matching component and I wasn’t sure what that looked like …
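If I had to guess, the image-text matching piece is some CLIP-style similarity score between the generated image and the prompt that gets used to steer things. Here’s roughly how I’d picture it, as a gradient-based nudge during sampling; to be clear, the CLIP checkpoint and the gradient trick are my assumptions, not what the paper actually describes:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def matching_grad(image: torch.Tensor, prompt: str) -> torch.Tensor:
    """image: (1, 3, 224, 224), already resized and CLIP-normalized upstream."""
    image = image.clone().requires_grad_(True)
    text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)

    image_emb = clip.get_image_features(pixel_values=image)
    text_emb = clip.get_text_features(**text_inputs)

    # Cosine similarity acts as the image-text matching score we want to increase.
    score = torch.nn.functional.cosine_similarity(image_emb, text_emb).sum()
    score.backward()
    # A scaled version of this gradient could be added to the sampler's update
    # to pull the image toward the prompt.
    return image.grad
```

Again, just a sketch of the general "match the image to the text and push in that direction" idea so we have something concrete to argue about.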
I believe there’s an implementation of Imagen around … If I recall correctly it’s by LucidRains(?), though I’m going off memory here. You have to train it on your own data, and that was something I tried to get going on Apple MPS way back when … Let me see if I can find the repo; it might be helpful for your research?