Generate variations of an image using CLIP embeddings?

I just watched this YouTube video and was wondering if anyone has ideas about how the person in the video interpolated variations between the two input images (and later added and subtracted text embeddings). What does the inference flow / pipeline look like?

Encode both input images with CLIP's image encoder and interpolate between the two embedding vectors to get in-between variations. To steer the result further, add or subtract CLIP text embeddings from the interpolated vector before generating the image.
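For what it's worth, here is a rough sketch of just the embedding arithmetic, not necessarily what the video does. It assumes you already have CLIP image and text embeddings as 1-D vectors; random 512-dimensional vectors stand in for them here, and `slerp`, `shift_by_text`, and the `strength` parameter are names made up for illustration. Spherical interpolation is used because CLIP embeddings are usually compared by cosine similarity, though plain linear interpolation followed by re-normalizing may work too. Turning the resulting vector back into pixels needs a separate generator conditioned on CLIP embeddings, which is not shown.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def slerp(a, b, t):
    """Spherical interpolation between two embeddings, t in [0, 1]."""
    a, b = normalize(a), normalize(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel vectors: fall back to linear interpolation.
        # (Exactly opposite vectors have no unique path and are not handled.)
        return normalize((1.0 - t) * a + t * b)
    return (np.sin((1.0 - t) * omega) / sin_omega) * a + (np.sin(t * omega) / sin_omega) * b

def shift_by_text(image_emb, add_text_emb, sub_text_emb, strength=1.0):
    """Move an image embedding along the direction (add_text - sub_text)."""
    direction = normalize(add_text_emb) - normalize(sub_text_emb)
    return normalize(normalize(image_emb) + strength * direction)

# Placeholder vectors standing in for real CLIP outputs (assumption).
rng = np.random.default_rng(0)
img_a, img_b = rng.normal(size=512), rng.normal(size=512)
txt_add, txt_sub = rng.normal(size=512), rng.normal(size=512)

# Five embeddings from image A to image B, then a text-guided edit of the midpoint.
steps = [slerp(img_a, img_b, t) for t in np.linspace(0.0, 1.0, 5)]
edited = shift_by_text(steps[2], txt_add, txt_sub, strength=0.5)
```

Each vector in `steps` (and `edited`) would then be passed as the conditioning input to whatever image generator you are using.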

how do you go from embeddings back to an image tho?