I tried to generate images using the same prompt, but in different languages. (All translated from English using DeepL.) The first one was this prompt.
A picture of a town hall in a historical quarter of a city
Seems like the model aligned texts written in a given language with the most common pictures from that area. Except for Greek, which collapsed. I guess the language isn’t well represented in the dataset (?). Also, an interesting interpretation of Estonian town hall. Too many castle pictures associated with this language?
The second one I tried is this.
A crowded street in a big city on a winter morning
Again, the Greek one is rather cumbersome. Is it a kind of “averaged” tourist’s selfie? Not sure if Estonian landscapes really have such kind of mountains… Also, English, German, and French are captured pretty accurately. The most frequent language/image pairs in the dataset?