The article from the first link is interesting – they’re trying to predict parts of an image given a starting point (e.g. where a cat’s ears should be if you have its face).
This sounds like a thought I had recently while working on a project: can you predict features (houses, trees, etc.) in an image just like a language model predicts words? Then you could train a general-purpose vision model that understands what the world looks like and how things fit together. You’d use images themselves as their own labels.
I don’t see why not. The basic look of a house or a face, and its relationship to its surroundings, is an encoded concept, just like a word or phrase in a sentence or paragraph. Language models work w/ series of numericalized words – numbers – and pixels are just another organization of numbers: stacks of 2D arrays.
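To make the analogy concrete, here’s a minimal sketch (the tiny vocabulary and pixel values are made up): a sentence becomes a 1-D list of numbers, while a grayscale image is already a 2-D grid of numbers – and an RGB image is just three of those grids stacked.

```python
# Toy vocabulary mapping words to ids (made-up example)
vocab = {"the": 0, "cat": 1, "sat": 2}
sentence = ["the", "cat", "sat"]
token_ids = [vocab[w] for w in sentence]  # a 1-D sequence of numbers

# A tiny 4x4 grayscale "image": one 2-D array of pixel intensities;
# an RGB image would be three of these stacked into a 3-D array.
image = [
    [0, 10, 10, 0],
    [10, 255, 255, 10],
    [10, 255, 255, 10],
    [0, 10, 10, 0],
]

print(token_ids)                   # [0, 1, 2]
print(len(image), len(image[0]))   # 4 4  (height, width)
```

Either way the model only ever sees numbers; the difference is how they’re arranged.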
It also makes sense b/c it fits a trend of starting w/ discrete, manual methods and moving towards implicitly learned ones: NLP used to focus heavily on n-grams (groups of 2, 3, etc. words) and similar techniques; that evolved into word embeddings, and finally full language models. Most vision models are pretrained on ImageNet: 1,000 classes across 1+ million images. The idea is that the model learns relationships about the visual world as a byproduct of classification. What if, like language models, we went straight to learning those relationships from the images themselves – utilizing all the data in an image, not just what’s needed to identify a dog or a bus?
Usually you hear about ideas maturing in vision and moving into NLP; it’s interesting to see it go the other way.
Now how would you go about it? It’s very easy to get computationally overwhelmed here. Maybe a heatmap-based method: the model is given a part of the image and progressively predicts the regions around it. Language models only need to predict the next word b/c sentences are 1-D ‘on paper’, so it’d make sense to use a technique that takes advantage of the 2D+ shape of visual data.
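One cheap way to sketch the “predict outward from a starting point” idea (the grid size, function name, and distance metric here are my own illustrative choices, not from any particular paper): split the image into a grid of patches, pick a seed patch, and order the rest by distance from the seed. Training targets would then be “next patch outward” instead of “next word”.

```python
# Sketch of a 2-D prediction order: start from a seed patch and expand
# outward in rings – a 2-D analog of next-word prediction.
# Using Chebyshev distance so each "ring" is a square of neighbors.

def outward_order(rows, cols, seed):
    """Return all (row, col) patch coordinates sorted by their
    Chebyshev distance from the seed patch (nearest rings first)."""
    sr, sc = seed
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    return sorted(coords, key=lambda rc: max(abs(rc[0] - sr), abs(rc[1] - sc)))

order = outward_order(4, 4, seed=(1, 1))
print(order[0])    # the seed patch itself comes first: (1, 1)
print(order[1:9])  # its 8 immediate neighbors form the next ring
```

At each step the model would condition on the patches already “revealed” and predict the next one in this order – same autoregressive recipe, just with a 2-D notion of “next”.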
Anyway, those are a few thoughts.