Are there any models or lines of research that take input from multiple types of data to train a model? For example, a language model that also uses pictures of the words being read, to give the model a second layer of understanding. Or text plus audio, to build speech output that doesn't sound robotic. Another would be generating a person's lip movements as they talk, given the transcript of what is being said and visuals of what their lips look like. Just curious whether these are being combined currently or whether they're pretty siloed at the moment.
You might want to read this: https://research.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html — the folks there combined data from different modalities (audio plus video of speakers' faces).
That's exactly the type of thing I was looking for. Thanks for sharing. Here's the paper behind that post, with more technical details: https://arxiv.org/pdf/1804.03619.pdf
Pretty much all the model families we've looked at can be combined in this way. Give it a try!
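To make the "combining modalities" idea concrete, here's a minimal sketch of late fusion, one common pattern in multimodal models: each modality gets its own encoder, the resulting embeddings are concatenated, and a shared layer maps the joint vector to whatever the task needs. The encoders below are toy stand-ins (random token embeddings for text, FFT magnitudes for audio), not any real model from the linked paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(tokens, dim=16):
    # Toy text encoder: average random per-token embeddings.
    # A real system would use a trained language model here.
    table = rng.normal(size=(1000, dim))
    return table[np.asarray(tokens) % 1000].mean(axis=0)

def encode_audio(waveform, dim=16):
    # Toy audio encoder: low-frequency FFT magnitudes as features.
    # A real system would use a trained speech encoder here.
    mags = np.abs(np.fft.rfft(waveform))[:dim]
    out = np.zeros(dim)
    out[:len(mags)] = mags
    return out

def fuse(text_vec, audio_vec, weights):
    # Late fusion: concatenate the two modality embeddings,
    # then apply one shared linear projection.
    joint = np.concatenate([text_vec, audio_vec])
    return weights @ joint

text_vec = encode_text([5, 42, 7])
audio_vec = encode_audio(np.sin(np.linspace(0, 10, 256)))
weights = rng.normal(size=(8, text_vec.size + audio_vec.size))
joint_repr = fuse(text_vec, audio_vec, weights)
print(joint_repr.shape)  # (8,)
```

The same skeleton covers your examples: text + audio encoders for natural-sounding speech, or transcript + face-video encoders for lip movement, with the fusion layer trained end to end on paired data.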