Multimodal end-to-end deep learning

I think this approach makes a lot of sense, and there are some resources for it. For example, you can very easily train a multimodal model with AutoKeras: https://autokeras.com/tutorial/multi/

You could also just build two separate models, e.g. with Keras, then combine them with a concatenation layer and add a classification/regression head on top.
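Here is a rough sketch of what that two-branch setup could look like with the Keras functional API. The modalities (small images plus a tabular feature vector), the input shapes, the layer sizes, and the binary classification head are all assumptions made up for illustration, so swap in whatever fits your data. The data is random and only there to show that the model trains end to end.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical input shapes, chosen only for illustration.
IMAGE_SHAPE = (32, 32, 3)
NUM_TABULAR_FEATURES = 8


def build_multimodal_model():
    # Branch 1: a small CNN for the (assumed) image modality.
    image_in = keras.Input(shape=IMAGE_SHAPE, name="image")
    x = layers.Conv2D(16, 3, activation="relu")(image_in)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # Branch 2: a small MLP for the (assumed) tabular modality.
    tabular_in = keras.Input(shape=(NUM_TABULAR_FEATURES,), name="tabular")
    y = layers.Dense(16, activation="relu")(tabular_in)

    # Fuse both branches with a concatenation layer.
    merged = layers.Concatenate()([x, y])

    # Classification head (assumed binary). For regression, use
    # Dense(1) with no activation and a loss such as "mse".
    z = layers.Dense(32, activation="relu")(merged)
    out = layers.Dense(1, activation="sigmoid", name="label")(z)

    model = keras.Model(inputs=[image_in, tabular_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_multimodal_model()

# Random stand-in data, just to exercise the training loop.
rng = np.random.default_rng(0)
images = rng.random((64, *IMAGE_SHAPE), dtype=np.float32)
tabular = rng.random((64, NUM_TABULAR_FEATURES), dtype=np.float32)
labels = rng.integers(0, 2, size=(64, 1)).astype("float32")

# Both branches are trained jointly, end to end, from a single loss.
model.fit([images, tabular], labels, epochs=1, batch_size=16, verbose=0)
```

Because the two branches feed one loss, gradients flow back through the concatenation layer into both of them, which is what makes this end-to-end rather than two separately trained models stitched together afterwards.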