Some general things I would do in this case:
- For images, train your CNN model independently, using only the images and CrossEntropyLoss. Once you have the best model you can get, remove the linear layers, take the output of the last conv layer, flatten it, and use it as input features for your other models. (You can also keep the linear layers and use their output instead; that is your choice.)
- Generally, you then train your final model with an ML model such as LightGBM or CatBoost, which takes the CNN activations together with the other tabular data as input.
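To make the two steps concrete, here is a minimal sketch in PyTorch. Everything in it is an assumption for illustration: `SmallCNN`, `extract_features`, the layer sizes, and the random data are hypothetical placeholders, not your actual model or dataset. The feature block also ends with a pooling layer after the last conv layer, just to keep the flattened vector small.

```python
import torch
import torch.nn as nn


class SmallCNN(nn.Module):
    """Hypothetical image classifier: a conv feature extractor plus a linear head."""

    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(2),  # assumption: pool to 2x2 so the flattened vector stays small
        )
        self.classifier = nn.Linear(16 * 2 * 2, num_classes)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))


def extract_features(model: SmallCNN, images: torch.Tensor) -> torch.Tensor:
    """Skip the linear head and return the flattened conv activations."""
    model.eval()
    with torch.no_grad():
        return torch.flatten(model.features(images), 1)


torch.manual_seed(0)
images = torch.randn(4, 3, 32, 32)    # placeholder image batch
labels = torch.tensor([0, 1, 2, 0])   # placeholder class labels
tabular = torch.randn(4, 5)           # placeholder tabular features

# Step 1: train the CNN on images only with CrossEntropyLoss (a single step shown here).
model = SmallCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Step 2: drop the linear layers and join the conv activations with the tabular columns.
cnn_features = extract_features(model, images)   # shape (4, 64)
X = torch.cat([cnn_features, tabular], dim=1)    # shape (4, 69)
```

The combined matrix `X` (converted with `X.numpy()`) is what you would then pass to the final model, for example `lightgbm.LGBMClassifier().fit(X.numpy(), labels.numpy())`, assuming LightGBM is installed. In a real setup you would extract the features once for the whole dataset and cache them, rather than recomputing them for every experiment.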
The approach you followed for the competition is a good one.