Wrote notebook to merge image + tabular + text data in one neural network

I wanted to explore merging image data with tabular and text data using transfer learning for the image and text part and trying to apply everything I learned in the course. @sgugger was kind enough to implement a MixedItemList to be able to “easily” do that.

I used the recently finished PetFinder Kaggle competition since it had several images per pet, tabular information for each pet, and also a text description. I provide the code for that here:

This is my first Kaggle competition and also my first custom model… So even though I got a quadratic kappa score of 0.42323 on my own rudimentary validation set (no fancy cross-validation), I did not do super well on the private leaderboard, where I scored 0.25780.
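For readers unfamiliar with the metric, here is a minimal pure-Python sketch of quadratic weighted kappa. It is not the code from the notebook; the function name and the assumption that labels are integers in `[0, n_classes)` are mine, for illustration only.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic weights for integer labels in [0, n_classes)."""
    n = len(y_true)
    # observed[i][j] counts samples with true label i and predicted label j.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 5)
chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2)
```

Because the penalty grows with the squared distance between classes, predicting 4 when the truth is 0 costs far more than predicting 1. The sketch does not guard against the degenerate case where every label and prediction fall in a single class, which makes the denominator zero.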

I am looking for comments and inputs on how to make the model better… I had to write custom methods to adapt to the tensor shape produced by MixedItemList, like custom normalization for the images, custom collate function, custom split_layers function. Not sure those are optimal. I would be really interested in having inputs on how to make it better, it could be a good tutorial for people wanting to leverage MixedItemList.


Some general things I think I would do in this case:

  • For images, train your CNN model independently, using only the images and CrossEntropyLoss. Once you have the best model you can get, remove all the linear layers, take the output of the last conv layer, flatten it, and use it in your other models. (You can also use the linear layers; it is your choice.)
  • Generally, you then train your final model with something like LightGBM or CatBoost, using the CNN activations along with the other tabular data.
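The two steps above can be sketched roughly as follows in plain PyTorch. Everything here is a stand-in I made up for illustration: `TinyCNN` replaces a real fine-tuned classifier, the inputs are random tensors, and the conv output is average-pooled before flattening to keep the feature vector small.

```python
import torch
import torch.nn as nn


class TinyCNN(nn.Module):
    """Hypothetical stand-in for an image classifier trained on images alone."""

    def __init__(self, n_classes=5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool the last conv output to 16 x 1 x 1
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):
        return self.head(torch.flatten(self.body(x), 1))


cnn = TinyCNN()
# ... the classifier would be trained here with nn.CrossEntropyLoss ...

cnn.eval()
images = torch.randn(4, 3, 32, 32)  # fake batch of 4 RGB images
tabular = torch.randn(4, 6)         # fake batch with 6 tabular columns
with torch.no_grad():
    # Skip the linear head: keep only the flattened output of the conv body.
    image_features = torch.flatten(cnn.body(images), 1)  # shape (4, 16)

# One row per pet: CNN activations side by side with the tabular columns.
features = torch.cat([image_features, tabular], dim=1)
```

The resulting `features` matrix is what would then be handed to a boosting library such as LightGBM or CatBoost; that last step is left out here to keep the sketch free of extra dependencies.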

It is a good approach that you followed for the competition.


Thank you for your input! I am still learning data science in general (and am especially new to Kaggle competitions), but I saw that pretty much everyone did what you mentioned: extract features from a CNN, extract features from an RNN for the descriptions, then use those with the structured data in a boosting algorithm.

I wanted to explore if we could have better performance by putting everything in one neural network and let it figure it out. I saw that a fairly recent kaggle competition winner used a similar network architecture to mine: https://www.kaggle.com/c/avito-demand-prediction/discussion/59880

My main problem right now is that my model overfitted almost immediately. After 2-3 epochs, the validation loss always became worse than the training loss. I tried several combinations of dropout and weight decay, but I could not seem to find the right balance. I guess this is where being a wizard in the black art of deep learning model tuning and debugging comes in handy. I am not sure what the best way to find the right balance would be, other than just trying things randomly until something works.


The problem with training the complete thing as a single model is that it is very difficult to train. So, as a good practice, you should train these models independently first and then join them.

Even in your case, you have mainly 3 different networks. You should first try to train them independently and then join them into a single model, as you are doing right now.
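A minimal sketch of that "train separately, then join" idea in plain PyTorch, not the MixedItemList code from the notebook. The three encoders are tiny hypothetical stand-ins (an `nn.EmbeddingBag` replaces a real RNN text encoder), and `FusionNet` and `freeze` are names I introduced for illustration.

```python
import torch
import torch.nn as nn


class FusionNet(nn.Module):
    """Concatenates the outputs of three encoders and classifies the result."""

    def __init__(self, image_enc, text_enc, tab_enc, n_features, n_classes=5):
        super().__init__()
        self.image_enc, self.text_enc, self.tab_enc = image_enc, text_enc, tab_enc
        self.head = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(32, n_classes),
        )

    def forward(self, image, tokens, tabular):
        parts = [self.image_enc(image), self.text_enc(tokens), self.tab_enc(tabular)]
        return self.head(torch.cat(parts, dim=1))


def freeze(module):
    """Stop gradient updates for every parameter of the module."""
    for p in module.parameters():
        p.requires_grad = False


# Stand-ins for encoders that would each have been trained on their own first.
image_enc = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten())  # -> 8 features
text_enc = nn.EmbeddingBag(100, 12)  # mean of token embeddings -> 12 features
tab_enc = nn.Sequential(nn.Linear(6, 10), nn.ReLU())              # -> 10 features

for enc in (image_enc, text_enc, tab_enc):
    freeze(enc)

model = FusionNet(image_enc, text_enc, tab_enc, n_features=8 + 12 + 10)
logits = model(torch.randn(4, 3, 32, 32),        # fake images
               torch.randint(0, 100, (4, 20)),   # fake token ids
               torch.randn(4, 6))                # fake tabular columns
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

With the encoders frozen, only the joint head is trained at first; the encoders could be unfrozen later for fine-tuning at a lower learning rate, much like gradual unfreezing in transfer learning.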


Interesting approach to the competition. As someone who also participated in it, I can tell you a quick improvement would be to drop the Rescuer ID column. This is because Rescuer ID has a strong influence on the target, but the training and test sets don’t share any IDs.
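One way to act on this, sketched in pure Python under my own assumptions: hold out whole rescuers for validation so that the validation set resembles the test set, then drop the column before training. The column name `RescuerID`, the helper names, and the toy rows are illustrative and not taken from the notebook.

```python
import random


def split_by_group(rows, group_key, valid_frac=0.2, seed=0):
    """Split rows so that no group value appears in both train and validation."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_valid = max(1, int(len(groups) * valid_frac))
    valid_groups = set(groups[:n_valid])
    train = [r for r in rows if r[group_key] not in valid_groups]
    valid = [r for r in rows if r[group_key] in valid_groups]
    return train, valid


def drop_column(rows, key):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in r.items() if k != key} for r in rows]


# Toy data: 20 pets spread over 5 rescuers.
rows = [{"RescuerID": f"r{i % 5}", "Age": i, "AdoptionSpeed": i % 5}
        for i in range(20)]

train, valid = split_by_group(rows, "RescuerID")
train_ids = {r["RescuerID"] for r in train}
valid_ids = {r["RescuerID"] for r in valid}

# The ID is only used to build the split; the model never sees it.
train_features = drop_column(train, "RescuerID")
valid_features = drop_column(valid, "RescuerID")
```

A random row-level split lets the model see the same rescuers in training and validation, which may be part of why a validation score can look much better than the leaderboard score.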


Hi guys,
I am currently doing my final year project titled “Real-time classification of technological state of machinery”. I need to determine the technological states of a specific type of machine in real time. My input data consists of multiple camera feeds and the real-time production speed. My target is the manually entered technological machine state. So far I have achieved the best results using the approach from the “Keras: Multiple Inputs and Mixed Data” article, but the resulting model doesn’t perform very well. I think the main problem is that the images come from stationary cameras, so my dataset consists of very similar images, while datasets in traditional classification problems are very diverse. At this point, I need advice from an experienced ML practitioner on how to tackle my problem. Any help or advice will be very welcome.

This is a super cool model @etremblay. Did you ever develop a method to get predictions for single inputs using learn.predict() or did you stick exclusively to batch predictions from a databunch?


Thanks! Unfortunately I was doing this model during my spare time and haven’t had time to come back to it since I posted it. But I wanted to revisit it because some of the solutions high up for PetFinder used a similar architecture. For example, the 6th place solution used a very similar model: https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/88690#latest-515029


@etremblay, I’ve checked your model - it’s really nice work.
I’m trying to make something similar to your solution (using text + text + image models), but I’ve got a problem. I want to use the SentencePiece processor (SPProcessor()), but it does not work with your solution. I think this is because of the architecture of MixedItemList. Maybe you know how to change it? Or maybe you tried this in your model?

Best regards

I get an error while creating the MixedItemList, which looks like:

'SPProcessor' object is not iterable

Hey @kontrabas,

Thanks! Unfortunately I haven’t used this particular library.

Hi @etremblay, Great job on your hybrid model!

I am dealing with a simpler situation that only involves tabular + text. Can your method be adapted to that situation by keeping the image part empty (e.g. image = []) or by deleting the image code?

I followed @muellerzr’s approach and came up with the notebook here: https://github.com/wjlgatech/fastai-multimodal/blob/main/fastai2_multimodal_tabtxt_public.ipynb

There is still a bug in the last step that I need to work out. I would much appreciate hearing your perspectives. Thank you!

I have not looked at this code in 2 years so I am not sure I would be of great help :).

But I saw this library in the past which I think seems very nice: https://github.com/jrzaurin/pytorch-widedeep

Basically you can mix and match tabular, text and image features all in one model. It seems very modular too, so you could probably just pass in tabular + text. It seems to use the fastai tokenizer and vocab too.

@etremblay Thank you for pointing me to the resource. It’s very helpful. I will keep you updated on how it works.
