Good dataset to explore mixing Images, tabular and text data?

I am looking for an open dataset to write a tutorial on how to use the new MixedItemList. Ideally the dataset would contain images, text associated with those images, and tabular data, with either a classification or a regression task. I wrote a working example with private data, but I would like to write the tutorial with an open dataset so people can test my code with it.


Kaggle pet classifier
It has images of the pets, their descriptions as text, and some tabular data with additional information.

The PetFinder competition should have all 3 datatypes that you’re looking for


I’m very excited to see what you’ve been up to with this. I tried some of the code for the mixed model that you shared in other threads (Pre-trained text encoder), but just wasn’t able to get it to work with text.

Hey, sorry for the delay. I am working on this on nights and weekends, and I am having difficulty finding time.

But I got a model training using Image + tabular + text to train on a private dataset from work. Unfortunately adding the text part didn’t really add much more accuracy and added a lot of complexity to my code. So I want to test it using some other dataset to see if this could help because my text fragments for each row of my dataset are not that long.

But in summary, we have to merge the different sources of data into a single databunch. To do that we have a new MixedItemList, which can put several ItemLists together. So for example:

imgList = ImageList.from_df(pictures, path=path, cols='PicturePath')
tabList = TabularList.from_df(pictures, cat_names=cat_names, cont_names=cont_names, procs=procs, path=path)
textList = TextList.from_df(pictures, cols='NameDefault', path=path, vocab=vocab)

mixed = (MixedItemList([imgList, tabList, textList], path, inner_df=imgList.inner_df)
            .transform([[get_transforms()[0], [], []], [get_transforms()[1], [], []]], size=size))

data = mixed.databunch(bs=bs, collate_fn=collate_mixed)
data.add_tfm(norm) # normalize images

There are two things here I need to explain more: collate_mixed and data.add_tfm(norm). For data.add_tfm, you have to write your own image normalization function, because the images now arrive in a batch mixed with the tabular and text data.

I define this normalization function like so:

def _normalize_images_batch(b:Tuple[Tensor,Tensor], mean:FloatTensor, std:FloatTensor)->Tuple[Tensor,Tensor]:
    "`b` = `x`,`y` - normalize `x` array of imgs and `do_y` optionally `y`."
    x,y = b
    mean,std = mean.to(x[0][0].device),std.to(x[0][0].device)
    x[0][0] = normalize(x[0][0],mean,std)
    return x,y

def normalize_custom_funcs(mean:FloatTensor, std:FloatTensor, do_x:bool=True, do_y:bool=False)->Tuple[Callable,Callable]:
    "Create normalize/denormalize func using `mean` and `std`, can specify `do_y` and `device`."
    mean,std = tensor(mean),tensor(std)
    return (partial(_normalize_images_batch, mean=mean, std=std),
            partial(denormalize, mean=mean, std=std))

norm, denorm = normalize_custom_funcs(*imagenet_stats)
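
To make the idea concrete without fastai, here is a minimal framework-free sketch of normalizing only the image part of a mixed sample. The sample layout (image, tabular, text) and the helper names `normalize_image` and `normalize_mixed` are assumptions for illustration only; the real batch holds tensors, and its layout depends on how the MixedItemList was built.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]  # one list of pixel values per channel

def normalize_image(img: Image, mean: Sequence[float], std: Sequence[float]) -> Image:
    """Return (pixel - mean) / std, applied channel by channel."""
    return [[(p - m) / s for p in channel]
            for channel, m, s in zip(img, mean, std)]

def normalize_mixed(sample: Tuple[Image, list, list],
                    mean: Sequence[float],
                    std: Sequence[float]) -> Tuple[Image, list, list]:
    """Normalize the image and pass tabular and text data through untouched."""
    img, tab, text = sample
    return normalize_image(img, mean, std), tab, text

# Hypothetical sample: a 3-channel image with 2 pixels per channel,
# two tabular values and three token ids.
sample = ([[0.5, 1.0], [0.5, 0.0], [0.25, 0.75]], [3, 0.7], [12, 7, 9])
mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]

img, tab, text = normalize_mixed(sample, mean, std)
print(img)   # image values rescaled per channel
print(text)  # token ids are unchanged
```

The point of the design is that the normalization function has to know where the image sits inside the mixed input and leave everything else alone.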

Then you have to write a collate function, collate_mixed. Since the text in a batch has variable length, every text in the batch has to be padded to the same size. So you find the longest sentence in the batch and pad every other text in that batch to its length:

def collate_mixed(samples, pad_idx:int=0):
    # Find max length of the text from the MixedItemList
    max_len = max([len(s[0].data[2]) for s in samples])

    for s in samples:
        res = np.full(max_len, pad_idx, dtype=np.int64)  # array of length max_len, filled with pad_idx
        res[:len(s[0].data[2])] = s[0].data[2]
        s[0].data[2] = res

    return data_collate(samples)
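
The padding step on its own can be sketched in plain Python. This is a simplified illustration, not fastai's implementation: `pad_batch` is a hypothetical helper, and it pads on the right. Note that `pad_idx` is the fill value for the padded positions and does not change the target length.

```python
from typing import List, Sequence

def pad_batch(seqs: Sequence[Sequence[int]], pad_idx: int = 0) -> List[List[int]]:
    """Right-pad every token sequence with `pad_idx` up to the longest one."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_idx] * (max_len - len(s)) for s in seqs]

# Three token sequences of different lengths, padded with token id 1.
batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_batch(batch, pad_idx=1)
print(padded)
```

Once every sequence in the batch has the same length, the batch can be stacked into a single tensor by the default collate function.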

Then you need to make a model that can actually process the data your databunch is going to provide. My code is really not pretty at the moment for that because I was just prototyping. But if you are interested here it is until I clean it up:

class ImageTabularTextModel(nn.Module):
    "Model combining an image CNN, a tabular model and a text encoder."
    def __init__(self, emb_szs:ListSizes, n_cont:int, layers:Collection[int], vocab_sz:int, encoder):
        super().__init__()
        self.cnn = create_body(models.resnet34)
        nf = num_features_model(self.cnn) * 2
        l = [400 * 3] + [512]
        ps = [.4]
        self.lm_encoder = SequentialRNN(encoder[0], PoolingLinearClassifier(l, ps))
        # The attribute names `tab` and `final` are assumed; they were lost from the post.
        self.tab = TabularModel(emb_szs, n_cont, 512, layers)

        self.reduce = nn.Sequential(*([AdaptiveConcatPool2d(), Flatten()] + bn_drop_lin(nf, 512, bn=True, p=0.5, actn=nn.ReLU(inplace=True))))
        self.merge = nn.Sequential(*bn_drop_lin(512 + 512 + 512, 512, bn=True, p=0.5, actn=nn.ReLU(inplace=True)))
        self.final = nn.Sequential(*bn_drop_lin(512, 2, bn=True, p=0., actn=nn.ReLU(inplace=True)))

    def forward(self, img:Tensor, x:Tensor, text:Tensor) -> Tensor:
        imgLatent = self.reduce(self.cnn(img))
        tabLatent = self.tab(x[0], x[1])
        textLatent = self.lm_encoder(text)[0]

        # Concatenate the three 512-wide latent vectors along the feature dimension
        cat = torch.cat([imgLatent, tabLatent, textLatent], dim=1)
        # Assumed ending: the post stops after the concatenation
        return self.final(self.merge(cat))

    def reset(self):
        for c in self.children():
            if hasattr(c, 'reset'): c.reset()

Then you need a custom learner with some additional goodies, like an RNNTrainer to reset the hidden state of the RNN between epochs, and you also need a split function that splits the model into layer groups so that the fastai freeze function works…

def split_layers(model:nn.Module) -> List[nn.Module]:
    groups = [[model.cnn, model.lm_encoder]]
    groups += [[model.tab, model.reduce, model.merge, model.final]]
    return groups

class ImageTabularTextLearner(Learner):
    def __init__(self, data:DataBunch, model:nn.Module, alpha:float=2., beta:float=1., **learn_kwargs):
        super().__init__(data, model, **learn_kwargs)
        self.callbacks.append(RNNTrainer(self, alpha=alpha, beta=beta))

For the record, I had problems with RNNTrainer… I made a custom version of it that only resets the RNN's hidden state after each epoch, because the other functions in it were making it crash… I might have to revisit that in the future; this may be why adding text to my model didn't really make a difference.

And then the last piece is a method creating the learner but also reusing stuff from fastai to get a pre-trained language model… The way I do it is fairly hacky… I use the fastai method text_classifier_learner to get a text classifier learner and just grab the model from it and pass it to my own model instance so that it can use the encoder from it.

def image_tabular_text_learner(mixed, data, len_cont_names, vocab_sz, data_lm):
    l = text_classifier_learner(data_lm, AWD_LSTM, drop_mult=0.3)

    emb = mixed.train.x.item_lists[1].get_emb_szs()
    model = ImageTabularTextModel(emb, len_cont_names, [1000, 500], vocab_sz, l.model)

    learn = ImageTabularTextLearner(data, model, metrics=[accuracy],
                    callback_fns=[partial(EarlyStoppingCallback, monitor='accuracy', min_delta=0.005, patience=3)])
    return learn

I know that the code is all over the place, but until I find time to put all this together in a coherent notebook I hope this can help you investigate things on your side.


Another small note is that TabularModel from fastai doesn’t include an activation after its last layer (there’s no ReLU), because that layer is usually the output layer of a tabular-only network. Since I am using it as a sub-module in my own model, this is a problem. I made a custom class where I add the activation manually, but I think the best way would be to make a pull request to fastai adding an optional parameter that controls whether it is included.
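
As a rough, framework-free sketch of that workaround: wrap the sub-model so that its output always passes through an activation. Everything here is hypothetical and for illustration only (`WithActivation`, `relu` and the stand-in `toy_tabular` are not fastai names); in PyTorch the wrapper would be an `nn.Module` whose `forward` does the same thing.

```python
from typing import Callable, List, Sequence

def relu(values: Sequence[float]) -> List[float]:
    """Clamp negative values to zero."""
    return [max(0.0, v) for v in values]

class WithActivation:
    """Wrap a model so its output always passes through an activation."""
    def __init__(self,
                 body: Callable[..., Sequence[float]],
                 activation: Callable[[Sequence[float]], List[float]] = relu):
        self.body = body
        self.activation = activation

    def __call__(self, *inputs) -> List[float]:
        # Forward all inputs (e.g. categorical and continuous) to the wrapped model
        return self.activation(self.body(*inputs))

def toy_tabular(x_cat: Sequence[int], x_cont: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a tabular model ending in a bare linear layer."""
    return [sum(x_cat) - 2.0, sum(x_cont) - 1.0]

tab_with_relu = WithActivation(toy_tabular)
out = tab_with_relu([1, 0], [0.25, 1.25])
print(out)  # the negative first output is clamped to zero
```

A wrapper keeps the original model untouched, which is convenient as long as the library has no option for a final activation.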

For anyone interested, I released my complete code for this and I am looking for any input on how to make it better: