How to handle hierarchical tabular data in fastai v2

I am fairly familiar with fastai v1, but I am looking for resources that could help me get started with v2 in a scenario where I can’t really use the built-in high-level tabular data loaders. My tabular data has a parent-child relationship.

We have questionnaires that get answered for specific stores. We want to predict the score of this questionnaire the next time we go to this store. We have the individual answers for the questions for the questionnaire.

The number of questions per questionnaire can vary, and the order of the questions doesn’t matter for the final score. Using all the questions together with the higher-level information about the questionnaire, we want to predict the score for the next time we visit this store.

Are there any pointers to articles, tutorials, or interesting notebooks I could read to learn more about making custom data loaders and collate functions for this kind of data structure?

Thanks!

Note that the TabularDataLoader is only there to provide a fast iterator by reading the batch directly from the dataframe. The data block API or the mid-level API can probably help you assemble your data; you just need to write the transforms yourself, since your case seems a bit custom. I’m not sure you will need a subclass of DataLoader: any special padding you need when the samples are not the same size can be implemented in a before_batch hook. For tutorials look at

For custom padding, look at the source code of bb_pad_collate or pad_input.
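
As a rough illustration of the padding idea (plain PyTorch, not the fastai implementation of bb_pad_collate or pad_input; the helper name pad_sets is made up for this sketch), variable-sized sets of child rows could be padded to a common length, with a mask recording which rows are real:

```python
import torch

def pad_sets(sets, pad_value=0):
    """Pad a list of (n_i, d) tensors to one (B, max_n, d) tensor and return a (B, max_n) mask."""
    max_len = max(s.shape[0] for s in sets)
    feat_dim = sets[0].shape[1]
    # new_full keeps the dtype of the inputs (long for categories, float for continuous).
    padded = sets[0].new_full((len(sets), max_len, feat_dim), pad_value)
    mask = torch.zeros(len(sets), max_len, dtype=torch.bool)
    for i, s in enumerate(sets):
        padded[i, :s.shape[0]] = s      # copy the real rows
        mask[i, :s.shape[0]] = True     # mark them as real
    return padded, mask

# Three questionnaires with 2, 4 and 1 answered questions, 3 features each.
sets = [torch.randn(2, 3), torch.randn(4, 3), torch.randn(1, 3)]
padded, mask = pad_sets(sets)
print(padded.shape, mask.sum(dim=1))
```

A function of this shape could presumably be called from a before_batch hook; that wiring is not shown here. Keeping the mask matters because a zero row is otherwise hard to tell apart from real data.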

Thanks Sylvain!

Is there a DataBlock for tabular data? I could not find one by looking at the code and it says todo in the examples notebook.

With a tabular data block I could do something like this I suppose (where I could implement the tabular blocks):

DataBlock(blocks=(TabularBlock, TabularMultilinesBlock, RegressionBlock),
          get_items=get_tabular,
          splitter=RandomSplitter(),
          n_inp=2)

Here TabularBlock would return one row from a pandas dataframe, and TabularMultilinesBlock would return the multiple rows corresponding to the parent returned by the first TabularBlock. Then I could collate all of this in a before_batch.

Am I overcomplicating things?

Thanks,

No, there is none yet, and it’s not a priority right now as it’s only useful in multimodal settings (I think we’ll add this at some point, but probably not before the next course). But the transforms at the very end of tabular.core should be a good base for anyone who wants to implement a TabularBlock.

Alright I will look at that. You are talking about class ReadTabBatch(ItemTransform) right?

By multimodal you mean merging different kinds of data, right? I eventually wanted to re-implement in fastai2 my Frankenstein notebook that merged image, text and tabular data for the PetFinder competition using fastai v1. This could be a fun project.

Thanks!

No, ReadTabBatch is the one that works on batches directly; you would more likely need the ReadTabLine at the end.

Got it, I was reading the generated code in VS Code… Easier for navigation, but I guess I lose some things that are not exported. Thanks for the clarification.

OK, so I got the dataloader to work. Basically, my model needs information about the parent row and an array of the child rows. For the parent row I send the cat and cont tensors, and for each row in the array of child rows I send the cat and cont tensors too.

Here is the code for that if anyone is interested… It contains some specific stuff to my domain, like SurveyResultID stuff, but could be adapted by someone else if need be. SurveyResultID is simply the common key between my parent and children.

questions_tab = TabularPandas(questions, [Categorify, FillMissing, Normalize], questions_cat_names, questions_cont_names)
results_tab = TabularPandas(results, [Categorify, FillMissing, Normalize], results_cat_names, results_cont_names, y_names='NextScore')

class ReadMultiTabBatch(ReadTabBatch):
    def encodes(self, to):
        # to[0] holds the parent rows; to[1] is a list with one table of child rows per parent.
        parents = super(ReadMultiTabBatch, self).encodes(to[0])
        children = [super(ReadMultiTabBatch, self).encodes(x) for x in to[1]]
        # Pad every set of children with zero rows up to the largest set in the batch.
        max_len = max(len(c[0]) for c in children)
        for i, c in enumerate(children):
            cat, cont = c[0], c[1]
            new_cat = torch.zeros(max_len, cat.shape[1]).long()
            new_cat[:len(cat)] = cat
            new_cont = torch.zeros(max_len, cont.shape[1]).float()
            new_cont[:len(cont)] = cont
            children[i] = (new_cat, new_cont)

        # Inputs are (parent tensors, padded children); the last parent tensor is the target.
        return parents[:-1], children, parents[-1]

@delegates()
class TabParentChildDataLoader(TfmdDL):
    do_item = noops
    def __init__(self, dataset, children, bs=16, shuffle=False, after_batch=None, num_workers=0, **kwargs):
        if after_batch is None: after_batch = L(TransformBlock().batch_tfms)+ReadMultiTabBatch(dataset)
        super().__init__(dataset, bs=bs, shuffle=shuffle, after_batch=after_batch, num_workers=num_workers, **kwargs)
        self.children = children

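    def create_batch(self, b):
        # b is the list of row positions selected for this batch.
        parents = self.dataset.items.iloc[b]
        c = self.children.items
        # Keep only the child rows whose key matches a parent in this batch.
        c = c[c['SurveyResultID'].isin(parents['SurveyResultID'])]
        # One table of children per parent; passing index labels to iloc assumes a default RangeIndex.
        res = [self.children.iloc[c[c.SurveyResultID == x].index] for x in parents.SurveyResultID]
        return self.dataset.iloc[b], res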

splits = IndexSplitter(valid_indexes.tolist())(range_of(results))
dl = TabParentChildDataLoader(results_tab, questions_tab, splits=splits)
ds = Datasets(dl, splits=splits)
dls = ds.dataloaders()

I don’t have that much experience with PyTorch, but I am trying to figure out the best way to implement the Deep Sets paper for my child elements.

Basically, the child elements don’t have any order (they are questions in a questionnaire). Deep Sets simply runs a NN on each element of the set (in my case, the questions), takes the final representation the NN produces for each element, and sums them up.

I thought I could simply make a loop, but the model receives a batch of data… So this seems inefficient…
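
One way to avoid the loop, sketched here in plain PyTorch under the assumption that the children arrive as a padded (batch, set size, features) tensor together with the true set sizes: fold the set dimension into the batch dimension, run the per-question network once, unfold, and sum while masking out the padding. The names phi, lengths and mask are illustrative and do not come from the code above.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, N, D, H = 4, 5, 3, 8   # batch size, padded set size, features per question, hidden size
phi = nn.Sequential(nn.Linear(D, H), nn.ReLU(), nn.Linear(H, H))  # per-question network

x = torch.randn(B, N, D)                              # padded batch of question features
lengths = torch.tensor([5, 2, 3, 1])                  # real number of questions per questionnaire
mask = torch.arange(N)[None, :] < lengths[:, None]    # (B, N), True where the question is real

# Vectorised: treat every question as its own row, run phi once, then restore the set axis.
h = phi(x.reshape(B * N, D)).reshape(B, N, H)
pooled = (h * mask[..., None]).sum(dim=1)             # padded slots contribute zero to the sum

# Reference: the explicit loop over questionnaires that the vectorised version replaces.
looped = torch.stack([phi(x[i, :int(lengths[i])]).sum(dim=0) for i in range(B)])
print(torch.allclose(pooled, looped, atol=1e-5))
```

The explicit reshape matters mostly for networks that expect 2D input, for instance ones containing BatchNorm1d; with such layers the padded rows would also enter the batch statistics, which this sketch does not address.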

My model is defined like this so far:

results_emb_szs = get_emb_sz(results_tab)
questions_emb_szs = get_emb_sz(questions_tab)

class ParentChildModel(Module):
    def __init__(self):
        self.questions = TabularModel(questions_emb_szs, len(questions_cont_names), 100, [1000, 250], ps=[0.01, 0.1], embed_p=0.04)
        self.results = TabularModel(results_emb_szs, len(results_cont_names), 100, [1000, 250], ps=[0.01, 0.1], embed_p=0.04)

    def forward(self, data, children):
        parent_cat, parent_cont = data[0], data[1]
        results = self.results(parent_cat, parent_cont)
        
        return results

children in forward contains a batch of arrays of questions. For each element of the batch, and for each question in that element’s array, I want to pass the question through the self.questions model and then sum the representations.

Then I want to concatenate that with the output of the self.results model and pass the result to a final linear layer to predict the score of the questionnaire…

The only thing stopping me is how to handle the children…
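
A minimal sketch of how the architecture described above could look, in plain PyTorch and under several assumptions: the children have already been stacked into (batch, set size, columns) tensors, a boolean mask marks the real questions, and TinyTabEncoder is a hypothetical stand-in for TabularModel so that the example runs without fastai. None of these names come from the code above.

```python
import torch
from torch import nn

class TinyTabEncoder(nn.Module):
    """Hypothetical stand-in for TabularModel: embed categoricals, append continuous, apply an MLP."""
    def __init__(self, emb_szs, n_cont, out_sz):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        n_emb = sum(nf for _, nf in emb_szs)
        self.mlp = nn.Sequential(nn.Linear(n_emb + n_cont, 32), nn.ReLU(), nn.Linear(32, out_sz))

    def forward(self, x_cat, x_cont):
        embs = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]
        return self.mlp(torch.cat(embs + [x_cont], dim=1))

class ParentChildSketch(nn.Module):
    def __init__(self, parent_emb_szs, n_parent_cont, child_emb_szs, n_child_cont, rep_sz=16):
        super().__init__()
        self.results = TinyTabEncoder(parent_emb_szs, n_parent_cont, rep_sz)
        self.questions = TinyTabEncoder(child_emb_szs, n_child_cont, rep_sz)
        self.head = nn.Linear(2 * rep_sz, 1)   # final layer predicting the score

    def forward(self, parent_cat, parent_cont, child_cat, child_cont, child_mask):
        B, N = child_mask.shape
        parent_rep = self.results(parent_cat, parent_cont)                    # (B, rep_sz)
        # Fold (B, N) into a single batch axis so the question encoder runs once.
        child_rep = self.questions(child_cat.reshape(B * N, -1), child_cont.reshape(B * N, -1))
        child_rep = child_rep.reshape(B, N, -1) * child_mask[..., None]      # zero out padding
        pooled = child_rep.sum(dim=1)                                         # order-independent sum
        return self.head(torch.cat([parent_rep, pooled], dim=1))             # (B, 1)

torch.manual_seed(0)
B, N = 4, 6
model = ParentChildSketch([(10, 4)], 2, [(20, 5), (7, 3)], 3)
parent_cat, parent_cont = torch.randint(0, 10, (B, 1)), torch.randn(B, 2)
child_cat = torch.stack([torch.randint(0, 20, (B, N)), torch.randint(0, 7, (B, N))], dim=-1)
child_cont = torch.randn(B, N, 3)
child_mask = torch.arange(N)[None, :] < torch.tensor([6, 2, 4, 1])[:, None]
out = model(parent_cat, parent_cont, child_cat, child_cont, child_mask)
print(out.shape)
```

Because the pooling is a masked sum, shuffling the questions of a questionnaire or changing the values in the padded slots should leave the prediction unchanged. The forward signature differs from forward(self, data, children) above, so the batch produced by the data loader would need to be unpacked and stacked accordingly.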