A walk with fastai2 - Tabular - Study Group and Online Lectures Megathread

Sure thing! This presumes you already know of the MixedDL here

So now let’s go through our steps.

  1. We’ll build the Tabular DLs and vision DLs we wish to use for our MixedDL.
  2. When we get to the Tabular portion, we will want to calculate the embedding matrix sizes. We do this with get_emb_sz(to) (with the to object being dl.train on the Tabular DL)
  3. We’ll make a Tabular Embedding only model, as this is all we want. The code looks like so:
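# Assumes the fastai star import (e.g. from fastai.tabular.all import *),
# which provides Module, Embedding, nn and torch
class TabularEmbeddingModel(Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, embed_p=0.):
        # One embedding layer per categorical variable
        self.embeds = nn.ModuleList([Embedding(ni, nf) for ni,nf in emb_szs])
        self.emb_drop = nn.Dropout(embed_p)
        # Total width of the concatenated embeddings (checked in forward)
        self.n_emb = sum(e.embedding_dim for e in self.embeds)

    def forward(self, x_cat, x_cont=None):
        # x_cont is accepted so a tabular batch can be passed as-is, but it is ignored
        if self.n_emb != 0:
            x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
            x = torch.cat(x, 1)
            x = self.emb_drop(x)
        return x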

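All this model does is take our input (which must be a tabular cat+cont if we’re following that example; if there are no continuous variables, an empty tensor is passed in) and return the concatenated embeddings of the categorical variables. The continuous input is ignored.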

So now we can build our model by passing in the emb_sz
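
For a rough sense of what get_emb_sz produces, here is a minimal pure-Python sketch of fastai's default rule of thumb for embedding widths, min(600, round(1.6 * n_cat**0.56)). The column names and cardinalities are made up for illustration; in practice get_emb_sz(to) reads the cardinalities from the TabularPandas object.

```python
def emb_sz_rule(n_cat: int) -> int:
    """Rule-of-thumb embedding width for a categorical column with n_cat classes."""
    return min(600, round(1.6 * n_cat ** 0.56))

# Hypothetical cardinalities (number of classes per categorical column).
cardinalities = {"Neighborhood": 26, "HouseStyle": 9, "SaleType": 10}

# One (num_classes, embedding_width) pair per categorical column.
emb_szs = [(n, emb_sz_rule(n)) for n in cardinalities.values()]

# The embedding-only model concatenates every embedding, so its output width
# is the sum of the individual embedding widths.
n_emb = sum(nf for _, nf in emb_szs)
print(emb_szs, n_emb)
```

That summed width is the number the tabular body contributes when sizing the shared head.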

  4. Now we need our vision model. Both of these models can be thought of as “bodies”, and we’ll make one head for both of them. So for our Vision model, we’ll call create_body(resnet50) and this is the body of our model
  5. Now we get to the meat and potatoes. We have two bodies at this point, and we need to make them into one cohesive model. The first thing we want to do is concatenate their outputs before passing them to some head. But how do we calculate the size of that? We’ll take both our models and call num_features_model(model). For instance, a resnet50 will have 2048. We’ll pretend our other model has an output of 224. As a result, post concatenation we can presume the size would be 2048+224
  6. Now we can call create_head(2048+224, num_classes) to create our head. Finally, we need to define a model. This model should accept both of our bodies as an input, build a head, and then in the forward function take care of everything:
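class MultiModalModel(Module):
    def __init__(self, tab_body, vis_body, c):
        self.tab, self.vis = tab_body, vis_body
        # Combined feature count of the two bodies
        nf = num_features_model(self.tab) + num_features_model(self.vis)
        # nf*2 assumes the head starts with concat pooling (avg + max), which doubles
        # the features; some fastai versions double inside create_head, so check yours
        self.head = create_head(nf*2, c)

    def forward(self, *x):
        # Inputs arrive as (categorical, continuous, image)
        cat, cont, vis = x
        tab_out = self.tab(cat, cont)
        vis_out = self.vis(vis)
        # Join the two bodies' outputs along the feature dimension
        y = torch.cat((tab_out,vis_out), dim=1)
        y = self.head(y)
        return y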

And now we have a model that can train based on our inputs!

Now of course if you wanted to use transfer learning and differential learning rates on that resnet, your splitter should split based on the layer names (self.vis vs everything else)
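
The idea of splitting on layer names can be sketched without any framework. The parameter names below are hypothetical stand-ins for what model.named_parameters() would yield on a module with tab, vis and head attributes, and split_by_prefix is an illustrative helper, not a fastai function. A real fastai splitter would return groups of the parameters themselves rather than their names.

```python
def split_by_prefix(param_names, prefix="vis."):
    """Partition names into [pretrained vision body, everything else]."""
    body = [n for n in param_names if n.startswith(prefix)]
    rest = [n for n in param_names if not n.startswith(prefix)]
    return [body, rest]

# Hypothetical parameter names for a model with .tab, .vis and .head submodules.
names = ["tab.embeds.0.weight", "vis.0.weight", "vis.1.weight", "head.2.weight"]
groups = split_by_prefix(names)
print(groups)
```

With two groups like these, the first (the pretrained resnet) would typically get the lower learning rate and the second (embeddings plus head) the higher one.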

This help? :slight_smile:

Hi all. Just finished up the Bayesian optimisation lecture. It seems almost too good to be true in terms of hyperparameter tuning, almost like a free lunch.

Q: Is there any reason why I wouldn’t want to do it to find my optimal hyperparameters?

For tabular, sure, it’s quick. But with other applications it can take hours to days to finish and find the optimum. This is why we have the LR finder, etc.

Ah gotcha @muellerzr, thank you. I was reading and saw it could suffer from long runtimes but wasn’t sure what types of problems it would struggle with.

I’ve been playing around with the Kaggle house prices tabular dataset and have gotten as far as the test dataloader, but I’m having issues

dl = learn.dls.test_dl(test)

The error
AssertionError: nan values in BsmtFinSF1 but not in setup training set

After reading around the forum I understand that my test data has missing data in that column but the training one doesn’t. Is there a way to process my test data to accommodate for this difference between the two datasets?
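
To see up front which columns would trigger this assertion, a minimal pure-Python sketch like the following lists columns that have missing values in the test data but none in the training data. The tiny column dictionaries are made-up stand-ins for the real DataFrames; with pandas the same check is roughly test.isna().any() & ~train.isna().any().

```python
import math

def is_missing(v):
    """True for None or a float NaN."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def cols_missing_only_in_test(train, test):
    """Columns that have missing values in `test` but none in `train`."""
    return [c for c in test
            if c in train
            and any(is_missing(v) for v in test[c])
            and not any(is_missing(v) for v in train[c])]

# Made-up data: column name -> list of values.
train = {"BsmtFinSF1": [706.0, 978.0, 486.0], "LotFrontage": [65.0, math.nan, 68.0]}
test = {"BsmtFinSF1": [468.0, math.nan], "LotFrontage": [80.0, math.nan]}

problem_cols = cols_missing_only_in_test(train, test)
print(problem_cols)
```

Only columns returned by this check need special handling, since columns with missing values in both sets were already seen during setup.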

I was thinking of adding a row of blanks to the training set so that when creating the TabularPandas with the training set it would apply the preprocessing.

Any thoughts/solutions?

NOTE: I haven’t included the other code as it’s almost identical to the tabular examples from the lectures.

No, that won’t work well in my experience. If the model uses a value and it’s not there, it can’t make sense of the data, so you cannot use that particular input value. If it’s feature engineered, you need to derive this feature in your test data as well

Ok thanks Zach. So the options are…

  • “unengineer”
  • Drop these rows from the test data (dataset and volume of issues dependent)
  • Fill them with appropriate method e.g. Mode of a column
  • Consider the feature(s) importance which could lead to the decision of dropping them
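
As a minimal sketch of the "fill them" option, assuming a numeric column and using the training median as the fill value (the data and the helper name fill_with_train_median are made up for illustration):

```python
import math
import statistics

def fill_with_train_median(train_col, test_col):
    """Fill missing test values with the training median and report which rows were filled."""
    observed = [v for v in train_col if not math.isnan(v)]
    median = statistics.median(observed)
    was_missing = [math.isnan(v) for v in test_col]
    filled = [median if m else v for v, m in zip(test_col, was_missing)]
    return filled, was_missing

# Made-up values for a single continuous column.
train_col = [706.0, 978.0, 486.0]
test_col = [468.0, math.nan, 1200.0]

filled, was_missing = fill_with_train_median(train_col, test_col)
print(filled, was_missing)
```

The fill value comes from the training data so that the test set is processed with the same statistics the model saw. The was_missing flags are returned only for inspection, since a model trained without a matching indicator column has no input to receive them.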

Q: How does fastai treat a column of continuous values (e.g. 1.5, 2.0) when it comes to preprocessing if I fill the missing entries with the string ‘NA’?

I understand you can set the continuous names when creating the TabularPandas, but does it process the above example appropriately even if I filled it with ‘NA’? The reason I ask is that I’m considering an appropriate way (as listed above) to handle the data. Perhaps it could be filled with 0s, but then I’m thinking of the effect on the normalising that could occur.
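
As a framework-independent sketch of the concern (it does not show what fastai itself does with such a column): a string placeholder cannot take part in numeric statistics, so it has to be turned into a real missing value first, and filling with 0 shifts the mean that normalisation would use. The helper to_float_or_nan and the values are made up for illustration.

```python
import math
import statistics

def to_float_or_nan(value):
    """Parse a raw cell as a float, mapping unparseable placeholders to NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

# A continuous column where one missing entry was filled with the string "NA".
raw = [1.5, 2.0, "NA", 2.5]
values = [to_float_or_nan(v) for v in raw]

# Mean over the observed values only.
observed = [v for v in values if not math.isnan(v)]
mean_observed = statistics.mean(observed)

# Mean after filling the missing entry with 0 instead.
mean_zero_filled = statistics.mean([0.0 if math.isnan(v) else v for v in values])

print(mean_observed, mean_zero_filled)
```

The two means differ, which is the normalisation effect to weigh when choosing between 0 and a statistic such as the median or mode.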

So much to think about! :smiley:

Hi,
How do you get the fastai version? And how do you get the fastcore version? Also how would you upgrade/downgrade them? What can I run on the terminal for that?

pip show fastai

pip show fastcore

pip install fastai --upgrade (for upgrading)
pip install fastai==someversion (for downgrading/installing a specific version)
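
If you are inside a notebook rather than a terminal, a small sketch using the standard library's importlib.metadata (Python 3.8+) reports the same version information. The helper name installed_version is just for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

for name in ("fastai", "fastcore"):
    print(name, installed_version(name))
```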
