TabularData - Mixup

@Pak, a quick search led me here: https://arxiv.org/abs/1905.08941

They describe utilizing mixup on the sentence and word embeddings. I need to read through this to get an understanding of what’s going on as well; perhaps we can find the answer together? (Unless sgugger pops his head in, as he stated he got it working for NLP.)

So let’s sum up what to try (let’s imagine it’s the Rossmann data for simplicity; let’s also assume we apply 50/50 mixup):

  • we get 2 rows from our tabular data (with the 2 corresponding logs of the dependent variable)
  • we feed our categorical variables forward through the embedding layer (and get 2 sets of embedding outputs)
  • we mixup (blend) these outputs with each other (averaging the corresponding numbers in our 50/50 case)
  • we mixup the continuous variables with each other (also averaging here)
  • we feed forward through the rest of the layers with these values
  • and what should we get as our final output? the average of the (logs of the) 2 dependent variables

Do I understand the Mixup idea correctly? (A sketch of these steps follows below.)
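A minimal sketch of those steps as one helper (hypothetical code; it assumes the two embedding outputs, continuous values and log-targets are already tensors, and lam=0.5 gives the 50/50 case):

    def mixup_rows(emb1, emb2, cont1, cont2, y1, y2, lam=0.5):
        # emb1/emb2: embedding-layer outputs for the categorical variables of each row
        # cont1/cont2: the continuous variables of each row
        # y1/y2: the (log of the) dependent variable of each row
        emb_mix = lam * emb1 + (1. - lam) * emb2      # blend the embedding outputs
        cont_mix = lam * cont1 + (1. - lam) * cont2   # blend the continuous variables
        y_mix = lam * y1 + (1. - lam) * y2            # blend the targets
        return emb_mix, cont_mix, y_mix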

Thanks Zach for this. I will take a look and let you know if I get stuck somewhere…

@Pak I’d look at the loss function for regular mixup. Essentially it’s a blend (30% class x, 70% class y).
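For reference, the standard mixup formulation blends the losses against both original targets rather than creating soft labels; a minimal sketch (generic PyTorch, not fastai’s exact callback):

    import torch.nn.functional as F

    def mixup_loss(pred, y1, y2, lam):
        # lam = 0.3 corresponds to "30% class x, 70% class y"
        return lam * F.cross_entropy(pred, y1) + (1. - lam) * F.cross_entropy(pred, y2)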

@pak - thanks for summarizing this. However, in my view, mixup should only happen for the categorical embeddings and not for the continuous variables. That’s what Sylvain’s message also states (the link I gave above).

Yes, but we have a continuous dependent variable in the Rossmann case.
Now I see that we can treat the 2 continuous targets as classes in terms of the loss function.

Ok, but what values should we take for the continuous variables then? I think if we take row1’s continuous variables and a 50/50 blend of the categorical variables, we should not target a 50/50 blend of the dependent variable (as there will be more data from row1 than from row2 in this case).
And after all, blending continuous variables intuitively makes more sense to me: the average of, for example, 2 distances has a meaning we can understand, which is much harder for the average of 2 tensors (embeddings).

My notebook focuses on the classification aspect, not a regression-based aspect, so I can’t quite comment on the best practice for that. However, in terms of classification, the plan of attack is something like the following:

  • Get two embedding outputs and “blend” (like how it is right now)
  • Blend the continuous variables together by averaging them
  • The same output as regular mixup classification (80% x, 20% y)

For regression, I think we would need to play around with taking the mean, keeping both y’s, etc., to see what would really work best (see the sketch below). But mixup was originally intended and used for classification-based problems. Does this help @Pak?
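The two obvious candidates for regression might look like this (a sketch assuming MSE loss; with MSE the two options differ only by a term that is constant in the prediction, so the gradients match):

    import torch.nn.functional as F

    def regression_mixup_loss(pred, y1, y2, lam, blend_targets=True):
        if blend_targets:
            # option 1: blend the two y's into a single target
            return F.mse_loss(pred, lam * y1 + (1. - lam) * y2)
        # option 2: keep both y's and blend the two losses
        return lam * F.mse_loss(pred, y1) + (1. - lam) * F.mse_loss(pred, y2)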

@pak see this paper here:

They found that mixup improved the accuracy for four of the six datasets.

Thanks, Zachary.
I’ll look into it.

By the way, I think we actually can use mixup as a callback for tabular data. We just have to think about the model in a different way.
If we split our model into 2 parts, embeddings + rest_of_the_model, then we can use the second part as the model and just shift its input (in a dataloader or callback). We pass our initial data through the embedding layer and then blend the result; that becomes our input. I think it’s fair to call the second part (rest_of_the_model) ‘the model’ in this case, as only this part can be trained (I cannot think of a way to train the embeddings as well under mixup), and the feedforward pass through the embeddings is now just a part of the preprocessing step (see the sketch below).
Definitely, first of all we have to train our model in the normal way, as we want to produce our embeddings. Then we can take the_rest_of_the_model and retrain it, or throw it away and use only the embeddings (with a new the_rest_of_the_model) for mixup training.
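A rough sketch of that split (it assumes a trained fastai v1 TabularModel whose embeddings live in model.embeds and whose fully connected part lives in model.layers; those attribute names are an assumption and may differ between versions, and batchnorm/dropout on the embedding output are omitted):

    import torch
    import torch.nn as nn

    class RestOfModel(nn.Module):
        # the trainable part: the embedding pass becomes preprocessing
        def __init__(self, trained):
            super().__init__()
            self.embeds = trained.embeds          # frozen, reused embedding layers
            for p in self.embeds.parameters():
                p.requires_grad = False
            self.layers = trained.layers          # "the model" that we actually train

        def preprocess(self, x_cat, x_cont):
            # categorical codes -> embedding floats, concatenated with the cont vars
            emb = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
            return torch.cat([emb, x_cont], dim=1)

        def forward(self, x):
            # x is already embedded (and possibly mixed up in the dataloader/callback)
            return self.layers(x)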

Mixup didn’t help much in my case. The validation error stayed pretty much the same (even a bit worse, though maybe it could get slightly better after some fine-tuning).
Here’s what I did:
First of all, I preprocessed the data (normalized, categorized and filled missing values where needed). Then I took a model and fed all my dataframes (train and validation sets) through the embedding layers only (so I got embedding outputs for all the data). Then I concatenated these values with the continuous values. At this point I had a bunch of floats for each row of data (categorical values numericalized, continuous as is). This is what our NN (apart from the embeddings) really gets as input. Then I blended (interpolated) all the values (all independent variables are floats now, and my dependent variable is a number, not a class).

    import numpy as np

    def interp(var1, var2, alpha):
        # sample the mixing coefficient from a Beta(alpha, alpha) distribution
        lam = np.random.beta(alpha, alpha)
        return lam * var1 + (1. - lam) * var2
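Usage might look like this (hypothetical X/y arrays; note that one lam has to be shared between the inputs and the target of a given pair, so here it is sampled once per batch rather than inside interp):

    import numpy as np

    idx = np.random.permutation(len(X))     # pair each row with a random partner
    lam = np.random.beta(0.4, 0.4)          # e.g. alpha = 0.4
    X_mix = lam * X + (1. - lam) * X[idx]
    y_mix = lam * y + (1. - lam) * y[idx]   # same lam as for the inputs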

Now I can pretend that these floats are just a bunch of continuous values and train the NN in the normal way (without preprocessing!).
The last step is to validate the model with the validation set (fed through the embeddings as well).
I already had a function that does all the preprocessing and outputs the ‘real model input’. I made it for the Random Forest with embeddings case (RF vs NN) in https://github.com/Pak911/fastai-shared-notebooks/blob/master/interpret_tabular.ipynb

And… as I said, I got an error slightly worse than the initial model (trained in the normal way).
For clarity: my dataset consists of only about 10,000 rows.
It has 53 categorical and 26 continuous variables.
And my task is regression, not classification.

Thanks @Pak! I’m very interested in trying this for classification too to see if it helps. I have a few different datasets to try with various class counts (17/78) that contain only categorical variables, plus a few that have both. (Although it seems it wasn’t pushed to GitHub quite yet?)

No, I didn’t put it on GitHub because a) the code is quite messy, I just wanted to test the theory in a quick and dirty way :slight_smile: b) it was done with my dataset, which I cannot open up for now.
But if you are interested, I can reimplement it with something more common, Rossmann for example, and post it on GitHub.

I would very much appreciate that :slight_smile: or even just the “messy” code too.

I recently discovered a thesis about tabular data and deep learning that used the fastai library. I’m working on recreating everything they did, as it’s full of fascinating things. If you’re interested I’ll link it. Mixup was included along with a number of other ideas.

(Sadly while they included ‘source code’ the code for some of the experiments was not there)

I’ve made my Rossmann mixup version available on GitHub here.
I did not fix the messiness :slight_smile:
In fact it is so messy (and uses lists, not iterators) that I had to use only 30% of the data to fit the augmentation in my memory.
I hope your experiments go better than mine and we’ll be able to use mixup (or other augmentation techniques) on tabular data :slight_smile:

Thanks Pak!!! I’ll look at this immediately and report back any findings :slight_smile:

Thanks for sharing this.

I have one generic question:

In the tabular learner, we have two modules (like the ones shown in the pic):

  1. Categorical embeddings
  2. Network of Dense layers

My question is: if I pull out or extract the representations from layer 4 here using hooks, i.e. the second-to-last layer before the final layer, will the learning of the embeddings also be captured in this layer? Basically, I want to use hooks to extract the learning in the second-to-last layer and run an RF / GBM on that to see if the overall performance can be improved. I want to make sure that while I extract the learning from this layer, I don’t miss the categorical embeddings.

Hi. Yes, if you put a hook on layer 4 you will get the resulting activations that have gone through all the layers before it, including the embedding layer (I’ve used this method for self-checking when I used learned NN embeddings in an RF).
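A minimal sketch of that extraction with a plain PyTorch forward hook (hypothetical model with a layers list, as in a fastai tabular model; model, x_cat and x_cont are assumed to exist):

    acts = {}

    def save_acts(module, inp, out):
        # stores the activations coming out of the hooked layer
        acts['layer4'] = out.detach()

    handle = model.layers[4].register_forward_hook(save_acts)
    preds = model(x_cat, x_cont)    # the forward pass fills acts['layer4']
    handle.remove()
    # acts['layer4'] already includes the effect of the categorical embeddings
    # and can be fed to an RF / GBM as features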

I just wrote an implementation of manifold mixup that should work out of the box on tabular data for both regression and classification (and also lets you inject the mixup at particular places if you want to):
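For context, the core of manifold mixup is to apply the blend at a randomly chosen hidden layer instead of at the input; a toy sketch of the idea (not the linked implementation):

    import random
    import torch.nn as nn

    class ManifoldMixupNet(nn.Module):
        def __init__(self, blocks):
            super().__init__()
            self.blocks = nn.ModuleList(blocks)

        def forward(self, x1, x2, lam):
            k = random.randrange(len(self.blocks) + 1)   # layer at which to mix
            for b in self.blocks[:k]:
                x1, x2 = b(x1), b(x2)
            h = lam * x1 + (1. - lam) * x2               # mix the hidden states
            for b in self.blocks[k:]:
                h = b(h)
            return h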
