TabularData - Mixup

abhikjha · August 6, 2019, 6:46pm

Hi

I am trying to implement Mixup in TabularLearner.

Here is the code I wrote:

from fastai.callbacks import *
learn = tabular_learner(data, layers=[1000,500], metrics=accuracy, ps=[0.3,0.2], callback_fns=[MixUpCallback],emb_drop=0.04)

But the error I am getting is as follows:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-36-399ce5aa3598> in <module>
----> 1 learn.lr_find()
      2 learn.recorder.plot(suggestion=True)

/opt/conda/lib/python3.6/site-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, wd)
     30     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     31     epochs = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 32     learn.fit(epochs, start_lr, callbacks=[cb], wd=wd)
     33 
     34 def to_fp16(learn:Learner, loss_scale:float=None, max_noskip:int=1000, dynamic:bool=True, clip:float=None,

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    197         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
    198         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 199         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    200 
    201     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     98             cb_handler.on_epoch_begin()
     99             for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
--> 100                 xb, yb = cb_handler.on_batch_begin(xb, yb)
    101                 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
    102                 if cb_handler.on_batch_end(loss): break

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in on_batch_begin(self, xb, yb, train)
    277         self.state_dict.update(dict(last_input=xb, last_target=yb, train=train, 
    278             stop_epoch=False, skip_step=False, skip_zero=False, skip_bwd=False))
--> 279         self('batch_begin', mets = not self.state_dict['train'])
    280         return self.state_dict['last_input'], self.state_dict['last_target']
    281 

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    249         if call_mets:
    250             for met in self.metrics: self._call_and_update(met, cb_name, **kwargs)
--> 251         for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
    252 
    253     def set_dl(self, dl:DataLoader):

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in _call_and_update(self, cb, cb_name, **kwargs)
    239     def _call_and_update(self, cb, cb_name, **kwargs)->None:
    240         "Call `cb_name` on `cb` and update the inner state."
--> 241         new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
    242         for k,v in new.items():
    243             if k not in self.state_dict:

/opt/conda/lib/python3.6/site-packages/fastai/callbacks/mixup.py in on_batch_begin(self, last_input, last_target, train, **kwargs)
     18         lambd = np.random.beta(self.alpha, self.alpha, last_target.size(0))
     19         lambd = np.concatenate([lambd[:,None], 1-lambd[:,None]], 1).max(1)
---> 20         lambd = last_input.new(lambd)
     21         shuffle = torch.randperm(last_target.size(0)).to(last_input.device)
     22         x1, y1 = last_input[shuffle], last_target[shuffle]

**AttributeError: 'list' object has no attribute 'new'**

Am I doing something I should not, or is there a problem in the code/implementation?

muellerzr · August 6, 2019, 6:49pm

Mixup is used for image data, not tabular. See the docs here: mixup

abhikjha · August 6, 2019, 6:53pm

Hey Zach

So nice to see your prompt reply.

I am referring to @sgugger's reply here:

muellerzr · August 6, 2019, 6:56pm

Interesting!!! I realized my mistake here, my apologies. Most likely some updates are needed specifically for the categorical embeddings, is what I’m thinking…

For tabular models, the data is stored in three arrays (hence the list), so a modification would be needed to go through each.
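
To make the "go through each" idea concrete, here is a minimal sketch, assuming the batch input is a plain Python list of arrays (the exact number of arrays does not matter). NumPy arrays stand in for tensors, and `shuffle_each` is a hypothetical helper, not part of fastai:

```python
import numpy as np

def shuffle_each(inputs, perm):
    """Apply the same row permutation to every array in a list of inputs."""
    return [x[perm] for x in inputs]

rng = np.random.default_rng(0)
x_cat = np.array([[0, 2], [1, 0], [2, 1], [0, 1]])   # integer category codes
x_cont = rng.normal(size=(4, 3))                      # continuous features
batch = [x_cat, x_cont]                               # a list, not a single array

# A plain list has no tensor methods, which is what the traceback complains about.
print(hasattr(batch, "new"))   # prints False

perm = rng.permutation(len(x_cat))
shuffled = shuffle_each(batch, perm)                  # every array reordered consistently
```

The stock callback calls a tensor method on the whole input, so anything list-shaped has to be unpacked like this first.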

abhikjha · August 6, 2019, 7:00pm

No need for apologies

In CNNs this technique is so useful that it definitely should be implemented for tabular models too. It could prove to be a fabulous augmentation technique for tabular models. Hope @sgugger finds some time to educate us.

muellerzr · August 6, 2019, 7:18pm

Here is a somewhat working version. I’m unsure if this is quite what @sgugger means by shuffling the matrix, but he can let me know:

# Assumes fastai v1; MixUpLoss lives in the same module as the stock MixUpCallback.
from fastai.tabular import *
from fastai.callbacks.mixup import MixUpLoss

class TabMixUpCallback(LearnerCallback):
    "Callback that creates the mixed-up input and target."
    def __init__(self, learn:Learner, alpha:float=0.3, stack_x:bool=False, stack_y:bool=True):
        super().__init__(learn)
        # stack_x is kept for parity with MixUpCallback but is not used below
        self.alpha,self.stack_x,self.stack_y = alpha,stack_x,stack_y

    def on_train_begin(self, **kwargs):
        # Wrap the loss so it can unpack the stacked (y, shuffled y, lambda) target
        if self.stack_y: self.learn.loss_func = MixUpLoss(self.learn.loss_func)

    def on_batch_begin(self, last_input, last_target, train, **kwargs):
        "Applies mixup to `last_input` and `last_target` if `train`."
        if not train: return
        new_input = []
        # One mixing weight per row, folded so that lambda >= 0.5
        lambd_gnd = np.random.beta(self.alpha, self.alpha, last_target.size(0))
        lambd_gnd = np.concatenate([lambd_gnd[:,None], 1-lambd_gnd[:,None]], 1).max(1)

        # Random pairing of each row with another row of the batch
        shuffle = torch.randperm(last_target.size(0)).to(last_input[0].device)
        y1 = last_target[shuffle]

        lambd = last_input[0].new(lambd_gnd)
        x1 = last_input[0][shuffle]
        out_shape = [lambd.size(0)] + [1 for _ in range(len(x1.shape) - 1)]
        lambd = tensor(lambd)
        # Only the first tensor of the list is blended; the second is passed through unchanged
        new_input.append((last_input[0] * lambd.view(out_shape) + x1 * (1-lambd).view(out_shape)))
        new_input.append(last_input[1])
        new_target = torch.cat([last_target[:,None].float(), y1[:,None].float(), lambd[:,None].float()], 1)
        return {'last_input': new_input, 'last_target': new_target}

    def on_train_end(self, **kwargs):
        # Restore the original loss function
        if self.stack_y: self.learn.loss_func = self.learn.loss_func.get_old()

However, I did notice a loss in accuracy, and not much improvement (if any), on some quick tests (Rossmann bucketed and Adults).

KarlH · August 6, 2019, 9:21pm

I don’t think you can do tabular mixup via a callback like it’s done for images. The reason is your tabular data likely contains categorical information stored as integer codes that are put through an embedding. Using mixup to interpolate between categorical codes doesn’t really make sense, and wouldn’t be compatible with integer indexing into an embedding.

You would have to implement mixup in the forward pass of your model after your input data is fully vectorized.
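
A rough sketch of what that could look like, with NumPy standing in for PyTorch so it runs on its own. `TinyTabularNet`, its single categorical column, and its one linear layer are all made up for illustration; in fastai the same blend would sit inside the model's `forward`, after the embedding lookups:

```python
import numpy as np

class TinyTabularNet:
    """Toy stand-in for a tabular model: one embedding table plus one linear layer."""

    def __init__(self, n_codes, emb_dim, n_cont, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.emb = rng.normal(size=(n_codes, emb_dim))        # embedding table
        self.w = rng.normal(size=(emb_dim + n_cont, n_out))   # linear head

    def vectorize(self, x_cat, x_cont):
        # Integer codes are only used for indexing; they are never interpolated.
        return np.concatenate([self.emb[x_cat], x_cont], axis=1)

    def forward(self, x_cat, x_cont, lam=None, perm=None):
        h = self.vectorize(x_cat, x_cont)
        if lam is not None:
            # Mixup on the fully vectorized rows (embeddings + continuous columns).
            lam = lam[:, None]
            h = lam * h + (1 - lam) * h[perm]
        return h @ self.w

rng = np.random.default_rng(1)
x_cat = np.array([0, 1, 2, 1])            # one categorical column, 4 rows
x_cont = rng.normal(size=(4, 2))
lam = rng.beta(0.4, 0.4, size=4)
lam = np.maximum(lam, 1 - lam)            # keep the original row dominant
perm = rng.permutation(4)

net = TinyTabularNet(n_codes=3, emb_dim=5, n_cont=2, n_out=2)
out = net.forward(x_cat, x_cont, lam, perm)
```

The loss would still need both sets of targets and `lam`, so the pairing (`perm`) has to be shared between the model and the loss.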

muellerzr · August 7, 2019, 2:02am

@KarlH, thank you for that insight! You were correct: I had to attach it to the forward pass, and then I could use the callback (kinda). For anyone who wants to test it, it seems to be operating as it should. @abhikjha and KarlH, my repo is here with my notebook and the new tabular model etc: Github

Let me know if you see any major issues with it. I did not see any improvement on either the Adults or the Rossmann problems.

Pak · August 7, 2019, 9:42am

Hi.
Sorry for a stupid question, but can you please explain to me the general idea of how mixup should work on tabular data?
As I understand it, the main principle behind mixup is this: we take, for example, a picture interpolated with another picture, and the model should rank the 2 correct classes (one for each of the two pictures) higher than the others (and maybe we also get much more data, as you now have combinatorics on your side). Ok.
But I can hardly understand how it can work in NLP (a language model), where you have to predict the next word. What do you feed your network? A linear interpolation of the numerical representations of 2 sentences, expecting 2 specific words (one from the 1st and one from the 2nd sentence) as the output (the 2 most probable words)? Numerical interpolation of a sentence doesn’t feel like it represents something meaningful.
And tabular data feels like a very similar case in this sense.
On the other hand, if there is proof that it does work for NLP, it feels like it should work for tabular data as well. Do you know of any successful examples of using it in NLP (language models)?

muellerzr · August 7, 2019, 12:09pm

@Pak, a quick search led me here: https://arxiv.org/abs/1905.08941

They describe utilizing mixup on the sentence and word embeddings. I need to read through this too to understand what’s going on; perhaps we can find the answer together? (Unless sgugger pops his head in, as he stated he got it working for NLP.)

Pak · August 7, 2019, 1:46pm

So let’s sum up what to try (let’s imagine it’s the Rossmann data for simplicity, and let’s also assume we apply a 50/50 mixup):

1. We take 2 rows from our tabular data (with the 2 corresponding logs of the dependent variable).
2. We feed our categorical variables forward through the embedding layers (and get 2 sets of embedding outputs).
3. We mix up (blend) these outputs with each other (averaging corresponding numbers in our 50/50 case).
4. We mix up the continuous variables with each other (also averaging here).
5. We feed these values forward through the rest of the layers.
6. And what should we get as our final output? The average of the (logs of the) 2 dependent variables.

Do I understand the mixup idea correctly?
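
Written out for two rows, the proposal above might look like the sketch below. All numbers, the single categorical column, and the `emb_table` values are made up for illustration; this only shows the arithmetic of the 50/50 case, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_table = rng.normal(size=(5, 3))      # hypothetical embedding table: 5 categories, 3 dims

# Two rows: one categorical code, two continuous values and a log-target each.
cat = np.array([1, 4])
cont = np.array([[0.5, 2.0], [1.5, 4.0]])
log_y = np.array([8.0, 9.0])

lam = 0.5                                # the 50/50 case
emb = emb_table[cat]                     # embed each row's category

mixed_emb = lam * emb[0] + (1 - lam) * emb[1]            # blend the embedding outputs
mixed_cont = lam * cont[0] + (1 - lam) * cont[1]         # blend the continuous values
mixed_input = np.concatenate([mixed_emb, mixed_cont])    # what the remaining layers would see
mixed_target = lam * log_y[0] + (1 - lam) * log_y[1]     # average of the two log-targets
```

Here `mixed_cont` comes out as `[1.0, 3.0]` and `mixed_target` as `8.5`.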

abhikjha · August 7, 2019, 2:18pm

Thanks Zach for this. I will see that and let you know if I get stuck at some place…

muellerzr · August 7, 2019, 2:32pm

@Pak I’d look at the loss function for regular mixup. Essentially it’s a blend (30% class x, 70% class y)
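
As a rough illustration of that blend (a NumPy sketch of the idea, not fastai's actual `MixUpLoss` code): the prediction is scored against both labels and the two losses are weighted by `lam` and `1 - lam`. The helper names here are made up:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Per-row cross-entropy from raw logits and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

def mixup_loss(logits, y_a, y_b, lam):
    """Blend the losses against both labels with per-row weights lam and 1 - lam."""
    blended = lam * cross_entropy(logits, y_a) + (1 - lam) * cross_entropy(logits, y_b)
    return blended.mean()

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
y_a = np.array([0, 2])          # labels of the original rows
y_b = np.array([1, 0])          # labels of the rows they were mixed with
lam = np.array([0.3, 0.3])      # 30% weight on y_a, 70% on y_b

loss = mixup_loss(logits, y_a, y_b, lam)
```

When both labels are the same, or when `lam` is 1, this reduces to the ordinary cross-entropy.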

abhikjha · August 7, 2019, 2:33pm

@pak - thanks for summarizing this. However, in my view, mixup should only happen for categorical embeddings and not for continuous variables. That’s what Sylvain’s message also states (the link I gave above).

Pak · August 7, 2019, 2:35pm

~~Yes, but we have a continuous dependent variable in the Rossmann case.~~
Now I see that we can treat the 2 continuous variables as classes in terms of the loss function.

Pak · August 7, 2019, 2:43pm

Ok, but what values should we take for the continuous variables then? If we take row 1’s continuous variables and a 50/50 blend of the categorical variables, I think we should not expect a 50/50 blend of the dependent variable (as there will be more data from row 1 than from row 2 in this case).
And after all, blending continuous variables intuitively makes more sense to me: the average of, for example, 2 distances has a meaning we can understand, which is much harder for the average of 2 tensors (embeddings).

muellerzr · August 7, 2019, 8:42pm

My notebook focuses on classification, not regression, so I can’t quite comment on the best practice for that. However, for classification, the plan of attack is something like the following:

1. Get two embedding outputs and “blend” them (like how it is right now)
2. Blend the continuous variables together by averaging them
3. Produce the same output as regular mixup classification (80% x, 20% y)

For regression, I think we would need to play around with the mean, having just two y’s, etc. to see what works best. But mixup was originally intended and used for classification problems. Does this help @Pak?
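
One small check on the regression question, assuming a squared-error loss (the thread does not settle on a loss, so this is only one case): using the averaged target and blending two separate losses differ by a term that does not depend on the prediction, so both give the same gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.normal(size=6)
y_a, y_b = rng.normal(size=6), rng.normal(size=6)
lam = rng.uniform(0.5, 1.0, size=6)

# Option 1: one squared error against the averaged ("mean") target.
loss_mean_target = (pred - (lam * y_a + (1 - lam) * y_b)) ** 2

# Option 2: keep both targets and blend the two squared errors.
loss_two_targets = lam * (pred - y_a) ** 2 + (1 - lam) * (pred - y_b) ** 2

# The gap is lam * (1 - lam) * (y_a - y_b)^2, which is constant in pred.
gap = loss_two_targets - loss_mean_target
```

For other losses (absolute error, for example) the two options are not equivalent in this way, so they would have to be compared empirically.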

muellerzr · August 8, 2019, 1:01am

@pak see this paper here:

They found that mixup improved the accuracy for four of the six datasets.

Pak · August 8, 2019, 8:20am

Thanx, Zachary.
I’ll look into it

Pak · August 8, 2019, 9:52am

By the way, I think we actually can use mixup as a callback for tabular data. We just have to think about the model in a different way.
If we split our model into 2 parts, embedding + rest_of_the_model, then we can use the second part as the model and just change its input (in the dataloader or a callback). We pass our initial data through the embedding layers and then blend the result. That will be our input. I think it’s fair in this case to call the second part (rest_of_the_model) ‘the model’, as only this part can be trained (I cannot think of a way to train the embeddings as well in a mixup). The feedforward through the embeddings is now just part of the preprocessing step.
Definitely, first of all we have to train the model in the normal way, as we want to produce our embeddings. Then we can keep rest_of_the_model and retrain it, or throw it away and use only the embeddings (with a new rest_of_the_model) for mixup training.
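
A minimal sketch of that split, with NumPy arrays standing in for tensors. The embedding tables here are random placeholders for embeddings taken from a normally trained model, and `vectorize` and `mixup_batch` are hypothetical helper names:

```python
import numpy as np

def vectorize(x_cat, x_cont, emb_tables):
    """Preprocessing step: look up frozen embeddings and append the continuous columns."""
    embs = [table[x_cat[:, i]] for i, table in enumerate(emb_tables)]
    return np.concatenate(embs + [x_cont], axis=1)

def mixup_batch(x, y, alpha, rng):
    """Image-style mixup on a dense batch; returns mixed inputs, both targets and lam."""
    lam = rng.beta(alpha, alpha, size=len(x))
    lam = np.maximum(lam, 1 - lam)           # keep the original row dominant
    perm = rng.permutation(len(x))
    x_mixed = lam[:, None] * x + (1 - lam[:, None]) * x[perm]
    return x_mixed, y, y[perm], lam

rng = np.random.default_rng(0)
emb_tables = [rng.normal(size=(4, 3)), rng.normal(size=(6, 2))]   # placeholder "pretrained" embeddings
x_cat = np.array([[0, 5], [3, 1], [2, 2]])   # two categorical columns
x_cont = rng.normal(size=(3, 2))
y = np.array([0, 1, 1])

x_dense = vectorize(x_cat, x_cont, emb_tables)       # shape (3, 3 + 2 + 2)
x_mixed, y_a, y_b, lam = mixup_batch(x_dense, y, alpha=0.4, rng=rng)
```

With the input already dense, the second part of the network can be trained like any other model on `x_mixed`, with the loss blended over `y_a` and `y_b`.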