Porto Seguro Winning Solution -- Representation learning


(Louis Guthmann) #1

Hi,

The Porto Seguro competition has just ended, and the winning solution involved a very interesting unsupervised approach for tabular datasets. Michael Jahrer, the winner, was also a member of the team that won the Netflix Prize in 2009.

We could implement this denoising approach in the structured API, in a way similar to LanguageModel in NLP.


(Jeremy Howard) #2

This is absolutely fascinating. He’s invented a whole new important technique - semi-supervised learning for structured data. Really really cool. Thanks for sharing!


(sergii makarevych) #3

Mind blowing reading. There is no spoon :sunglasses:

  • Everything I’ve done here end-to-end was written in C++/CUDA by myself
  • Since NIPS2016 I was able to code GANs by myself

(Kevin Bird) #4

The critical part here is to invent the noise. In tabular datasets we cannot just flip, rotate, shear like people are doing this in images. Adding gaussian or uniform additive / multiplicative noise is not optimal since features have different scale or a discrete set of values that some noise just didn’t make sense. I found a noise schema called “swap noise”. Here I sample from the feature itself with a certain probability “inputSwapNoise” in the table above. 0.15 means 15% of features replaced by values from another row. Two different topologies are used by myself. Deep stack, where the new features are the values of the activations on all hidden layers. Second, bottleneck, where one middle layer is used to grab the activations as new dataset. This DAE step usually blows the input dimensionality to 1k…10k range.

So is this basically just saying to augment the data by taking different values from the column and replacing the values with one of those values instead in 15% of the rows? So is this basically how to manipulate data when it’s not an image and nice and pixely?
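Here is a minimal NumPy sketch of how I read the "swap noise" description above. It is not the winner's code (his was C++/CUDA). The function name `swap_noise`, the per-cell masking, and the toy table are my own assumptions; the only thing taken from the quote is the 0.15 probability and the idea of replacing a value with the same column's value from another row.

```python
import numpy as np

def swap_noise(X, p=0.15, rng=None):
    """Return a copy of X where each cell is, with probability p,
    replaced by the value of the same column taken from a random row."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X)
    n_rows, n_cols = X.shape
    # Boolean mask of the cells to corrupt (about a fraction p of all cells)
    mask = rng.random(X.shape) < p
    # For every cell pick a random donor row; the column stays the same
    donor_rows = rng.integers(0, n_rows, size=X.shape)
    col_idx = np.broadcast_to(np.arange(n_cols), X.shape)
    swapped = X[donor_rows, col_idx]
    return np.where(mask, swapped, X)

# Hypothetical toy table whose columns live on very different scales
X = np.column_stack([np.arange(6), 100 * np.arange(6), np.arange(6) % 2])
X_noisy = swap_noise(X, p=0.15, rng=np.random.default_rng(0))
```

Because every replacement value comes from the same column, the noisy table only ever contains values that feature can actually take, which is the point made in the quote about scales and discrete features. Note that a donor row can happen to be the row itself, so slightly fewer than 15% of cells actually change.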


(Louis Guthmann) #5

I actually tried something similar, some kind of knowledge distillation, but I could not pull it off.
I think I introduced leakage by using too many examples from the previous model.

His solution is simply amazing :slight_smile:


(sergii makarevych) #6

Can you please share some links to articles about these autoencoders etc.? I am a complete newbie here.


(Devan Govender) #7

Agree, it is an innovative solution.

Ian Goodfellow presented a comprehensive semi-supervised GAN tutorial in the Udacity DL course. This was based on unstructured data.

I was able to convert that DCGAN-type architecture into a GAN with fully-connected layers for the Porto Seguro project but failed miserably at interpreting the output.


(Ben Eacrett) #8

Thank you so much for linking this - covers exactly what I’ve been looking for this week!


(Jeremy Howard) #9

I just did a search and couldn’t find any useful tutorials online :frowning: I’ll try to briefly cover it on Monday - remind me if I forget (cc @yinterian)


(Kerem Turgutlu) #10

Here is a useful resource for DAs (denoising autoencoders): http://deeplearning.net/tutorial/dA.html

I am now replicating this approach, and I have a question about training the autoencoder. Should we still have a training and a validation set, and choose the best representations based on the lowest validation loss?

Thanks


(Kerem Turgutlu) #11

And another doubt I have in mind is this part:

The best what I found during the past and works straight of the box is "RankGauss". It's based on rank transformation. First step is to assign a linspace to the sorted features from 0..1, then apply the inverse of error function ErfInv to shape them like gaussians, then I subtract the mean

But I get gaussian dist when I do something like this:

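import numpy as np
import pandas as pd
from scipy.special import erfinv

def to_gauss(x): return np.sqrt(2)*erfinv(x)

def normalize(data, exclude=None):
    # Only transform non-binary columns; `exclude` lists columns to leave untouched
    exclude = exclude or []
    norm_cols = [n for n, c in data.drop(columns=exclude).items() if len(np.unique(c)) > 2]
    n = data.shape[0]
    for col in norm_cols:
        # Row labels ordered by the column's value (assumes a unique index)
        sorted_idx = data[col].sort_values().index.tolist()
        uniform = np.linspace(start=-0.99, stop=0.99, num=n)
        normal = to_gauss(uniform)
        # Assignment aligns on the index, so each row gets the value for its rank
        normalized_col = pd.Series(index=sorted_idx, data=normal)
        data[col] = normalized_col
    return data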

Here is a resource for Erfinv: https://www.mathworks.com/help/matlab/ref/erfinv.html
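For comparison, here is a small single-column sketch of the RankGauss steps as described in the quote (rank, spread evenly, ErfInv, subtract the mean). It is my own reading, not Michael's implementation: the name `rank_gauss`, the `eps` margin that keeps values away from the ±1 endpoints, and the use of average ranks for ties are all assumptions.

```python
import numpy as np
from scipy.special import erfinv
from scipy.stats import rankdata

def rank_gauss(x, eps=1e-3):
    """Rank-transform a 1-D array and map it to a roughly Gaussian shape."""
    x = np.asarray(x, dtype=float)
    # Ranks 1..n (ties share their average rank), rescaled to [0, 1]
    r = (rankdata(x) - 1) / (len(x) - 1)
    # Stretch to (-1, 1), staying away from the endpoints where erfinv is infinite
    u = (2 * r - 1) * (1 - eps)
    g = np.sqrt(2) * erfinv(u)
    # Final step from the quote: subtract the mean
    return g - g.mean()

# Made-up skewed feature, just to exercise the function
skewed = np.random.default_rng(0).exponential(size=1000)
z = rank_gauss(skewed)
```

The transform only depends on the ordering of the values, so it behaves the same regardless of the original scale or skew of the column.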


(WG) #12

+1 for the Matrix reference.


(Kerem Turgutlu) #13

After training the autoencoder, say linear 221, 1500, 1500, 1500, 221, is the next step to compute the activations for each datapoint and store them as its new features?

In this case: 4500 new features -> activations stacked for each row.

Yes it is.

I have trained the autoencoder, now I am trying to extract features (stacked 4500 activations of each forward pass of input), but getting the following error. Any tips ?

Thanks

class FeatureExtractor(nn.Module):
    def __init__(self, submodule, extracted_layers):
        super().__init__()
        self.submodule = submodule
        self.extracted_layers = extracted_layers
    def forward(self, x):
        outputs = []
        for name, module in self.submodule._modules.items():
            x = module(x)
            if name in self.extracted_layers:
                outputs += [x]
        return outputs

fextractor = FeatureExtractor(da.eval(), extracted_layers=['fc0', 'fc1', 'fc2'])

# Get activations for input_
input_arr = np.array(input_)

# example datapoint for feature extraction
fextractor(torch.from_numpy(input_arr[0]).float().unsqueeze(0))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-256-36fab960cefd> in <module>()
----> 1 fextractor(torch.from_numpy(input_arr[0]).float().unsqueeze(0))

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
222 for hook in self._forward_pre_hooks.values():
223 hook(self, input)
--> 224 result = self.forward(*input, **kwargs)
225 for hook in self._forward_hooks.values():
226 hook_result = hook(self, input, result)

in forward(self, x)
7 outputs = []
8 for name, module in self.submodule._modules.items():
----> 9 x = module(x)
10 if name in self.extracted_layers:
11 outputs += [x]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
222 for hook in self._forward_pre_hooks.values():
223 hook(self, input)
--> 224 result = self.forward(*input, **kwargs)
225 for hook in self._forward_hooks.values():
226 hook_result = hook(self, input, result)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/linear.py in forward(self, input)
51
52 def forward(self, input):
--> 53 return F.linear(input, self.weight, self.bias)
54
55 def __repr__(self):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in linear(input, weight, bias)
551 if input.dim() == 2 and bias is not None:
552 # fused op is marginally faster
--> 553 return torch.addmm(bias, input, weight.t())
554
555 output = input.matmul(weight.t())

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/autograd/variable.py in addmm(cls, *args)
922 @classmethod
923 def addmm(cls, *args):
--> 924 return cls._blas(Addmm, args, False)
925
926 @classmethod

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/autograd/variable.py in _blas(cls, args, inplace)
918 else:
919 tensors = args
--> 920 return cls.apply(*(tensors + (alpha, beta, inplace)))
921
922 @classmethod

RuntimeError: save_for_backward can only save input or output tensors, but argument 0 doesn’t satisfy this condition


(Ben Eacrett) #14

I’m hoping this blog post will be decent - it seems to cover a few architectures - going to work through it this weekend.
https://blog.keras.io/building-autoencoders-in-keras.html


(Jeremy Howard) #15

Absolutely :slight_smile:


(Jeremy Howard) #16

Yes! You can use a hook: http://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html#forward-and-backward-function-hooks

See also the definition of summary() in fastai.
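A rough sketch of the hook approach, written against a current PyTorch release rather than the version in the traceback. The `da` model here is a hypothetical stand-in with the layer names `fc0`, `fc1`, `fc2` from the question, and `save_output` is a helper name I made up. As far as I can tell, the RuntimeError in the earlier post is what PyTorch of that period raised when a module was given a plain tensor instead of an autograd `Variable`; in current releases tensors can be passed directly.

```python
from collections import OrderedDict

import torch
import torch.nn as nn

# Hypothetical stand-in for the trained autoencoder `da`
da = nn.Sequential(OrderedDict([
    ('fc0', nn.Linear(221, 1500)),
    ('fc1', nn.Linear(1500, 1500)),
    ('fc2', nn.Linear(1500, 1500)),
    ('out', nn.Linear(1500, 221)),
])).eval()

layer_names = ['fc0', 'fc1', 'fc2']
activations = {}

def save_output(name):
    # Build a forward hook that stores the layer's output under `name`
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register one forward hook per layer we want to read from
handles = [getattr(da, name).register_forward_hook(save_output(name))
           for name in layer_names]

x = torch.randn(4, 221)  # 4 made-up rows with 221 features each
with torch.no_grad():
    da(x)                # the hooks fire during this forward pass

# Concatenate the three hidden activations into one long feature vector per row
features = torch.cat([activations[n] for n in layer_names], dim=1)

# Remove the hooks once the features are collected
for h in handles:
    h.remove()
```

With hooks the model's own `forward` stays untouched, so there is no need to re-implement the layer loop in a wrapper class.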


(Devesh Maheshwari) #17

For anyone looking for an autoencoder in TF: I wrote a kernel a while back to construct features using an AE, but the improvement was not very impressive, so I dropped that approach. Denoising AEs are different in the sense that we feed them noisy data and they still learn to reconstruct the clean version.
Here is the kernel post.
https://www.kaggle.com/devm2024/feature-construction-using-autoencoder-tf


(Kerem Turgutlu) #18

Thanks for the reply. I am actually changing my code right now: I will try to write a Stacked AutoEncoder class, which will eventually allow data augmentation. The user will first define the layers as, say, [20, 100, 100], where 20 is the input dim and the 100s are two layers of features to be learnt. The class will allow freezing, so that once the first layer is trained the user can call ae.train_next() and start training the next feature layer, using the previous layer's output as the new input. Augmentation will be applied to each input with a given probability p, using the input swap technique.

I think this is a better and the right way to do it.


(Jeremy Howard) #19

The existing columnar learner in fastai does nearly all of that already FYI. You can just use freeze_to.

Does the winning entry actually train it in this stacked fashion though? I haven’t seen anyone really use this approach for years, and I thought it wasn’t helpful with modern DL. But I’m curious if the winning entry actually found otherwise…


(Kerem Turgutlu) #20

Hmm, actually Michael doesn’t mention how he trains it; he just says DAE with 221 1500 1500 1500 221 with linear activations. I did some research and this was the method I could find with a good explanation, so I thought this was it, since it kind of made sense.

How should I train it? Let’s say given the same architecture with 3x1500 linear activations.

Or should I just add noise to the input, then train the 221 1500 1500 1500 221 network by backprop until the validation MSE is good enough? Then, with a forward pass, I can get the activations to create the new dataset?

This is how he explains it:

One can use train+test features to build the DAE. The larger the testset, the better :slight_smile: An autoencoder tries to reconstruct the inputs features. So features = targets. Linear output layer. Minimize MSE. A denoising autoencoder tries to reconstruct the noisy version of the features. It tries to find some representation of the data to better reconstruct the clean one.

I recommend linear activation in the middle layer of bottleneck setup because relu truncate the values <0. Yes just concat to a long feature vector. Here for a deep stack DAE 221-1500-1500-1500-221 you get new dataset with 4500 features.

It also sounds like it is as easy as running a fully connected network without any freezing…
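To make that concrete, here is a rough end-to-end PyTorch sketch of the "no freezing" reading: noisy input, clean target, MSE, then one forward pass to collect the stacked activations. Everything beyond the 221-1500-1500-1500-221 shape, the linear output layer and the MSE loss is my assumption: the class name `DAE`, ReLU in the hidden layers, plain SGD, the learning rate, swapping within a mini-batch instead of across the whole table, and the random data standing in for the train+test features.

```python
import torch
import torch.nn as nn

class DAE(nn.Module):
    def __init__(self, n_in=221, n_hidden=1500, n_layers=3):
        super().__init__()
        sizes = [n_in] + [n_hidden] * n_layers
        self.hidden = nn.ModuleList(
            [nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:])])
        self.out = nn.Linear(n_hidden, n_in)  # linear output layer

    def forward(self, x, return_features=False):
        acts = []
        for layer in self.hidden:
            x = torch.relu(layer(x))  # ReLU in hidden layers is an assumption
            acts.append(x)
        recon = self.out(x)
        if return_features:
            # Concatenate all hidden activations into one long vector per row
            return recon, torch.cat(acts, dim=1)
        return recon

def swap_noise(x, p=0.15):
    # With probability p, replace a cell by the same column from a random row
    # of this batch (swapping within the batch is a simplification)
    mask = torch.rand_like(x) < p
    donors = torch.randint(0, x.shape[0], x.shape)
    return torch.where(mask, x.gather(0, donors), x)

torch.manual_seed(0)
X = torch.randn(256, 221)  # made-up stand-in for the train+test features
model = DAE()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(2):  # a real run would need many more epochs
    for i in range(0, len(X), 128):
        clean = X[i:i + 128]
        loss = loss_fn(model(swap_noise(clean)), clean)  # noisy in, clean target
        opt.zero_grad()
        loss.backward()
        opt.step()

model.eval()
with torch.no_grad():
    _, new_features = model(X, return_features=True)  # shape (256, 4500)
```

All layers are trained together by ordinary backprop here; nothing is frozen or trained layer by layer. The 4500-column `new_features` tensor is what would then be fed to the supervised models.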

Thanks