By much better results, do you mean the ability to reproduce MNIST images from your validation set? What cost function does one normally use for that - some distance measure on pixel values?

In general, would we use an autoencoder's ability to reproduce its inputs as a way of gauging the quality of the middle-layer activations?

I saw much better results than from just noising the data, and better than from adding L1 regularization (this was used by the tutorials I have followed -> from the Keras blog). Note - I did not try batch norm.

Error - I used the same as in the tutorials -> just MSE on the reproductions. And yes, "better results" meant less error (perhaps subjective, but the reproductions also looked better visually at similar error levels).
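
For anyone wondering what "MSE on the reproductions" amounts to, here is a minimal sketch in plain Python. The function name `reconstruction_mse` and the toy "images" are made up for illustration; in practice the framework's built-in MSE loss does the same thing on tensors.

```python
def reconstruction_mse(inputs, reconstructions):
    """Mean squared error over every pixel of every image."""
    total, count = 0.0, 0
    for x, x_hat in zip(inputs, reconstructions):
        for a, b in zip(x, x_hat):
            total += (a - b) ** 2
            count += 1
    return total / count


# Two hypothetical 4-pixel "images" and their reconstructions.
originals = [[0.0, 1.0, 0.5, 0.0], [1.0, 1.0, 0.0, 0.0]]
reconstructed = [[0.1, 0.9, 0.5, 0.0], [1.0, 0.8, 0.0, 0.2]]

mse = reconstruction_mse(originals, reconstructed)
print(mse)
```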

My interpretation is that the ability to closely reproduce inputs implies that the AE has been able to learn / extract latent structure(s) in the data (which should correlate with the potential usefulness of the encodings).

As a side note, for a lot of deep-learning-based solutions, it makes sense that AEs have not retained popularity - you might expect the network you are using to extract this structure anyway (I think Jeremy commented on or alluded to this elsewhere). In the structured data case we're discussing here, perhaps we can look at the AE as doing something analogous to embeddings -> extracting a rich feature representation.

Should someone find this thread sometime down the road… here is a really neat paper that provides a nice overview of autoencoders and their applicability.

I'm taking a look at your GitHub, as this is of interest to me right now, and one thing I did notice was that the winning solution wasn't entirely linear. Only the middle layer was.

I'm curious about how the embedding model worked. That was my first instinct as well, especially since some of my categorical variables are huge categories.

Are you going to be in part II this spring? If so maybe we can work on this together.

I will be taking Part 2 starting next week as well. The DAE should have ReLU activations everywhere except the middle layer, which has a linear activation. I don't know what the main motivation behind this is. If this is not the case in my class definition, I should check that. The embedding model as it is didn't work much better than the XGBoost models shared on Kaggle.
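
To make the "ReLU everywhere except a linear middle layer" layout concrete, here is a rough forward-pass sketch in plain Python. Everything here is an assumption for illustration: the layer sizes, the random weights, the helper names (`dense`, `init_layer`, `dae_forward`), and the choice of a linear output layer. It is not the class from the repo being discussed, and there is no training step.

```python
import random


def dense(x, weights, biases, activation=None):
    """One fully connected layer; activation is 'relu' or None (linear)."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    if activation == "relu":
        out = [max(0.0, v) for v in out]
    return out


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative initialisation only)."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dae_forward(x, layers):
    """Run x through all layers; return the output and every layer's activations."""
    activations = []
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
        activations.append(x)
    return x, activations


rng = random.Random(0)
sizes = [4, 6, 6, 6, 4]             # hypothetical: input, 3 hidden layers, output
acts = ["relu", None, "relu", None]  # middle hidden layer linear; output linear (assumed)

layers = []
for n_in, n_out, act in zip(sizes[:-1], sizes[1:], acts):
    weights, biases = init_layer(n_in, n_out, rng)
    layers.append((weights, biases, act))

output, activations = dae_forward([0.5, -1.0, 2.0, 0.0], layers)
features = activations[1]  # the linear middle layer, usable as learned features
print(len(features), len(output))
```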

I think I have a working version of Porto that uses the structured data embeddings. The only component missing is how to weigh the categorical loss with the continuous.

I'm using cross_entropy per categorical, and MSE for the continuous. As a starting point, I've tried weighting them by the per-variable loss, so that MSE/cont = CE/cat, but that feels like a naive approach and is unlikely to work since the two functions are so different.

@jeremy do you have any ideas / experience with this? I'm wondering if @rachel or one of the other math wizards can provide a theoretical framework. The only discussion online about this seems to be about comparing apples to oranges and how you shouldn't do it.

I've played around with a few, but the one I'm settling on for now is MSE + MCE (Mean Cross Entropy), which is at least consistent across models of varying sizes.
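
A minimal sketch of one way to read "MSE + MCE", in plain Python for a single row. I am assuming that MCE means the cross entropy averaged over the categorical variables, and that the two terms are added with equal weight; the post does not spell out either detail, and the names `cross_entropy` and `mixed_loss` are made up here.

```python
import math


def cross_entropy(logits, target_index):
    """Cross entropy for one categorical variable, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]


def mixed_loss(cont_pred, cont_true, cat_logits, cat_true):
    """MSE over continuous columns plus mean cross entropy over categorical ones."""
    mse = sum((p - t) ** 2 for p, t in zip(cont_pred, cont_true)) / len(cont_true)
    mce = sum(cross_entropy(l, t) for l, t in zip(cat_logits, cat_true)) / len(cat_true)
    return mse + mce


# One row: two continuous columns, one 2-way and one 3-way categorical column.
loss = mixed_loss(
    cont_pred=[0.5, -1.0],
    cont_true=[0.0, -1.0],
    cat_logits=[[0.0, 0.0], [0.0, 0.0, 0.0]],  # uniform predictions
    cat_true=[0, 2],
)
print(loss)
```

Because both terms are means, the loss does not grow just because a model has more continuous or categorical columns, which seems to be the "consistent across models of varying sizes" property.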

I tried some other metrics based on balancing the error between the continuous and categorical elements, but it was hard to interpret the loss.

So far I'm able to train the model to a validation loss of 0.51 for normally distributed continuous variables with a stdev of 1, which I think is good, but I'll need to compare the outputs.

It's still underfitting slightly, which may be the result of using both swap-column data augmentation and dropout, so I'll have to explore lowering the dropout. Eventually I need to do an ablation study.

The other thing I'd like to do is compare it to the original VAE which outputs 1-hot encodings and uses MSE for the entire output vector. I think to do so I just need to modify the loss such that the categoricals are output in that form. I'm curious to see if explicit category embeddings and cross entropy loss help make a better fitting model.

It's probably been a while since you touched DAEs…

I found your github useful to get started. Is your latest version in DAE.py or in PortoSeguro.ipynb?

Did you end up using the activations from the DAE to train a Porto Seguro model?

I am considering a slightly different approach to applying the technique with fastai:

ColumnarModelData and ColumnarDataset can easily be modified to accept tfms

inputSwapNoise is turned into a class with a parameter p and a `__call__` method
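
A rough sketch of what such a class could look like, in plain Python. The class name `InputSwapNoise`, the `seed` argument, the default `p`, and the choice to operate on a whole batch of rows are my assumptions; this is not fastai's transform interface, just the swap-noise idea with a `p` parameter and a `__call__` method.

```python
import random


class InputSwapNoise:
    """With probability p, replace a cell by the same column's value from a random row."""

    def __init__(self, p=0.15, seed=None):
        self.p = p                      # illustrative default, tune as needed
        self.rng = random.Random(seed)

    def __call__(self, rows):
        n = len(rows)
        noisy = [list(row) for row in rows]  # copy so the input is not modified
        for i in range(n):
            for j in range(len(rows[i])):
                if self.rng.random() < self.p:
                    # The donor row may be row i itself or hold an identical value.
                    noisy[i][j] = rows[self.rng.randrange(n)][j]
        return noisy


batch = [[1, "a"], [2, "b"], [3, "c"], [4, "d"]]
noise = InputSwapNoise(p=0.5, seed=0)
print(noise(batch))
```

Since values are only ever swapped within a column, every column keeps values from its own marginal distribution.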

Next I was thinking that, instead of saving activations, we could save the first few layers of the autoencoder (up until the activations we are interested in); and reuse them in our final model after freezing them.

This is all stuff that's available in fastai for image data, and apparently not for columnar data.

Even though this result is quite old, I thought I might try to implement it (I can see that I'm not the first). Naturally, I have a lot of questions.

Michael Jahrer did a great writeup about this, but I need some help to understand the details.

I have 3 questions, and I hope you more experienced people can offer answers.

The idea seems to be to train a model (using a denoising autoencoder) and then use the activations of the hidden layers as features. This is very similar to using an autoencoder to reduce dimensions, but here the hidden layers are big, so the dimensions go up.

Q1
Step 1 is to use data augmentation to get noisy inputs by replacing p percent of the input values with the value from another row. Now, if a given column has only a few distinct values and they are not equally frequent, there is a pretty good chance the replacement value is the same as the original. Is p percent the number of original values that get replaced, or the number that actually change?
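
To put a number on the gap between "replaced" and "actually changed": if the donor row is picked uniformly at random (my assumption), a replacement differs from the original with probability 1 minus the sum of squared value frequencies. The sketch below does not settle which definition the writeup intends; it only shows how far apart the two can be. The function name `effective_change_rate` is made up.

```python
from collections import Counter


def effective_change_rate(column, p):
    """Expected fraction of cells that end up different under swap noise.

    Assumes each cell is replaced with probability p by the value of a
    uniformly random row (possibly the same row).
    """
    n = len(column)
    prob_same = sum((count / n) ** 2 for count in Counter(column).values())
    return p * (1.0 - prob_same)


# A binary column where 90% of the rows are 0 and 10% are 1.
column = [0] * 90 + [1] * 10
rate = effective_change_rate(column, p=0.15)
print(rate)  # well below the nominal 0.15 for such a skewed column
```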
Q2
Normalization: Michael Jahrer normalized the data using a thing called RankGauss. This does not just move the mean to 0 and the sd to 1; it also reshapes the distribution to be more bell-shaped. I have not seen any papers talking about that. Why is this a good idea, and when should you use it?
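
For readers who have not met the idea, here is a rough sketch of a rank-based Gaussian transform in plain Python: rank the values, map the ranks into (0, 1), then push them through the inverse normal CDF. This is one common variant and an assumption on my part, not necessarily the exact procedure from the writeup; the function name `rank_gauss`, the `(rank + 0.5) / n` mapping, and the tie handling are my choices.

```python
from statistics import NormalDist


def rank_gauss(values):
    """Map values to normal quantiles based on their rank (ties share a rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()  # standard normal: mean 0, sd 1
    # (rank + 0.5) / n stays strictly inside (0, 1), so inv_cdf is always defined.
    return [normal.inv_cdf((r + 0.5) / n) for r in ranks]


skewed = [1, 10, 100, 1000, 10000]
transformed = rank_gauss(skewed)
print(transformed)
```

Only the ordering of the values matters, so a heavy tail or an outlier gets pulled in to a moderate value rather than dominating the scale.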

Q3
Finally, Jahrer uses 5-fold CV, which means training the model 5 times, each time using 80% of the data for training and 20% for validation, and then averaging the 5 models' predictions.
I didn't see anything about this in the fastai class, and as best I understand cross validation, you need the models to be good at different things. How do you know when using more models, each trained on a subset of the data, would be better than a single model trained on the entire data set?
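
The mechanics of the 80/20 split and the averaging can be sketched in plain Python as below. The helper names `k_fold_indices` and `average_predictions` are made up, the folds are interleaved rather than shuffled for simplicity, and the model training itself is left out.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, valid_indices) for each of k folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


def average_predictions(per_model_predictions):
    """Average the test-set predictions of several models, row by row."""
    n_models = len(per_model_predictions)
    return [sum(preds) / n_models for preds in zip(*per_model_predictions)]


splits = list(k_fold_indices(n_rows=10, k=5))
for train, valid in splits:
    print(len(train), len(valid))  # each model sees 80% for training, 20% for validation

# Hypothetical test-set predictions from two of the fold models.
blended = average_predictions([[0.2, 0.4], [0.4, 0.8]])
print(blended)
```

Every row lands in exactly one validation fold, so the five validation scores together cover the whole training set once.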

Hope these questions are not so simple that I should have figured them out myself.