PyTorch question: Need help understanding weight tying in relation to encoders and targets

I’m working on reproducing the Porto Seguro winning solution of a denoising autoencoder on a different dataset. Since my categorical inputs are quite large (30K+) I was hoping to take advantage of embeddings in the input in the same way we did in Part 1 Lesson 3 Rossman.

In order to help regularize and help with convergence I’m also working on tying the weights of the encoder to those of the decoder like Bengio suggests in his first denoising autoencoder paper. There’s an example of weight tying in the language model which I understand reasonably well, and it seems very straightforward.

What I’m trying to understand by going through the code is how the target categories are supposed to get converted. I can’t see it happening anywhere in the code I’ve looked through. To my understanding the dataloaders are passing the category code that indexes the embedding for the target word, but I don’t see the actual embedding being called anywhere.

How/where does the target get transformed? Does the crit function (ostensibly F.cross_entropy) take the index of a word and compare that to the output of the model? It seems like there’s some sort of implicit conversion to a one-hot encoding going on here, but I don’t see where that’s happening in the code.

And in my case, since I’m working with several embeddings and trying to train an autoencoder, my feeling is that I should instead be targeting the RMSE of the model to the embedding outputs themselves rather than the softmax outputs. Is that correct, or do I need to come up with a crit that encapsulates the cross entropy of all of the embeddings as well as the continuous variables?

Finally, if I’m targeting the output of the embeddings, where do I specify that the target is an index of those embeddings?

As a fallback I know I can just one hot encode my categoricals but I’d really like to take advantage of input embeddings.

Yes that’s exactly how that loss function works! :slight_smile: Check out the PyTorch docs on CrossEntropy, and also take a look at the single object classifier we built in the last class - you’ll see that the y values are plain ints, and the model outputs are one score per class (F.cross_entropy applies the log-softmax itself, whereas NLLLoss expects log-probabilities). This is what the loss function expects.
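To make the "implicit one-hot" point concrete, here is a minimal sketch comparing `F.cross_entropy` on integer targets with a hand-written version that builds the one-hot encoding explicitly. The tensor sizes and values are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw model outputs: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # plain integer class indices, not one-hot

# Built-in loss: takes the indices directly.
loss_builtin = F.cross_entropy(logits, targets)

# Manual equivalent: one-hot encode the targets, then keep only the
# log-probability of the true class for each sample and average.
one_hot = F.one_hot(targets, num_classes=3).float()
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -(one_hot * log_probs).sum(dim=1).mean()

print(loss_builtin.item(), loss_manual.item())
```

The two numbers should agree, which is why no explicit conversion of the target shows up anywhere in the model code: the index is only used to pick out one entry of the log-softmax output.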

Interesting, that’s a really clever way to do it. Now that I look at the function it’s pretty straightforward, but that’s not how I would have expected it to work.

Any thoughts on the autoencoder with multiple categories? I’m guessing I’ll have to come up with a loss function that equally weights the cross entropy of each of the categoricals and balances that evenly with the continuous variables. That should be doable though, assuming I can come up with some reasonable relative weighting between categories and conts.

Have you heard of any architectures that directly target the embeddings? The only way I can think to do that would be to pass both the input and the (detached) target back through the forward pass as the input and have a loss function that ignores the empty target and instead splits the input before comparing. Does that sound feasible?

Thanks as always for your input and help!

Your loss function will simply have multiple softmaxes, one per categorical (and none for continuous, of course). And yes you’ll need to weight the cross-entropy and mse bits to make them similar.
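A rough sketch of what such a combined loss could look like, assuming the decoder returns one logits tensor per categorical column plus a tensor of continuous predictions. The function name `mixed_reconstruction_loss` and the `cat_weight` / `cont_weight` parameters are hypothetical, and the default weights of 1.0 are placeholders rather than recommended values.

```python
import torch
import torch.nn.functional as F

def mixed_reconstruction_loss(cat_logits, cat_targets, cont_pred, cont_target,
                              cat_weight=1.0, cont_weight=1.0):
    """Hypothetical helper: mean cross-entropy over categoricals plus MSE over continuous."""
    # cat_logits:  list of (batch, n_classes_i) tensors, one per categorical column
    # cat_targets: (batch, n_cats) LongTensor holding the category index for each column
    ce = sum(F.cross_entropy(logits, cat_targets[:, i])
             for i, logits in enumerate(cat_logits)) / len(cat_logits)
    mse = F.mse_loss(cont_pred, cont_target)
    # The two terms live on different scales, so the weights need tuning.
    return cat_weight * ce + cont_weight * mse

torch.manual_seed(0)
batch, cardinalities, n_cont = 8, [5, 12], 3   # made-up sizes
cat_logits = [torch.randn(batch, k) for k in cardinalities]
cat_targets = torch.stack([torch.randint(0, k, (batch,)) for k in cardinalities], dim=1)
cont_pred, cont_target = torch.randn(batch, n_cont), torch.randn(batch, n_cont)

loss = mixed_reconstruction_loss(cat_logits, cat_targets, cont_pred, cont_target)
print(loss.item())
```

Averaging the per-column cross-entropies keeps the categorical term from growing with the number of columns; summing instead would be an equally reasonable choice if paired with a smaller `cat_weight`.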

@kcturgutlu has worked on implementing this solution - check his post history since I think he posted a link to an implementation…

I’ve been using his codebase as a starting point but there are a number of differences. In all of the examples of embeddings I can find in his repository the crit is torch.nn.NLLLoss. As far as I can tell he directly predicted the class of whether a driver was safe or not using the mixed data method as a point of comparison. The autoencoder implementation uses one-hot encoded input with an MSE crit.

I’m hoping to combine the two ideas so that the DAE takes mixed input data, but as far as I can tell his implementation doesn’t do that. I want to compare its performance to the one-hot encoded input, both in terms of the ability to reproduce the input and the effectiveness as a pretrained feature embedding in a recommender.

Hi Even,

I would like to help you out with your project and even work with you on it if you allow me to. First, my denoising autoencoder project fell short of replicating Michael’s (the Porto Seguro winner’s) solution, and the exact solution write-up seems to be gone from the discussions, so I’ve left it as it is. But the implementation should be solid, or at least the theory is easy enough to implement again with care. By the way, don’t worry about the embeddings part since it’s very easy to implement even if my repo is missing it.

“The idea behind denoising autoencoders is simple. In order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, we train the autoencoder to reconstruct the input from a corrupted version of it.” from

The basic idea is to corrupt X with data augmentation into X_tilde, then learn the mapping from X_tilde back to X. This process supposedly gives you a better representation of your data. Here, X is just conts + cats, and the weights are learned by optimizing MSE, since that makes sense when computing loss(X, generated X). Also, by normalizing each and every input to (0,1) before training we avoid scale issues and thus avoid weighting any particular feature too heavily during backprop.
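The corruption step, X to X_tilde, might be sketched as below. The thread does not say which corruption to use, so "swap noise" (replacing a cell with the value from the same column in a random row) is an assumption here, as are the function name `swap_noise` and the 0.15 swap probability.

```python
import torch

def swap_noise(x, p=0.15):
    """Hypothetical corruption: with probability p, replace each cell with the
    value from the same column of a randomly chosen row."""
    mask = torch.rand(x.shape, device=x.device) < p                   # cells to corrupt
    rand_rows = torch.randint(0, x.size(0), x.shape, device=x.device) # source row per cell
    cols = torch.arange(x.size(1), device=x.device).expand_as(rand_rows)
    x_swapped = x[rand_rows, cols]                                    # same column, random row
    return torch.where(mask, x_swapped, x)

torch.manual_seed(0)
x = torch.arange(20.).view(5, 4)   # toy data: 5 rows, 4 features
x_tilde = swap_noise(x)            # corrupted input fed to the autoencoder
# Training would then minimise loss(model(x_tilde), x), with the clean x as the target.
print(x_tilde)
```

Because values are only ever swapped within a column, each corrupted feature still comes from that feature's own empirical distribution, which works the same way for normalized continuous columns and for category codes.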

I missed working on structured deep learning, I’ve been doing images for a long time now, so if you are interested we can properly implement the idea in a more general scope (allowing multiple augmentations, normalizations, corruptions) write a blog post about it and publish it for automated feature engineering on top of fastai and/or pytorch.


I’ve got to get to bed but let’s chat more on this later. I’ve got a working version of most components, but your code was a big help in getting there and it would be helpful to have someone to walk through it and point out any concerns.

I’d certainly be interested in taking our combined efforts and applying it to an integration into the fastai library and a blog post. My code is close to the library in terms of style but it’s not exactly the same. I’d like to get a working version going first though.

I’m going to take @jeremy’s advice and work on the loss function. Other than that it’s mostly piping. The weight tying might be a little tricky as well, but at least there are examples there. I was stuck when I couldn’t understand how encoders were evaluating the loss from an index, but knowing that it happens directly in the loss function has unblocked me.

Sounds good, talk to you later :wink:

I’ve made some reasonable progress on updating my implementation. So far I have a working dataset, dataloader, and modeldata, and the model is close.

I want to try to confirm my understanding of the weight tying though. Based on the pytorch language model example and the library it seems fairly straightforward in terms of how to tie weights, but I want to articulate what I think is going on both for my benefit and for others.

The weights from the embedding are stored as a lookup table because their inputs are conceptually one-hot encoded: multiplying a one-hot vector by the weight matrix just selects the matching row, so the output is identical to the values of the weights themselves for that category.
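A quick way to check this equivalence, with made-up sizes (K categories, N embedding dimensions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
K, N = 6, 4                       # made-up sizes: 6 categories, 4-dim embedding
emb = nn.Embedding(K, N)
idx = torch.tensor([2, 5, 0])     # category codes for three samples

looked_up = emb(idx)                                # row lookup, shape (3, N)
one_hot = F.one_hot(idx, num_classes=K).float()     # shape (3, K)
multiplied = one_hot @ emb.weight                   # (3, K) @ (K, N) -> (3, N)

print(torch.allclose(looked_up, multiplied))
```

The lookup is just a cheaper way of doing the matrix multiply, which matters when K is in the tens of thousands.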

When it comes to the inverse linear decoder layer, these shared weights are used to produce a one hot encoding because that’s what we’re targeting. The network is trying to learn what inputs to that final layer, when multiplied by the shared weights, produces the desired one hot encoding.

From what I’ve read this isn’t necessary, but has a regularizing effect on the network, and seems to work well in language modelling.

Is this understanding correct? Any pointers or thoughts?

Your description sounds about right to me. When you say “this isn’t necessary”, what are you referring to as “this”? Are you referring to weight tying there?

Yeah, that’s what I was referring to.

The aspect I found challenging to understand but which I think I’ve figured out regarding weight tying in symmetrical autoencoder networks is whether the weights need to be transposed or not. In my research I’ve seen a reference to using the transpose of the weights for the linear layers, which makes some sense to me because we’re inverting the shape of the linear layers, and most of the examples provided ( / awd-lstm) change the shape of the output tensor, but that’s not the same as a transpose.

I’m still not certain about those view calls in LinearDecoder (and the awd-lstm source):

    decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
    result = decoded.view(-1, decoded.size(1))

A transpose isn’t necessary for the embedding -> decoder mapping because the tensors in both cases are the same shape, i.e. an embedding of K categories into an N-dimensional vector has a weight tensor of shape KxN, and the corresponding decoder is a linear layer from N to K, which also has a weight tensor of shape KxN.

I suspect the view here has to do with mini-batching, but I need to walk through step by step what it’s doing because right now I don’t understand the reason for it.
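The shape argument can be checked directly. This is a minimal sketch with made-up sizes; assigning the parameter is the same pattern the PyTorch language model example uses for tying.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
K, N = 6, 4                                  # made-up sizes
emb = nn.Embedding(K, N)                     # weight shape (K, N)
decoder = nn.Linear(N, K, bias=False)        # weight shape (out, in) = (K, N)
assert emb.weight.shape == decoder.weight.shape == (K, N)

decoder.weight = emb.weight                  # tie: both modules share one Parameter

hidden = torch.randn(3, N)                   # stand-in for the last hidden activations
logits = decoder(hidden)                     # nn.Linear computes hidden @ weight.T
print(logits.shape)
```

The transpose still happens, but inside `nn.Linear`, which stores its weight as (out_features, in_features) and applies `x @ weight.T`. That is why the stored tensors match without any reshaping.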
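One way to walk through it, assuming the RNN output has shape (seq_len, batch, hidden), which is PyTorch's default layout for recurrent layers. The sizes are made up, and this is my reading of the snippet rather than something confirmed against the library source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, batch, hidden, n_tokens = 5, 3, 4, 10    # made-up sizes
output = torch.randn(seq_len, batch, hidden)       # stand-in for the RNN output
decoder = nn.Linear(hidden, n_tokens)

# Merge the time and batch dimensions so every position is one row.
flat = output.view(output.size(0) * output.size(1), output.size(2))   # (15, 4)
decoded = decoder(flat)                                               # (15, 10)
result = decoded.view(-1, decoded.size(1))                            # still (15, 10)

# The targets are flattened the same way, giving one class index per row.
targets = torch.randint(0, n_tokens, (seq_len, batch)).view(-1)       # (15,)
loss = F.cross_entropy(result, targets)
print(result.shape, loss.item())
```

Under these assumptions the first view only flattens (seq_len, batch) into a single dimension so the output lines up row-for-row with the flattened targets, and the second view leaves the shape unchanged. Neither is a transpose.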

Anyway thanks for the help.

Yes exactly - it’s rather convenient!