# Any sources on tensor transformations dealing with batch?

I’m working on a simple attention model and I’m trying to optimize the math, but dealing with batch has been very challenging. I’m planning to sit down and just muck about with permute, view, and transpose to make sure I understand them but I’m wondering if anyone has any good sources for visualizing this sort of thing?

There don't seem to be many materials online on this. Here is what I found particularly useful - maybe some of it will be of help.

Part 2 lecture / notebook on predicting bounding boxes has a lot of goodies, especially the notebook. Working through the code there makes one start thinking in terms of operations on multidimensional arrays, how the IOU between two sets of boxes is calculated, how the boxes are matched, etc. It is quite hardcore though and working through it was a mini project in its own right for me.

Taking a closer look at broadcasting is also helpful. It’s something that we take for granted since it just works, but for more complex applications, I found a refresher on how it works and what it does quite helpful. Here is a great resource on this.
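As a quick refresher, here is a minimal sketch of the broadcasting rules (shapes are made up for illustration): size-1 dims and missing leading dims are implicitly expanded so elementwise ops work across a batch.

```python
import torch

# A (batch, features) tensor combined with a (features,) tensor:
# the 1-d tensor is implicitly expanded along the batch dimension.
x = torch.ones(4, 3)             # batch of 4 examples, 3 features each
b = torch.tensor([1., 2., 3.])   # one vector, no batch dim

y = x + b                        # b is broadcast to shape (4, 3)
print(y.shape)                   # torch.Size([4, 3])

# Dims are aligned from the right; size-1 dims expand to match:
a = torch.ones(4, 1, 3)
c = torch.ones(5, 1)             # treated as (1, 5, 1)
print((a * c).shape)             # torch.Size([4, 5, 3])
```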

Other than that, it was searching Google and usually ending up on the PyTorch forums for an answer. Plenty of very nice information there with some good explanations.

This is quite tangential, but what I found very, very helpful for this was using `set_trace()`. I ran the training, and upon hitting `forward` for the first time it would drop me into the debugger. I find this super crucial and it would be my go-to methodology for tackling such problems. Recently I have been building a much simpler model, but I find this so useful that I started with an empty `forward` method containing just the `set_trace()` and went from there, switching between trying things in the debugger and moving the code into the body of the method.
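A minimal sketch of that workflow (the module name and shapes are made up; the `set_trace()` is commented out here so the snippet runs non-interactively):

```python
import torch
import torch.nn as nn
# from pdb import set_trace  # or: from IPython.core.debugger import set_trace

class Skeleton(nn.Module):
    """Start with a forward() that only drops into the debugger,
    then build the real body up interactively."""
    def forward(self, x):
        # set_trace()  # pauses on the first batch; inspect x.shape,
        #              # try permute/view/transpose live, then move the
        #              # working lines into the method body
        return x

model = Skeleton()
out = model(torch.zeros(2, 3))
print(out.shape)  # torch.Size([2, 3])
```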

Thanks for the response. I’ve dug a little on the PyTorch forums but haven’t really found any great explanations. I think ultimately I need to spend some time mucking about myself to gain a better intuitive understanding.

`set_trace()` is definitely my go-to for analysing the net and making sure that the variables and sizes line up.

I think in this case the way I am stacking the weights via cat onto dimension 0 is making the transform more complex. What I want to do is really simple: a batch-aware matrix multiply of the weights by their transpose to get their inner product.

As usual I’m overthinking/researching when I really should just be diving into the data. Anyway, thanks for your help. I’ll take another look at that Part 2 notebook.

Hey Even,

I am not sure if this is helping - if not, please tell me and I will STFU. I am not exactly sure what sort of thing you are doing with attention - seems like it might be something complex. But I was thinking - and maybe this is just a case of adopting a different naming convention, but maybe it will be of help:

I can’t check right now, but I think the weights tensor always has the dimensionality it would have if it were applied to a single example? PyTorch does whatever magic it needs to do to scale it to the batch of examples (likely via broadcasting). I am not sure if I am reading this right, but possibly what you are referring to are activations of a specific batch size that you might want to do something to? A small difference, but if that is the case, adopting the name ‘activations’ might help with googling this.

I am not sure if this is attention for a CNN or for an RNN - if for the latter, then there is the added complexity of how PyTorch batches examples by default (sequence length before batch size in the dimensions, or something like that).
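For the record, that RNN batching detail can be checked with a tiny sketch (hypothetical sizes): PyTorch RNNs default to `(seq_len, batch, input_size)`, and `batch_first=True` switches to `(batch, seq_len, input_size)`.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20)                   # seq-first (default)
rnn_bf = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

x_seq_first = torch.zeros(7, 4, 10)    # 7 timesteps, batch of 4
x_batch_first = torch.zeros(4, 7, 10)  # batch of 4, 7 timesteps

out, _ = rnn(x_seq_first)
out_bf, _ = rnn_bf(x_batch_first)
print(out.shape)     # torch.Size([7, 4, 20])
print(out_bf.shape)  # torch.Size([4, 7, 20])
```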

Anyhow - best of luck. Sorry if I am not being helpful. I am not great with any of this either, but if you feel bouncing ideas off me might be worth your while, I’m happy to try to help.

Best of luck with giving your model the ability to pay attention!

My bad; I was writing this as my 3y/o was waking up from his nap and didn’t check before posting. You’re right about the weights, and it’s the wrong term. It should definitely read activations or layer outputs. What I meant to say was:

batch aware MM of the layer outputs by their transpose.

Right now I’m stacking those outputs using concatenation, so the 64x1500 outputs became 320x1500 because I was concatenating along d0. I think I need to use d1 and then use view to get a 64x1500x5 tensor. But I need to get clearer on how transpose works with matrices.

Thanks for the help, I appreciate the conversation.

Or you can unsqueeze your outputs on dim 2 and then concatenate them on that dim; that way you won’t have to transpose.
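A minimal sketch of that suggestion, using the 64x1500 shapes from the thread (hypothetical sizes):

```python
import torch

# Five layer outputs of shape (64, 1500) each
outputs = [torch.randn(64, 1500) for _ in range(5)]

# cat along dim 0 merges into the batch dim: (320, 1500) - the problem above
flat = torch.cat(outputs, dim=0)
print(flat.shape)      # torch.Size([320, 1500])

# unsqueeze each to (64, 1500, 1), then cat along dim 2: (64, 1500, 5)
stacked = torch.cat([o.unsqueeze(2) for o in outputs], dim=2)
print(stacked.shape)   # torch.Size([64, 1500, 5])

# torch.stack does the unsqueeze + cat in one call
assert torch.equal(stacked, torch.stack(outputs, dim=2))
```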

Finally got around to implementing this. Unsqueeze was the magic sauce I was looking for. Thanks for the help!

Also, bmm is amazing.
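For anyone landing here later, a minimal `bmm` sketch of the batch-aware multiply discussed in this thread (shapes are the hypothetical 64x1500x5 from above):

```python
import torch

# Stacked layer outputs: batch of 64, each a (1500, 5) matrix
stacked = torch.randn(64, 1500, 5)

# transpose(1, 2) swaps the two non-batch dims: (64, 5, 1500).
# bmm multiplies matching pairs along the batch dim, giving a
# per-example (5, 5) matrix of inner products between the 5 outputs.
gram = torch.bmm(stacked.transpose(1, 2), stacked)
print(gram.shape)  # torch.Size([64, 5, 5])
```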