Unets: Trying to understand the maths

I’m trying to get used to reading papers by starting with the one on U-Nets, but I’m having a hard time with the maths.
In the Training section:

I’m trying to understand what this a_k(x) actually is, and why they say that x belongs to Ω and Ω is a subset of Z^2. Is it relevant for the implementation? Does someone have a better reformulation of this sentence: “a_k(x) denotes the activation in feature channel k at the pixel position x ∈ Ω with Ω ⊂ Z^2”? That would help me a LOT!
Thank you.

1 Like

You have k possible labels for each pixel in your M x N size image.

Each point x is a pair {x0, x1}, where x0 is in [0, M) and x1 is in [0, N); in other words, x is in [0, M) x [0, N) (a Cartesian product), which is a subset of Z^2 (apologies for butchering mathematical notation on my phone).

So you’re calculating a softmax over all your channels for each pixel in the image, and your loss is cross-entropy with the ground-truth segmentation for your image. Each a_k(x) is that channel’s output for that pixel, before the softmax layer.
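In PyTorch that looks roughly like the sketch below (the shapes are invented for illustration; note that `F.cross_entropy` applies the log-softmax internally, so it takes the raw logits):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 4, K=20 classes, 8x8 image.
logits = torch.randn(4, 20, 8, 8)          # a_k(x) for every pixel x
labels = torch.randint(0, 20, (4, 8, 8))   # ground-truth class index per pixel

# Per-pixel softmax over the K channels (dim=1 is the channel axis).
probs = F.softmax(logits, dim=1)           # shape [4, 20, 8, 8]

# The 20 channel values at each pixel now sum to 1.
assert torch.allclose(probs.sum(dim=1), torch.ones(4, 8, 8))

# Cross-entropy with the ground-truth segmentation, averaged over pixels.
loss = F.cross_entropy(logits, labels)
```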

Not sure if that helps.

1 Like

The x ∈ Ω with Ω ⊂ Z^2 stuff is math formalism that you can ignore. It just says that pixel positions are integers and x is a vector of such pixel positions.

A feature map is a 2-dimensional (grayscale) image that is the output of a single filter (also known as a channel) from a convolutional layer. So if your Conv2D layer has 32 filters then it produces 32 feature maps, and each of these feature maps has the same width and height. (Exactly how wide and tall they are depends on the previous layers etc. This is explained in part 1 of the course.)

The notation a_k(x) just says: the value in feature map k at the pixel coordinates given by x (which is a vector with 2 elements).

The softmax function they define here works just like the softmax you’ve seen in part 1, except that it is done for each pixel.

In a normal classifier you end up with a dense layer that outputs a vector of, say, 20 neurons and you take the softmax over those to get a prediction for each of the 20 corresponding classes. Here, however, you don’t want to make a single prediction for the entire image, but a prediction for every single pixel in the image.

So if k = 20 (there are 20 classes) then for pixel (0,0) you do a softmax over all 20 feature maps at coordinates (0,0); for pixel (0,1) you do a softmax over all 20 feature maps at coordinates (0,1); for pixel (0,2)… and so on.
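To make that concrete, here is a small sketch (with made-up sizes) showing that the softmax at a single pixel is just a softmax over the k activations at those coordinates, and that doing it for all pixels at once is a softmax over the channel axis:

```python
import torch
import torch.nn.functional as F

k = 20
feature_maps = torch.randn(k, 5, 5)   # hypothetical: 20 feature maps of a 5x5 image

# Softmax at pixel (0, 0): take the 20 activations a_k((0, 0)) ...
a_00 = feature_maps[:, 0, 0]                     # shape [20]
p_00 = torch.exp(a_00) / torch.exp(a_00).sum()   # ... and softmax them by hand

# Doing this for every pixel at once is a softmax over the channel axis.
p_all = F.softmax(feature_maps, dim=0)           # shape [20, 5, 5]
assert torch.allclose(p_00, p_all[:, 0, 0])
```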


Thanks a lot for your help guys! Really appreciated!

So Z^2 is actually the domain of integer vectors? I don’t quite understand why it’s Z^2 instead of Z.

So given both your explanations: let’s say my neural network outputs 1 feature map of size 388x388. Am I right to say:

(excuse me for the image, but this forum doesn’t seem to accept LaTeX syntax).
Thanks a lot again guys.

If a is an integer, the domain of a is Z. If x is a vector of two integers, the domain of x is Z×Z, or Z^2. If x is a vector of three integers, the domain is Z^3, and so on.

It’s just a way to notate the size of the vector. Instead of saying, “x is a vector with 3 integer elements”, you can say, “x in Z^3”. It’s just math shorthand. :wink:

[Likewise, if X is an N×M matrix of integers you could say "X in Z^(N×M)".]

1 Like

Thank you so much!!

Another little question regarding this. Why did they say “x ∈ Ω with Ω ⊂ Z^2” instead of just saying “x ∈ Z^2” or “x ⊂ Z^2”? I’m sorry if that seems like a stupid question, but I really want to understand the details.

Another thing: why would x be of size 2? Shouldn’t the size of x depend on the number k of feature maps?

x can’t be anywhere in Z^2; it can only be within your image bounds, which is a subset of Z^2 (don’t know how to make an omega symbol on here).

You end up getting an output that is num_pixels * num_channels, but you can look at that as a different output for each pixel instead (since you’re applying softmax independently to each one).
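One way to see that equivalence in code (shapes invented for illustration): flatten the output so each row holds one pixel’s channel values, and the per-row softmax matches the channel-axis softmax on the original tensor.

```python
import torch
import torch.nn.functional as F

out = torch.randn(2, 3, 4, 4)  # hypothetical [batch, channels, H, W] output

# Move channels last and flatten the pixels: each row is one pixel's logits.
per_pixel = out.permute(0, 2, 3, 1).reshape(-1, 3)  # shape [2*4*4, 3]

# Softmax per row gives the same numbers as softmax over the channel axis.
assert torch.allclose(
    F.softmax(per_pixel, dim=1),
    F.softmax(out, dim=1).permute(0, 2, 3, 1).reshape(-1, 3),
)
```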

1 Like

Thanks! It’s a lot clearer now :slight_smile:

Even after your explanations, guys, I’m still trying to implement this “energy function” in PyTorch with no success.
Here is my code so far:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyFunction(nn.Module):
    def __init__(self):
        super(EnergyFunction, self).__init__()

    def forward(self, logits, targets):
        num = targets.size(0)
        A = logits.view(num, -1)   # flatten each item in the batch
        B = targets.view(num, -1)  # flatten each item in the batch
        C = []
        for i in range(num):
            # stack logits and targets for item i, then softmax over them
            C.append(F.softmax(torch.stack([A[i], B[i]]), dim=0))
        return C


I really don’t know what to do or even what to search for on Google. I believe I need more training in linear algebra…

I’m not sure why you’re trying to take the softmax over the logits and the targets. Wouldn’t you just take the softmax over the logits and then do a cross-entropy between that softmax and the targets?

1 Like

My neural network was outputting 1 feature map (instead of the 2 from the original paper), so I wanted to experiment with 2 feature maps. But you’re right: the function should take into account the 2 output feature channels of the neural network instead. I’ve modified my neural net architecture so that it now outputs 2 feature channels.
Following your instructions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyFunction(nn.Module):
    def __init__(self):
        super(EnergyFunction, self).__init__()

    def forward(self, logits):
        # logits is of shape [batch_size, feature_channels, height, width]
        # Here: [4, 2, 388, 388]
        r = F.softmax(logits)
        # r is of shape: [4, 2, 388, 388]
        return r

But shouldn’t r be of shape [4, 388, 388], i.e. [batch_size, height, width]? The thing is, I really don’t know how to pass the pair of values from the 2 feature maps at each pixel into F.softmax() :disappointed:.
Thanks again.

I have zero experience with PyTorch, but softmax() usually works across the entire thing. Here, you want to make it work across the feature_channels axis only.
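In PyTorch that axis is selected with the `dim` argument of `F.softmax` — a minimal sketch, reusing the shapes from the posts above:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2, 388, 388)   # [batch_size, feature_channels, H, W]

# dim=1 restricts the softmax to the feature-channel axis, so each pixel's
# 2 channel values sum to 1 independently of every other pixel.
r = F.softmax(logits, dim=1)           # shape stays [4, 2, 388, 388]
assert torch.allclose(r.sum(dim=1), torch.ones(4, 388, 388))
```

The output keeps the channel axis; if you want a single [4, 388, 388] map you would then take, say, `r[:, 1]` (the probability of one class) or `r.argmax(dim=1)` (the predicted class per pixel).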

1 Like