Edit: I had a bug in my custom loss function. My network is indeed not learning anything. A bit of a bummer but at least it makes sense now.
I implemented a custom loss function that does the following for each minibatch:
- Finds the lowest to highest sort order of the predictions
- Applies this sort order to the response variable such that a copy of the response variable is created that is sorted in order as determined by the predictions. I will call this “prediction-sorted response variable” in subsequent steps.
- Takes the cumulative sum of the prediction-sorted response variable
- Normalizes the cumulative sum of the prediction-sorted response variable by dividing by the sum of all the responses, so that it runs from just above 0.0 up to 1.0.
- This gives a Lorenz curve, as used in economics. I approximate the area under this Lorenz curve by summing up the successive differences in the normalized cumulative sum of the prediction-sorted response variable.
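To make the steps above concrete, here is a minimal sketch of how they might look in PyTorch. The function name `lorenz_area_loss` is hypothetical, the inputs are assumed to be 1-D tensors with positive responses, and the area step is an assumption: it uses a simple right Riemann sum with step 1/n rather than the successive-difference sum described above, since the exact approximation is not pinned down here.

```python
import torch


def lorenz_area_loss(preds: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Area under the Lorenz curve of y, ordered by preds (hypothetical helper)."""
    # 1. Lowest-to-highest sort order of the predictions.
    order = torch.argsort(preds)
    # 2. Prediction-sorted response variable.
    y_sorted = y[order]
    # 3. Cumulative sum of the prediction-sorted response variable.
    cum = torch.cumsum(y_sorted, dim=0)
    # 4. Normalize by the total response so the curve ends at 1.0.
    lorenz = cum / y.sum()
    # 5. ASSUMPTION: approximate the area with a right Riemann sum (step 1/n).
    return lorenz.sum() / lorenz.numel()


# Toy minibatch, values chosen only for illustration.
preds = torch.tensor([0.3, 0.1, 0.2])
y = torch.tensor([3.0, 1.0, 2.0])
area = lorenz_area_loss(preds, y)
```

Note that in this sketch the predictions enter only through `torch.argsort`, which returns integer indices.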
Somewhat surprisingly to me, the network seems to learn something in the presence of this weird loss function.
But what I can’t wrap my head around is how the heck it learns. I would think the sorting step would make it impossible for PyTorch to keep track of the gradient that each observation in the minibatch contributes to each of the network’s weights. Can anyone who understands backprop better than I do explain how the network is learning? Or suggest a test to confirm that it’s learning correctly?
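One sanity check along these lines, sketched under the assumption that the loss is implemented literally as in the steps above (the tensors and variable names here are made up for illustration): see whether the loss is attached to the autograd graph at all, and compare that with indexing the predictions themselves by the sort order.

```python
import torch

# Stand-in for network outputs; requires_grad=True mimics a tensor that
# depends on trainable weights.
preds = torch.tensor([0.3, 0.1, 0.2], requires_grad=True)
y = torch.tensor([3.0, 1.0, 2.0])  # response variable, no gradient needed

# The sort order is a tensor of integer indices and carries no grad_fn.
order = torch.argsort(preds)

# Loss built only from the response reordered by those indices.
y_sorted = y[order]
loss = (torch.cumsum(y_sorted, dim=0) / y.sum()).mean()

# False here: nothing in the loss is differentiable with respect to preds,
# so calling loss.backward() would raise an error.
loss_has_grad = loss.requires_grad

# Contrast: indexing preds itself by the sort order does propagate gradients,
# because the sorted values (not just their order) come from preds.
sorted_preds = preds[order]
sorted_preds.sum().backward()
grad_reaches_preds = preds.grad is not None
```

If `loss_has_grad` comes out `False` for the real loss, the optimizer has nothing to work with, whatever the training curve appears to show.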