Exploding gradient problem in residual networks

I am training a network to detect vehicles in point clouds. The backbone consists of residual blocks, followed by a couple of upsampling layers that form a feature pyramid, and finally heads for classification and regression; the network is fully convolutional. The output is dense, about 200x175 predictions. I am using focal loss for classification and smooth L1 loss for regression, combined as Total_loss = (1/m) * (sum(cla_loss) + sum(reg_loss)), where m is the number of samples per batch. The weights are initialized with Kaiming initialization. During training, the loss starts quite high (in the thousands), decreases over some iterations, and eventually goes to zero, but the gradients become quite large. I am following this paper: [paper].
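For reference, here is a minimal sketch of the loss as I described it: sigmoid focal loss and smooth L1 summed over the dense output, then divided by the batch size m. The tensor shapes and the binary-target convention are illustrative assumptions, not the exact code I am running.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, reg_preds, reg_targets,
               alpha=0.25, gamma=2.0):
    """Total_loss = (1/m) * (sum(cla_loss) + sum(reg_loss))."""
    # Binary sigmoid focal loss, summed over every dense prediction.
    p = torch.sigmoid(cls_logits)
    pt = torch.where(cls_targets == 1, p, 1 - p)
    at = torch.where(cls_targets == 1,
                     torch.full_like(p, alpha),
                     torch.full_like(p, 1 - alpha))
    cla_loss = -(at * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).sum()

    # Smooth L1 regression loss, also summed (not averaged).
    reg_loss = F.smooth_l1_loss(reg_preds, reg_targets, reduction='sum')

    m = cls_logits.shape[0]  # number of samples per batch
    return (cla_loss + reg_loss) / m
```

Because the sums run over ~200x175 locations, the per-batch loss (and hence the gradient magnitude) scales with the output resolution, which is why it starts in the thousands.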
Currently, I am employing gradient norm clipping (max norm = 1).
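Concretely, the clipping is applied between `backward()` and the optimizer step; the tiny model and data here are stand-ins for the actual network:

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the detection network
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(8, 4)
loss = (model(x) * 100).pow(2).sum()  # deliberately large loss -> large gradients
loss.backward()

# Clip the global gradient norm to 1 before the optimizer step.
# clip_grad_norm_ returns the total norm *before* clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()
```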

  1. Are there any other methods to mitigate gradient explosion?
  2. Is the gradient explosion caused by the summation (instead of the mean) in the loss function, given that the output is dense?
  3. Can the convolution layer weights be initialized with an identity matrix? (I don’t see a function to do this gracefully in PyTorch.)
    P.S.: I am using SGD with momentum = 0.9, PyTorch, the KITTI BEV dataset, and batch norm; the input is a sparse 3D occupancy grid (BEV representation of the point cloud).
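On question 3: PyTorch does ship `torch.nn.init.dirac_`, which fills a conv kernel so that each output channel passes its matching input channel through unchanged (an identity mapping, provided in/out channel counts match and the bias is zeroed). A minimal sketch:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 16, kernel_size=3, padding=1)
nn.init.dirac_(conv.weight)   # 1 at each kernel center, output ch i <- input ch i
nn.init.zeros_(conv.bias)     # bias must be zeroed for a true identity

x = torch.randn(1, 16, 8, 8)
y = conv(x)                   # y equals x with this initialization
```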