I just started to experiment with HRNet and my initial results look quite promising. Compared to ResNet, I have to use smaller learning rates, but it still converges as fast as ResNet.
The core idea behind HRNet is to not only use a single resolution as ResNet, but to keep multiple resolutions, compute with them in parallel and share the information (fusion) across the different resolutions.
This architecture seems to be powerful enough that it can be used for classification and image segmentation while the only difference can be found in the last convolutions/computations.
Explanation by one of the authors (starts at around 8:00):
Classification, Object Detection, Semantic Segmentation, …