You're right, good points. I have no clue, since I have no experience with 3D or very large images. That paper just came out, and I'm not aware of an implementation of it (if there is one, I would gladly try it). I suggested it because of its benchmarks and speed, and because I found their way of implementing attention, and their comparisons to other attention mechanisms, useful. But you're right, I have no clue how it would do on that kind of data. Their "Mixed attention" is a function of the pixels and channels, so if I got it right, it should work.
How does ResNet work with such data, i.e. 3D and very large?