Hi Hazard, I hope all is well!
I agree with the comments posted so far and would add that your images are far less distinctive than, say, types of cat, so you will need more data: 50-200 images per class, maybe more, would be a good start. These should be as distinctive as possible. Generally, the more distinctive the classes, the less data you need.
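If it helps, here is a rough sketch for checking how many images you have per class. It assumes one folder per class (the parent folder name is the label); the function names and the `MIN_PER_CLASS` threshold are my own, picked from the low end of the range above.

```python
from collections import Counter
from pathlib import Path

# Assumption: lower end of the 50-200 images-per-class range suggested above.
MIN_PER_CLASS = 50


def count_by_class(image_paths):
    """Count images per class, taking the parent folder name as the label."""
    return Counter(Path(p).parent.name for p in image_paths)


def underfilled(counts, minimum=MIN_PER_CLASS):
    """Return the classes with fewer than `minimum` images, sorted by name."""
    return sorted(name for name, n in counts.items() if n < minimum)
```

For example, `underfilled(count_by_class(Path("data").glob("*/*.png")))` would list the classes that still need more images, where `data` is a hypothetical root folder.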
Your confusion matrix clearly shows that all five of your classes look similar to class 4, with the 3rd class the most similar of all (maybe overfitting).
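To make that kind of reading less dependent on eyeballing the plot, something like the sketch below lists the off-diagonal cells of a confusion matrix, largest first. It assumes rows are actual classes and columns are predicted classes, and the numbers in `example` are made up purely for illustration; they are not your results.

```python
import numpy as np


def most_confused(cm, min_count=1):
    """List (actual, predicted, count) for off-diagonal cells, largest first."""
    cm = np.asarray(cm)
    pairs = [
        (int(a), int(p), int(cm[a, p]))
        for a, p in np.argwhere(cm >= min_count)
        if a != p  # skip the diagonal, which holds the correct predictions
    ]
    return sorted(pairs, key=lambda t: t[2], reverse=True)


# Made-up 3-class matrix for illustration only (rows = actual, columns = predicted).
example = [
    [8, 0, 2],
    [1, 5, 4],
    [0, 1, 9],
]
print(most_confused(example))
```

If one predicted class keeps showing up at the top of that list for several actual classes, that is the class the others are being mistaken for.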
Having built 70+ image classifier apps, I found the most difficult was a wristwatch classifier. It worked well for 2 classes, but I couldn’t get it to work for 70 classes because, unless you read the name on the dial, many watches look very similar.
Heat maps pose some problems that images don’t. Here are some links that may help generate some ideas.
Also, segmentation seems to be used a lot with heat maps.