Classifiers and a "Don't know" category

You should look at MC Dropout: MC Dropout and practical ideas for it. It is very easy to implement: you just leave dropout enabled at inference time and run several predictions on the same input.
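To make the idea concrete, here is a minimal sketch in plain Python with no deep learning framework. The tiny network, its weights, and the names `forward` and `mc_dropout_predict` are all made up for illustration. In a framework such as PyTorch or Keras you would get the same effect by keeping the dropout layers in training mode while predicting.

```python
import math
import random

# Hypothetical toy weights: 2 inputs -> 4 hidden units -> 3 classes.
W1 = [[0.5, -1.2], [1.5, 0.3], [-0.7, 0.8], [0.2, 1.1]]
W2 = [[1.0, -0.5, 0.3, 0.8], [-0.6, 1.2, 0.4, -0.3], [0.1, 0.2, -1.0, 0.9]]


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, rng, drop_rate=0.5, dropout=True):
    """One forward pass; dropout stays active when dropout=True."""
    # ReLU hidden layer.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if dropout:
        # Inverted dropout: zero each unit with probability drop_rate and
        # rescale the survivors so the expected activation is unchanged.
        keep = 1.0 - drop_rate
        hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return softmax(logits)


def mc_dropout_predict(x, n_passes=100, seed=0):
    """Run several stochastic passes and return every probability vector."""
    rng = random.Random(seed)
    return [forward(x, rng) for _ in range(n_passes)]


samples = mc_dropout_predict([1.0, 2.0], n_passes=100)
```

Each entry of `samples` is one class-probability vector, and the entries differ from pass to pass because a different random subset of hidden units is dropped each time.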

Running several predictions this way gives you a distribution of predictions for each input, so you can see how confident the neural network is: if the predictions agree, it is fairly sure, and if they are spread out, it is not. You can then pick a confidence threshold below which you output “Don’t know”.
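The thresholding step could look roughly like the sketch below. It assumes you already have one class-probability vector per stochastic pass. The function name `predict_or_abstain`, the sample values, and the 0.8 threshold are hypothetical; the threshold would need tuning on validation data. Averaging the probabilities and checking the top class is only one possible confidence measure; the variance across passes or the entropy of the averaged probabilities are common alternatives.

```python
from statistics import mean


def predict_or_abstain(prob_samples, threshold=0.8):
    """Average sampled class probabilities; abstain if the top one is too low."""
    n_classes = len(prob_samples[0])
    mean_probs = [mean(p[c] for p in prob_samples) for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: mean_probs[c])
    if mean_probs[best] < threshold:
        return "Don't know", mean_probs
    return best, mean_probs


# Hypothetical samples from three stochastic passes over a 3-class problem.
confident = [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.95, 0.03, 0.02]]
unsure = [[0.60, 0.30, 0.10], [0.20, 0.70, 0.10], [0.40, 0.35, 0.25]]

label_a, _ = predict_or_abstain(confident)  # passes agree on class 0
label_b, _ = predict_or_abstain(unsure)     # passes disagree, so abstain
```

In the second case the passes disagree about the top class, so the averaged probability of the best class stays well below the threshold and the function abstains.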

I would also recommend this video from Vincent Warmerdam, "How to Constrain Artificial Stupidity", which covers this type of thing.