Because TensorFlow's GPU execution is non-deterministic, I get a different F1 score each time I run the same model with the same hyper-parameters and the same input data. After running the model several times, should I choose the run with the maximum F1 score on the validation set?
What are you choosing it for? Picking the run with the best validation F1 means you are, in essence, fitting to the validation dataset: the score of the selected run becomes an optimistic estimate of how the model will do on new data. In most cases this is probably not what you want.
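To make the "fitting to the validation set" point concrete, here is a minimal simulation sketch. It does not train anything; it assumes (hypothetically) that every run has the same true F1 and that the run-to-run differences on validation and test are independent Gaussian noise. The function name `simulate` and all the numbers (`true_f1`, `noise`, `n_runs`) are made up for illustration, and real runs will not match these assumptions exactly.

```python
import random
import statistics


def simulate(n_runs=10, true_f1=0.80, noise=0.02, n_trials=2000, seed=0):
    """Average validation F1 of the selected run vs. its F1 on unseen data.

    Assumption (illustrative only): each run's measured F1 is the same
    true F1 plus independent Gaussian noise on validation and on test.
    """
    rng = random.Random(seed)
    best_val, test_of_best = [], []
    for _ in range(n_trials):
        val = [true_f1 + rng.gauss(0, noise) for _ in range(n_runs)]
        test = [true_f1 + rng.gauss(0, noise) for _ in range(n_runs)]
        # Select the run with the highest validation F1.
        i = max(range(n_runs), key=val.__getitem__)
        best_val.append(val[i])
        test_of_best.append(test[i])
    return statistics.mean(best_val), statistics.mean(test_of_best)


best_val_mean, test_mean = simulate()
print(f"mean validation F1 of the selected run: {best_val_mean:.3f}")
print(f"mean test F1 of the selected run:       {test_mean:.3f}")
```

Under these assumptions, the validation F1 of the selected run sits above the true value, while its F1 on unseen data does not. If the goal is to report performance, the mean and standard deviation over several runs is usually a more honest number; if you do select a run, score it on a separate test set that played no part in the selection.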