Incorrect "test" prediction when the test dir contains only 1 file

I’ve trained the classifier from lesson1 to detect whether a document has a stamp. Accuracy is 1.0. When I put a single file in the test dir and run prediction, the prediction is incorrect. But if I add one or more other docs to test, I get correct predictions for all the files, including the one that was previously classified incorrectly.

How can the number of test images affect the correctness of the result? Thanks!

Following as I am having this issue with a snake classifier as well!