I understand that it is something like x1w1+x2w2+…+xn*wn, where matrix of weights could be a ‘filter’ learned, but I don’t quite understand why this could be interpreted as probability of being an eight, or some ‘proxy for probability being an eight’?
If the w matrix used here would have high values in cells matching pixels that are black in images of “8” but white in other images, the sum of multiplying each pixel by its corresponding weight would be higher the more the image resembles an image of an “8”.
That is a big “if” here, and by just writing this function definition, we have done nothing to initialize w that way.
Before machine learning, one would manually initialize the values of w to behave that way. What we do in the machine learning era is initialize w randomly, and then iteratively refine it so that applying pr_eight on images of eight gets a high value while other images get a low value.
The name of the function might be slightly misleading - it is not the formula itself that makes this operation find images of “8” - it is the value of w (so that exact same formula with another w can classify something else).