I haven’t tested it out yet, but this link provides some interesting hints, especially on adapting pretrained models by adding layers in front of them. The image size used there is also much smaller:
This link, however, says it’s slow to train, which makes me assume it might be slow at inference as well:
The following gives some intuition for choosing the right image size with respect to recognition rate:
So in my case, small images with low noise (maybe after applying a Gaussian filter) should work well, since I have pretty distinct lanes delimited by white paper.
Compression from the video streaming could be an issue, and that compression noise should be eliminated as much as possible. Flickering or changing brightness may be a problem too. So I will have to find a way to test how well the images were stabilized, and whether a Gaussian filter is sufficient to filter out “grainy” images and compression artifacts.
Filtering is probably best applied consistently: before training, and before feeding images into the trained model (i.e., at inference time).
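A minimal sketch of what I have in mind, using OpenCV’s GaussianBlur (the target size, kernel size, and sigma are placeholder values I’d still have to tune):

```python
import cv2

def preprocess(frame, size=(224, 224), ksize=(5, 5), sigma=1.0):
    """Resize and denoise a frame; apply identically at train and inference time."""
    frame = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
    # Gaussian blur to suppress compression artifacts and sensor grain
    return cv2.GaussianBlur(frame, ksize, sigma)

# The same function would be used for both training images and live frames:
# train_img = preprocess(cv2.imread("lane_frame.png"))
# live_img  = preprocess(camera_frame)
```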
Also, it seems VGG16 expects 224 × 224 RGB input images, so no improvement there compared to resnet18.
A good discussion about many relevant details is here:
My understanding:
So CNNs can deal with any image size, as long as the kernels can be applied (i.e., the image must be at least as large as the kernel’s spatial resolution, and the channel depth must match) and there are “summarizing” layers later on (e.g., global/adaptive pooling) that can deal with a variable number of inputs.
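A quick way to convince myself of this in PyTorch (a toy model, not the actual resnet18, but the same principle: convolutions are size-agnostic, and an adaptive pooling layer “summarizes” any spatial size down to a fixed one):

```python
import torch
import torch.nn as nn

# Toy CNN: AdaptiveAvgPool2d collapses whatever spatial size remains
# down to 1x1, so the linear head always sees the same number of features.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # the "summarizing" layer
    nn.Flatten(),
    nn.Linear(16, 2),          # e.g., two lane-boundary classes
)

for size in (64, 224, 300):
    out = model(torch.randn(1, 3, size, size))
    print(size, out.shape)     # always torch.Size([1, 2])
```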
Still have to see if fastai does any implicit resizing of input images.
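If it does, I’d expect it to happen via the item transforms; a sketch of how resizing is usually configured explicitly in fastai (v2), so nothing is left implicit (the path and augmentations here are placeholders for my setup):

```python
from fastai.vision.all import *

dls = ImageDataLoaders.from_folder(
    Path("data/lanes"),          # hypothetical dataset location
    item_tfms=Resize(224),       # every image resized to 224x224 before batching
    batch_tfms=aug_transforms(),
)
```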
The accuracy might suffer if image details/features are lacking that the pretrained model tries to extract. The question is whether this could also cause misidentification of features that do not exist in the input image at all, or whether it will simply not detect those features for which the details are lacking (which would be fine, since I only want to detect details that are clearly visible even at low resolution, i.e., the lane boundaries).
So image input size affects the performance of a forward pass even with an unchanged CNN model, because each kernel slides across the entire image, and the number of slide “steps” it takes depends on the image size.
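The step count follows directly from the standard convolution output-size formula; a small sanity check (the sizes are just illustrative):

```python
def conv_output_size(in_size, kernel=3, stride=1, padding=1):
    """Number of kernel positions along one spatial dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1

for size in (64, 224):
    n = conv_output_size(size)
    # kernel positions per channel = n * n, so compute scales quadratically
    print(f"{size}x{size} input -> {n * n} kernel positions")

# 64x64   ->  4096 positions
# 224x224 -> 50176 positions (~12x more work for a 3.5x larger side)
```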