Pretrained image model that allows 'large' images?

hi,
the usual size of images fed into image classification models is either 192x192 or 224x224 pixels as far as i know.

are there any pretrained models that allow larger images than that? ideally 512x512 or even 768x768?

thanks

oh, and:
am I right that I can fine-tune an image model that was trained for 224x224 pixel images on higher resolution images?
what’s the limitation usually? at what resolution would the image model start to have difficulties? or is the only limitation the GPU vram?

Hello,

Many convolutional neural networks such as ResNet can, in theory, accept inputs of any size, regardless of the resolution they were trained on. Still, be mindful of the gap between training and test resolutions: performance drops if it is too large. For instance, a network trained on 224 x 224 inputs cannot be expected to stay accurate on, say, 1024 x 1024 images, but it will likely have no problem with 320 x 320 ones (in fact, research by Touvron et al. shows that slightly increasing the resolution at inference time actually boosts accuracy on classification tasks).

Vision transformers (ViTs), on the other hand, are a different story: one of their key ingredients is positional encoding, which typically assumes fixed input dimensions and cannot handle variable resolutions out of the box. The most common solution is to interpolate the position embeddings to the desired image size and fine-tune afterwards.
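A rough sketch of that interpolation step, assuming the common ViT layout where a class token comes first and the remaining embeddings form a square patch grid (check your model's actual embedding format):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bicubically interpolate ViT position embeddings to a new grid.

    pos_embed: (1, 1 + old_grid**2, dim), class token first --
    an assumed layout, not universal across ViT implementations.
    """
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, H, W) so we can interpolate in 2D
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # back to (1, new_grid**2, dim) and re-attach the class token
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# 224 px / 16 px patches = 14x14 grid; 384 px = 24x24 grid
pe = torch.randn(1, 1 + 14 * 14, 768)
pe_resized = resize_pos_embed(pe, 14, 24)
print(pe_resized.shape)  # torch.Size([1, 577, 768])
```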

As mentioned, ViTs aside, most models do not impose any hard constraints on their input size, but if you are looking for image models with higher training resolutions, I suggest checking out timm - some models such as tf_efficientnet_l2.ns_jft_in1k have been trained on images as large as 800 x 800.

Yes, you can. If your target resolution is much greater than the training resolution, it may be worthwhile to split fine-tuning into multiple stages and gradually raise the resolution after each phase (e.g., fine-tune for some epochs at 384 x 384, then at 480 x 480, and so forth).

Memory usage grows quickly with input resolution (activation memory scales roughly quadratically with the side length for CNNs, and ViT self-attention grows even faster in the number of patches), making training at very high resolutions infeasible. There are also papers such as EfficientNetV2: Smaller Models and Faster Training and Revisiting ResNets: Improved Training and Scaling Strategies that empirically show the gains from increasing resolution saturate, and going beyond 420 x 420 is generally advised against. Additionally, the EfficientNet paper showed that width, depth, and resolution should be scaled in conjunction, meaning that if your network is small, training on large resolutions will lead to subpar performance.
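A back-of-the-envelope calculation makes the quadratic growth tangible (illustrative numbers only; real memory use also depends on depth, mixed precision, optimizer state, etc.):

```python
# Activation memory of one conv feature map grows with H * W,
# i.e. quadratically in the side length.
def activation_mb(side, channels=64, batch=32, bytes_per=4):
    """Rough size in MiB of a single (batch, channels, side, side) fp32 tensor."""
    return batch * channels * side * side * bytes_per / 2**20

for side in (224, 448, 896):
    print(side, round(activation_mb(side), 1))
# Doubling the side length quadruples the memory.
```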
