Resize() for item transforms: values other than 224?

Hey y’all, I’m currently reading through the Vision Tutorial, and I see that initially a value of 224 was passed to the “Resize()” transform. Then later, we decide to pass a larger value of 460. I’m guessing the trade-off of passing a larger value is that the images will be higher resolution (and therefore contain more data which will help the model predict more accurately), but that this will make training and prediction slower as a result.
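For reference, here's roughly the pattern I'm asking about (a minimal sketch, not the tutorial's exact code; the dataset path and labelling function are placeholders):

```python
from fastai.vision.all import *

# Rough sketch of the later version: Resize(460) runs per image on the CPU,
# then aug_transforms(size=224) augments and crops the batch down to 224 on the GPU.
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,                    # placeholder labelling function
    item_tfms=Resize(460),
    batch_tfms=aug_transforms(size=224),
).dataloaders(path_to_images, bs=64)       # path_to_images is a placeholder
```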

I have a few questions:

  1. Why 224 and 460? Are these standard numbers that have proven to be useful by many users over time?

  2. Are there other standard numbers that I should know about as I’m trying different hyper-parameters?

  3. How can I determine the dimensions of the largest image in my dataset, in case I want to push this hyper-parameter to the max and go as large as I possibly can?

Yes, the larger an image, the more fine-grained the patterns it contains, and thus the more accurate one’s model will be. Beware, though: there are diminishing returns, and it is better to scale the input resolution and the model size together than to scale either one alone.

224, I believe, comes from the AlexNet paper, which cropped 224 X 224 patches out of 256 X 256 images as a form of data augmentation (there are more details in the paper; I highly suggest reading it). To the best of my knowledge, there is no specific reason they settled on 224 other than that it offered a compromise between memory/time and accuracy.

Other popular dimensions, like 460 X 460, have the same story: there is no deep rationale behind them per se, except that for most datasets (most being the operative word, since some applications, like medical imaging, often require very high-resolution images, whereas in other cases, like digit recognition, you can get away with 28 X 28 images), they’re the sweet spot: not so large that training becomes a pain, yet large enough to contain valuable information. A few common ones are 96 X 96, 128 X 128, 156 X 156, 160 X 160, 192 X 192, 224 X 224, 256 X 256, 288 X 288, 300 X 300, 396 X 396, 460 X 460, 512 X 512, 800 X 800, 1024 X 1024, and so on (yet another point in favour of these particular values, many of which are multiples of 32, is that powers of two and their multiples are extremely common in deep learning and computer science in general).

Finding the maximum dimensions in your dataset can be done easily with ImageMagick, but please bear in mind you should test out smaller sizes first to ensure you’re not wasting computation time & power for a 0.001% increase in score.
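If you’d rather stay in Python, something along these lines with PIL works too (a minimal sketch; the directory path and extension set are placeholders you’ll need to adapt to your dataset):

```python
from pathlib import Path
from PIL import Image

# Scan a directory tree of images and report the largest width and height seen.
data_dir = Path("path/to/your/images")   # placeholder path
exts = {".jpg", ".jpeg", ".png"}          # placeholder extension set

max_w = max_h = 0
for img_path in data_dir.rglob("*"):
    if img_path.suffix.lower() in exts:
        with Image.open(img_path) as im:
            w, h = im.size
        max_w, max_h = max(max_w, w), max(max_h, h)

print(f"Largest width: {max_w}px, largest height: {max_h}px")
```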

Have a great weekend!


Super-clear answer, thank you!


And as for why that is a thing: powers of 2 are much faster to compute than other numbers, because they fill all the cores on our GPUs properly :smiley:


Good point, @muellerzr. :+1:

In line with BobMcDear’s post, I believe the specific dimensions are mostly convention, and the whole “power of 2” performance argument seems more theoretical than observable. My understanding is that image size is like any other parameter that you could, in theory, optimize for your given model and hardware. Given the vast number of other variables that can be optimized, it may be preferable, for practical reasons, to stick to the commonly used powers of 2.

For the curious, here are a few good reads related to batch size and powers of 2.