Sorry if this is obvious, but even after searching I found no definitive answer on how exactly this is handled.
Does cnn_learner() automatically apply image transformations before classifying an image, i.e., before feeding it to the input layer? Does it apply any automatic transformations before using an image for training? Or does ImageDataBunch apply any automatic transformations?
This assumes I use a simple code like this:
data = ImageDataBunch.from_folder(path, bs=16)
learn = cnn_learner(data, models.resnet18, pretrained=True, metrics=accuracy)
An image displayed from the data bunch appears unmodified in size and other properties, but this simple test does not, of course, mean the images are fed into the neural net unmodified.
Training would be invoked like this (or with similar parameters):
learn.fit_one_cycle(20, 0.01)
And inference like this:
learn.predict(img)
The documentation remains a little vague on these questions, or maybe I missed where this is explained. Could someone point me to a definitive answer?
The torchvision documentation states that pre-trained models expect 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224.
I wonder how the example at https://github.com/fastai/fastai/blob/master/examples/vision.ipynb can work, given that the MNIST images are 1x28x28 (8-bit PNGs in the folder). There must be some automatic conversion, but I can't find anything about it in the fastai documentation. The notebook loads the data like this:
data = ImageDataBunch.from_folder(path, ds_tfms=(rand_pad(2, 28), []), bs=64)
data.normalize(imagenet_stats)
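If I understand rand_pad(2, 28) correctly, it pads each image by 2 pixels and then takes a random 28x28 crop back out. A rough numpy sketch of that idea (zero padding is used here for brevity; fastai's actual padding mode may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_pad_sketch(img, padding=2, size=28):
    """Sketch of pad-then-random-crop: pad all sides by `padding`,
    then cut a random `size` x `size` crop from the padded image."""
    padded = np.pad(img, padding, mode="constant")   # 28x28 -> 32x32
    top = rng.integers(0, 2 * padding + 1)
    left = rng.integers(0, 2 * padding + 1)
    return padded[top:top + size, left:left + size]

img = rng.random((28, 28))
out = rand_pad_sketch(img)
print(out.shape)  # (28, 28)
```

The output has the same 28x28 shape as the input, which is why this transform alone cannot explain any scaling to 224x224.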
That means the transformation rand_pad(2, 28) is applied to the training set, and no transformation is applied to the validation set. As far as I can tell from the code, normalization does not affect image size at all. Since the validation set comes directly from the 1x28x28 PNG images, a scaling must happen somewhere to satisfy the requirement quoted from https://pytorch.org/docs/stable/torchvision/models.html:
All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224.
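As far as I can tell, data.normalize(imagenet_stats) is just per-channel (x - mean) / std arithmetic, so it cannot change the image size. A minimal numpy sketch using the published ImageNet channel statistics:

```python
import numpy as np

# Published ImageNet channel statistics (RGB order).
mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def normalize_sketch(img):
    """Per-channel normalization: img is a float array of
    shape (3, H, W) with values in [0, 1]."""
    return (img - mean) / std

img = np.full((3, 28, 28), 0.5)  # dummy mid-gray image
out = normalize_sketch(img)
print(out.shape)  # shape is unchanged: (3, 28, 28)
```

Note that the output shape equals the input shape, so normalization cannot be where the 28 -> 224 scaling happens.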
The PNG images are too small to meet the required minimum size of 224x224, and they are single-channel grayscale instead of RGB. Where does this transformation happen? I couldn't find it.
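I don't know where fastai does this, but the kind of conversion that would be needed looks roughly like the sketch below: replicate the single channel to 3 and upscale 28 -> 224 (conveniently a factor of 8; nearest-neighbour via np.kron). The real library may instead use interpolation or rely on the network tolerating smaller inputs:

```python
import numpy as np

def to_rgb_224(gray):
    """Sketch of the conversion a 1x28x28 MNIST image would need
    to match the (3, >=224, >=224) requirement: nearest-neighbour
    upscale by 8, then replicate the channel three times."""
    up = np.kron(gray, np.ones((8, 8)))   # 28x28 -> 224x224
    return np.stack([up, up, up])         # -> (3, 224, 224)

img = np.zeros((28, 28))
out = to_rgb_224(img)
print(out.shape)  # (3, 224, 224)
```

This is purely illustrative of what the missing step would have to do, not a claim about how fastai actually implements it.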