Feature extraction from images for further classification

I often run into very simple image classification tasks with small datasets. Say you want to distinguish between images of a single object on a white background and regular photos, and you only have 100-200 examples to work with, maybe fewer.

If you use a regular ML algorithm like an SVM directly on the pixels, it completely ignores the spatial structure of the image, and it will also overfit given the large number of pixels and the small number of images. You could use something like PCA to decorrelate the pixels, but last time I tried, that overfit as well.

Now an argument can be made for transfer learning, but given the simplicity of the task, it seems that one could just take the convolutional (feature extraction) part of something like ResNet and build an SVM on top of it. The problem here could be both that ResNet expects a fixed image size (something we can fix), and that it might not have been trained on the types of images we have at all, even though they’re really simple. I understand that we can just pick a layer higher up. An alternative approach could be to train a convolutional autoencoder on tons of random images, and then use the encoder as a feature extractor.

Are there any projects that do this, which I could just use for general image feature extraction? It feels like this is something that could be done in a completely general way, where the user just supplies a simple classifier at the end (which I guess could also be part of such a tool). Any idea if such a thing exists? Or any tips on how to approach this problem in a different way?

edit: A somewhat related thought that just came to mind would be to do this as a clustering task rather than classification. Yet again we have a bunch of images we want to cluster, and clustering based on raw pixels will most likely not work, yet any form of feature extraction using convolutions would seem to make sense, right?
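
To make the clustering idea concrete, here is a minimal sketch. It assumes the convolutional features have already been extracted into a NumPy array with one row per image, and it uses scikit-learn's k-means purely as an example; the function name `cluster_features` and the choice of k-means are my own assumptions, not something a particular tool prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def cluster_features(features: np.ndarray, n_clusters: int, seed: int = 0) -> np.ndarray:
    """Return one cluster id per row of `features` (one row per image)."""
    # L2-normalize each feature vector so that distances depend on the
    # direction of the vector rather than on its magnitude.
    unit = normalize(features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(unit)
```

Something like `cluster_features(features, n_clusters=2)` would then give one label per image. The number of clusters still has to be chosen by hand or by some model-selection heuristic.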

I can partly answer your question, as this is what I did to classify a set of medical images. I started with AlexNet, extracted the feature map from the network (you choose one somewhere around the last/deepest layers before the classification layer), and then ran it through a random forest. It gave me much higher accuracy, which is really cool, especially since AlexNet was trained on natural images, so we are changing domain here. I think I can still improve it by first training the network on my images and then extracting the features. Not sure, though.