Using custom transformers in image classification

Hey guys,
I was wondering whether it is possible to apply a transformer such as FashionCLIP to image classification.

from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
processor = AutoProcessor.from_pretrained("patrickjohncyh/fashion-clip")
model = AutoModelForZeroShotImageClassification.from_pretrained("patrickjohncyh/fashion-clip")
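
As far as I understand it, a CLIP-style model like this classifies by embedding the image and one text prompt per label, then comparing them with cosine similarity and a softmax. Here is a rough sketch of that scoring step in plain Python. The embeddings are made-up toy vectors (not real model outputs), the function names are my own, and the scale of 100 is an assumption based on the usual CLIP logit scale.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probs(image_emb, text_embs, scale=100.0):
    """Softmax over scaled image-to-text similarities (scale is an assumed value)."""
    logits = [scale * cosine(image_emb, t) for t in text_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: one prompt per candidate label, with invented 3-d embeddings.
labels = ["a photo of a dress", "a photo of a sneaker", "a photo of a handbag"]
image_emb = [0.9, 0.1, 0.2]
text_embs = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.3], [0.2, 0.3, 0.9]]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)
```

With the real model, the embeddings would come from the processor and model above instead of hand-written lists.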

Is there any way to incorporate this to get better results?

I would like to use a custom transformer with MobileNet or YOLO classification models to achieve better results while staying lightweight. I am currently using ResNet for image classification, but the models are around 40 MB.
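
For a sense of scale, here is the back-of-the-envelope arithmetic I am using for model size: parameter count times bytes per parameter. This is only a sketch. It assumes float32 weights (4 bytes each) and ignores file-format overhead, and the parameter counts are approximate published figures for the standard architectures, not measurements of my own models.

```python
def size_mb(num_params, bytes_per_param=4):
    """Rough on-disk size in MB (1 MB = 1e6 bytes), assuming dense float32 weights."""
    return num_params * bytes_per_param / 1e6

# Approximate parameter counts (assumed, rounded public figures).
approx_params = {
    "ResNet-18": 11_700_000,
    "MobileNetV2": 3_500_000,
    "CLIP ViT-B/32 (the FashionCLIP base)": 151_000_000,
}

for name, n in approx_params.items():
    print(f"{name}: ~{size_mb(n):.0f} MB")
```

By this estimate a CLIP-sized model would be much larger than what I have now, so I am mainly interested in ways to use it without shipping the whole thing.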