Feature extraction not of sufficient quality


I am trying to replicate what a lot of businesses are doing with image search. In particular, I am looking at one business that uses deep learning to extract features from images and then finds the K nearest neighbors for a query image.
What I have done is calculate SIFT descriptors for a dataset of 4000 images. There are about 1500 * 4000 SIFT descriptors in total. I indexed these in a system called NSG, which is almost the same as FLANN but a lot faster and more accurate. I also indexed the 4000 images with the algorithm the business uses. When comparing results, the business's algorithm does a much better job of recognizing similar images.
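
To make the SIFT setup concrete, here is a minimal sketch of one way descriptor-level matches can be turned into an image-level ranking. The post does not say how the per-descriptor neighbors are aggregated, so the voting step is an assumption, as are the names `rank_images`, `descriptors` and `image_ids`. Random vectors stand in for real 128-dimensional SIFT descriptors, the sizes are scaled down from 4000 images with about 1500 descriptors each, and an exact brute-force search stands in for NSG.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-in for the real data (4000 images, ~1500 descriptors each).
n_images, per_image, dim = 20, 50, 128
descriptors = rng.random((n_images * per_image, dim)).astype(np.float32)
# image_ids[i] is the image that descriptor i came from.
image_ids = np.repeat(np.arange(n_images), per_image)


def rank_images(query_descriptors, descriptors, image_ids, n_images):
    """Rank indexed images by how many query descriptors match them.

    Assumption: each query descriptor votes for the image that owns its
    single nearest indexed descriptor (exact search, not NSG).
    """
    # Squared Euclidean distance between every query and indexed descriptor.
    d2 = (
        (query_descriptors ** 2).sum(axis=1)[:, None]
        - 2.0 * query_descriptors @ descriptors.T
        + (descriptors ** 2).sum(axis=1)[None, :]
    )
    nearest = d2.argmin(axis=1)
    votes = np.bincount(image_ids[nearest], minlength=n_images)
    return np.argsort(-votes), votes


# Query: a slightly perturbed copy of image 7's descriptors.
noise = 0.01 * rng.standard_normal((per_image, dim)).astype(np.float32)
query = descriptors[image_ids == 7] + noise
ranking, votes = rank_images(query, descriptors, image_ids, n_images)
print("best match:", ranking[0], "votes:", votes[ranking[0]])
```

With real SIFT descriptors the vote counts are far noisier than in this toy case, so the aggregation step can matter as much as the index itself.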

My first thought was that their results are better because they use deep learning features. So I extracted features from ResNet50 without the top layer. This gave a 7×7×2048 feature map per image, which I reshaped to 49×2048 and also to 1×100352. I tested both types of features with the same NSG approximate nearest neighbor search, but again the results were not as good as the ones the business gets.
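
A minimal sketch of the two reshapes described above, followed by a nearest-neighbor query on the flattened vectors. A random array stands in for the real ResNet50 output, and an exact brute-force search stands in for NSG. The helper `knn_bruteforce` and the batch of 8 images are hypothetical and only there to show the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the ResNet50 output without the top layer:
# one 7x7x2048 feature map per image.
n_images = 8
feature_maps = rng.standard_normal((n_images, 7, 7, 2048)).astype(np.float32)

# Option 1: 49 local vectors of 2048 dimensions per image.
local_vectors = feature_maps.reshape(n_images, 49, 2048)

# Option 2: one flat vector of 7 * 7 * 2048 = 100352 dimensions per image.
flat_vectors = feature_maps.reshape(n_images, 100352)


def knn_bruteforce(index, query, k):
    """Exact k nearest neighbors by Euclidean distance (stand-in for NSG)."""
    dists = np.linalg.norm(index - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]


# Query with image 3's own vector; it should come back as its own nearest neighbor.
neighbor_ids, neighbor_dists = knn_bruteforce(flat_vectors, flat_vectors[3], k=3)
print(local_vectors.shape, flat_vectors.shape, neighbor_ids)
```

Comparing the approximate index against an exact search like this on a small subset is one way to check whether the gap comes from the features or from the index.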

What else can I try to get results similar to the algorithms used in production? What am I missing here?

The dataset I am using consists of 4000 images, mostly of bags and suitcases.