How to map embeddings from image space to text space

Hello all. I’ve been struggling to map 2048D image embeddings extracted from a ResNet50 to 300D fastText embeddings, similar to the DeViSE implementation shown here. I was able to do this successfully on Imagenette using fastai2, but on my custom dataset the similarity between the embeddings predicted by the model and the actual embeddings just refuses to increase. From my understanding, a well-trained model should yield a similarity of roughly >90% on the validation set.
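
For reference, here is a minimal sketch of the kind of mapping head and objective I have in mind, assuming the similarity being tracked is cosine similarity as in the DeViSE setup; the layer sizes and function names are illustrative, not my exact code:

```python
# Minimal sketch (assumptions: cosine similarity as the objective/metric,
# illustrative layer sizes) of a head mapping 2048D ResNet50 features
# to 300D fastText vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingMapper(nn.Module):
    def __init__(self, in_dim=2048, out_dim=300, hidden=1024, p=0.25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def cosine_loss(pred, targ):
    # minimise 1 - cosine similarity between predicted and target embeddings
    return (1 - F.cosine_similarity(pred, targ, dim=-1)).mean()

def cosine_sim(pred, targ):
    # the "similarity" number quoted below, reported as a percentage
    return F.cosine_similarity(pred, targ, dim=-1).mean()
```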

Things I have tried:

  • A 2-layer neural net with 25% dropout and an lr of 1e-2 picked from the lr_finder. This yielded a similarity of around 56% and stayed stuck there even after training for 10 epochs. The optimizer function was Adam with decoupled_wd, and I trained with the one_cycle policy, which almost always works for me.
  • The same setup and network as above, but with Adam without decoupled_wd. This got stuck at around 58%.
  • A 3-layer neural net with a leaky ReLU activation, using the same setups as above. This increased the similarity to 64%, where it got stuck for about 10 epochs.
  • I repeated the three steps above without a scheduler (using just fit) and with the sgdr scheduler, but there was no improvement. A rough fastai sketch of these variants follows this list.
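
Roughly, those training variants look like the following in fastai v2. The DataLoaders here uses dummy tensors standing in for my precomputed (image embedding, text embedding) pairs, the model shown is the 3-layer LeakyReLU variant with illustrative hidden sizes, and cosine_loss/cosine_sim are the same as in the sketch above:

```python
# Rough sketch of the training variants listed above (fastai v2 assumed).
from functools import partial
import torch
import torch.nn as nn
import torch.nn.functional as F
from fastai.vision.all import *

def cosine_loss(pred, targ): return (1 - F.cosine_similarity(pred, targ, dim=-1)).mean()
def cosine_sim(pred, targ): return F.cosine_similarity(pred, targ, dim=-1).mean()

# Dummy data standing in for the precomputed (image_emb, text_emb) pairs.
x, y = torch.randn(1000, 2048), torch.randn(1000, 300)
pairs = list(zip(x, y))
dls = DataLoaders.from_dsets(pairs[:800], pairs[800:], bs=64)

# 3-layer variant with LeakyReLU and 25% dropout (hidden sizes are illustrative).
model = nn.Sequential(
    nn.Linear(2048, 1024), nn.LeakyReLU(), nn.Dropout(0.25),
    nn.Linear(1024, 512),  nn.LeakyReLU(), nn.Dropout(0.25),
    nn.Linear(512, 300),
)

learn = Learner(dls, model, loss_func=cosine_loss, metrics=cosine_sim,
                opt_func=Adam)  # fastai's Adam applies decoupled wd by default
# opt_func=partial(Adam, decouple_wd=False)   # the non-decoupled Adam variant

lr = 1e-2                          # picked from learn.lr_find()
learn.fit_one_cycle(10, lr)        # one_cycle policy
# learn.fit(10, lr)                # no scheduler
# learn.fit_sgdr(3, 1, lr_max=lr)  # SGDR restarts
```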

What surprises me, however, is that the exact same model that shows no improvement on the 2048D image to 300D text mapping performs excellently on a 1568D text to 300D text mapping. I’m not sure how to solve this, so I’m wondering if anybody has a solution or a suggestion.

I have tried to explain everything I have done, but if more explanation is needed on my end, please do tell and I’ll add a bit more context. Thanks.