I want to apply the DeViSE approach to my own dataset and my own vocabulary, instead of ImageNet and WordNet, to get domain-specific outputs. In essence, given an image of, say, a watch, the output should be its specifications. To do that, I would replace the ImageNet images with watch images and the WordNet words with specification words. Is this feasible?
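
To make the question concrete, here is a rough sketch of the setup I have in mind, written with NumPy. It assumes, as I understand DeViSE, that a linear map `M` projects image features into the word-embedding space and is trained with a hinge rank loss against the label embeddings. All names here (`devise_hinge_rank_loss`, `predict_label`, `spec_vecs`, the margin of 0.1) are my own placeholders, and the watch features and spec-word vectors would come from whatever image model and word-embedding model I end up using.

```python
import numpy as np

def devise_hinge_rank_loss(image_feat, M, spec_vecs, true_idx, margin=0.1):
    """Hinge rank loss for one image (hypothetical helper, not a library API).

    image_feat: (feat_dim,) visual feature of one watch image
    M:          (embed_dim, feat_dim) trainable linear projection
    spec_vecs:  (num_specs, embed_dim) embeddings of the spec words
    true_idx:   row in spec_vecs holding the correct spec word
    """
    projected = M @ image_feat          # image mapped into the word space
    scores = spec_vecs @ projected      # dot-product similarity per spec word
    # Penalise every wrong word that scores within `margin` of the true word.
    violations = np.maximum(0.0, margin - scores[true_idx] + scores)
    violations[true_idx] = 0.0          # the true word is not compared to itself
    return float(violations.sum())

def predict_label(image_feat, M, spec_vecs):
    """Return the index of the spec word closest to the projected image."""
    projected = M @ image_feat
    return int(np.argmax(spec_vecs @ projected))

# Toy example with made-up 2-D vectors, only to show the shapes involved.
spec_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])   # two hypothetical spec words
M = np.eye(2)                                    # untrained placeholder projection
watch_feat = np.array([1.0, 0.0])                # placeholder image feature
print(devise_hinge_rank_loss(watch_feat, M, spec_vecs, true_idx=0))
print(predict_label(watch_feat, M, spec_vecs))
```

At prediction time I would return the spec word or words whose vectors lie nearest to the projected image, so my question is really whether this kind of mapping can work when the labels are specification terms rather than object names.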