Hello folks,
I attended fastai Part 1 (2019) from October to December last year, and did a project, Deep Visual Semantic Embedding Models for Mobile, inspired by https://github.com/fastai/fastai/blob/master/courses/dl2/devise.ipynb, using a modified fastai 1.0.33.
The project uses semantic information from word embeddings to augment visual deep learning models so they can correctly identify a large number of image classes, including previously unseen ones. I also used this work toward a course project at Stanford (AI Graduate Certificate).
The pets lessons (1 & 6) were super helpful in building the model and interpreting the results:
https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson1-pets.ipynb
https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson6-pets-more.ipynb
GitHub: https://github.com/swarna04/cs230
Stanford project report: http://cs230.stanford.edu/projects_fall_2018/reports/12449389.pdf
I also presented this work at an internal machine learning conference at Adobe, and it was very well received. Thank you @jeremy and team for the amazing library and the super useful docs.
Project Details
I used SqueezeNet 1.1, a lightweight mobile architecture, to train a model with pretrained fastText word vectors so that it learns semantic relationships between image labels and maps images into a rich semantic embedding space. The model takes an image as input and outputs a 300-D feature vector. Using an efficient cross-platform similarity search library such as nmslib, this output vector can be used for image similarity search over the model's predictions, or for label prediction by looking up the nearest fastText word vector among the known image labels in the dataset. The model also generalizes to zero-shot use cases: given the fastText word vector of a previously unseen text label, a nearest neighbor search over the model's predictions retrieves matching images.
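For anyone curious how the pieces fit together, here is a minimal sketch of the general approach (not the exact project code; see the GitHub repo for that). It assumes the `fasttext` and `nmslib` packages, pretrained `cc.en.300.bin` vectors, and a hypothetical three-word label set:

```python
import fasttext                      # pip install fasttext
import nmslib                        # pip install nmslib
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models

# 1. SqueezeNet 1.1 backbone with a custom 300-D regression head:
#    the model learns to predict word vectors instead of class logits.
backbone = models.squeezenet1_1(pretrained=True).features
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(512, 300),             # SqueezeNet 1.1 features end in 512 channels
)
model = nn.Sequential(backbone, head)

# Cosine loss, as in the DeViSE notebook: push each predicted vector
# toward the fastText vector of its label.
def cosine_loss(pred, target):
    return (1 - nn.functional.cosine_similarity(pred, target)).mean()

# 2. Target vectors come from pretrained fastText embeddings.
ft = fasttext.load_model("cc.en.300.bin")
labels = ["beagle", "persian_cat", "pug"]        # hypothetical label set
label_vecs = np.stack([ft.get_word_vector(l) for l in labels])

# 3. Index vectors with nmslib for fast cosine-similarity search.
#    Indexing label vectors gives label prediction; indexing predicted
#    image vectors gives image-to-image and text-to-image search.
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(label_vecs)
index.createIndex({"post": 2})

# Label prediction: nearest label vector to the model's output.
with torch.no_grad():
    img_vec = model(torch.randn(1, 3, 224, 224)).squeeze(0).numpy()
ids, dists = index.knnQuery(img_vec, k=3)
print([labels[i] for i in ids])

# Zero-shot text-to-image: query an index of predicted *image* vectors
# with the fastText vector of an unseen label, e.g.
# ft.get_word_vector("dalmatian").
```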
The project showed that such visual-semantic models can perform image-to-image, image-to-text, and text-to-image associations with reasonable accuracy while using less than 7% of the disk space and training parameters of larger models such as ResNet34 (used as the baseline).
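As a rough sanity check on the size claim, you can compare the stock torchvision definitions directly (the project's models use different heads, so treat the numbers as approximate):

```python
import torchvision.models as models

squeeze = models.squeezenet1_1()
resnet = models.resnet34()

n_sq = sum(p.numel() for p in squeeze.parameters())   # ~1.2M
n_rn = sum(p.numel() for p in resnet.parameters())    # ~21.8M
print(f"SqueezeNet 1.1: {n_sq / 1e6:.2f}M parameters")
print(f"ResNet34:       {n_rn / 1e6:.2f}M parameters")
print(f"ratio:          {n_sq / n_rn:.1%}")           # roughly 6%, consistent with <7%
```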