Attention Models in Computer Vision

Since we have used and covered attention models for NLP and sequences in the course, it seems like an attention model should work well for imaging problems too. There are some older papers (which Enlitic tech staff said were useless for anything but toy problems) with ideas like Spatial Transformer Networks (STNs).

I was wondering if anyone has a feeling for what has been done with simpler hard or soft attention models, which are quite easy to implement in fully convolutional networks. Particularly for pre-trained models, this seems to avoid some of the issues mentioned with STNs, because you aren’t trying to simultaneously improve a resampling step (which drastically affects the input to the classifier) and the classification task.
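For anyone who wants something concrete to poke at, here is a minimal sketch of the kind of attention meant above. It is framework-agnostic NumPy, not code from any of the papers or notebooks mentioned in this thread. It assumes a frozen pre-trained backbone has already produced a feature map of shape (H, W, C), and that a single 1x1 convolution (here just a weight vector `w` and bias `b`, both hypothetical) scores each spatial location. The function names `soft_attention_pool` and `hard_attention_pool` are made up for illustration.

```python
import numpy as np


def soft_attention_pool(features, w, b=0.0):
    """Soft attention over a (H, W, C) feature map.

    `w` (shape (C,)) and `b` play the role of a 1x1 conv that scores each
    spatial location; both are illustrative assumptions.
    """
    scores = features @ w + b                # (H, W) score per location
    scores = scores - scores.max()           # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()        # softmax over all H*W locations
    # Attention-weighted sum of the feature vectors -> one (C,) descriptor
    pooled = (weights[..., None] * features).sum(axis=(0, 1))
    return pooled, weights


def hard_attention_pool(features, w, b=0.0):
    """Hard attention: keep only the feature vector at the top-scoring location."""
    scores = features @ w + b
    idx = np.unravel_index(np.argmax(scores), scores.shape)
    return features[idx], idx


# Stand-in for the output of a frozen pre-trained conv backbone
rng = np.random.default_rng(0)
features = rng.normal(size=(7, 7, 16))
w = rng.normal(size=16)

pooled, attn = soft_attention_pool(features, w)
picked, location = hard_attention_pool(features, w)
print(pooled.shape, attn.shape, location)
```

In a real model only `w` and `b` (plus whatever classifier sits on top of the pooled vector) would be trained, so the backbone's input is never resampled. The hard variant is not differentiable through the argmax, so it would need a different training approach than plain backprop.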

My test dataset for this problem is the Bone Age from X-Ray dataset (inspired by the winners, who manually labeled images and pre-trained a U-Net to preprocess the images and select ROIs), and my first notebook is here


You can have a look at
Attention-based Extraction of Structured Information from Street View Imagery

They use attention on images to do OCR on Street View images via an LSTM.


This is interesting. Where can I read those Enlitic tech staff opinions?


Were attention models also covered in version 2? If I remember correctly, Jeremy didn’t go in depth about attention models in version 1.