Student Project: Deep Reader

This is the start of a thread for a project I’d like to embark on.

I’d like to create a data product that lets a user’s finger translate text on the page of a book to speech in real time; think of a child double-tapping a word on the page of a book and having it read aloud. I believe this is possible by combining methods from scene text detection, action detection, and text-to-speech. If we can build this, we could create an augmented educational experience that removes screens from the process of learning to read. The dream is to develop the deep learning algorithms for this with the library, embed the models on edge devices with GPUs and cameras (think along the lines of Nvidia’s Jetson Nano [1]), and let them loose on the world. Given the current state and coming future of edge devices, I can foresee equipping the most high-need schools with these “deep readers” at a relatively low cost.
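To make the combination concrete, here is a minimal, hypothetical sketch of the glue logic between the three components. All of the component implementations below are stand-in stubs I made up for illustration (`detect_words`, `speak`, the `WordBox` type are not from any real library): a real system would swap in an actual scene text detector, a fingertip/tap detector, and an on-device TTS engine.

```python
# Hypothetical "deep reader" pipeline sketch: a camera frame plus a detected
# double-tap location are turned into spoken text. Every component here is a
# stub; real versions would use a scene text detector, an action/tap
# detector, and a TTS engine running on the edge device.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class WordBox:
    text: str                       # recognized word
    box: Tuple[int, int, int, int]  # x, y, width, height on the page image

def detect_words(frame) -> List[WordBox]:
    """Stub scene-text detector: a real one would run OCR on the frame
    and return word bounding boxes; here the 'frame' is already a list."""
    return frame

def word_under_finger(words: List[WordBox],
                      tap: Tuple[int, int]) -> Optional[WordBox]:
    """Return the word whose box contains the double-tap location, if any."""
    tx, ty = tap
    for w in words:
        x, y, bw, bh = w.box
        if x <= tx < x + bw and y <= ty < y + bh:
            return w
    return None

def speak(text: str) -> str:
    """Stub TTS: a real version would call an on-device speech engine."""
    return f"speaking: {text}"

def on_double_tap(frame, tap: Tuple[int, int]) -> Optional[str]:
    """Glue logic: detect words, find the tapped one, read it aloud."""
    word = word_under_finger(detect_words(frame), tap)
    return speak(word.text) if word else None

# Example: two words on a "page", with a tap landing inside the second box.
page = [WordBox("hello", (10, 10, 60, 20)), WordBox("world", (80, 10, 60, 20))]
print(on_double_tap(page, (95, 15)))  # -> speaking: world
```

The point of the sketch is just that the three models stay decoupled: each one can be prototyped and benchmarked separately before any of them is ported to the edge device.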

Note: I have never embarked on a complex computer vision project and am hoping to get feedback on the idea as well as any pointers on where to start.

My first goals are proof-of-concept implementations of scene text detection and action detection, each tackled separately (using the library).

Any and all comments are welcome, as are collaborators of any skill level; I’d consider myself an intermediate coder and a novice-to-intermediate machine learning practitioner.


You can use the camera on your phone to do something similar with Google Translate.

Hey @baz, thanks for that. There are a few programs/apps that can do something like this, which tells me it’s in the realm of possibility. The difference with what I want to create is that you wouldn’t have to use a phone or computer screen at all: just a book light with a camera pointing down at the pages and an edge device to handle the magic… at least that’s the idea.

I’m working on a POC for action detection now; I’m hoping to publish a blog post about it by the weekend.