Student Project: Deep Reader

This is the start of a thread for a project I’d like to embark on.

I’d like to create a data product that lets a user’s finger translate text on the page of a book to speech in real time; think of a child double-tapping a word on the page of a book and having it read aloud. I believe this is possible by combining methods from scene text detection, action detection, and text-to-speech. If we can build this, we could create an augmented educational experience that removes screens from the process of learning to read. The dream is to develop the deep learning algorithms for this with the library, embed the models on edge devices with GPUs and cameras (think along the lines of Nvidia’s Jetson Nano [1]), and let them loose on the world. Given the current state and coming future of edge devices, I can foresee equipping the most high-need schools with these “deep readers” at a relatively low cost.
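To make the combination concrete, here is a minimal, hypothetical sketch of the glue logic between the three components. All of the component implementations below are stand-in stubs I made up for illustration (`detect_words`, `speak`, the `WordBox` type are not from any real library): a real system would swap in an actual scene text detector, a fingertip/tap detector, and an on-device TTS engine.

```python
# Hypothetical "deep reader" pipeline sketch: a camera frame plus a detected
# double-tap location are turned into spoken text. Every component here is a
# stub; real versions would use a scene text detector, an action/tap
# detector, and a TTS engine running on the edge device.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class WordBox:
    text: str                       # recognized word
    box: Tuple[int, int, int, int]  # x, y, width, height on the page image

def detect_words(frame) -> List[WordBox]:
    """Stub scene-text detector: a real one would run OCR on the frame
    and return word bounding boxes; here the 'frame' is already a list."""
    return frame

def word_under_finger(words: List[WordBox],
                      tap: Tuple[int, int]) -> Optional[WordBox]:
    """Return the word whose box contains the double-tap location, if any."""
    tx, ty = tap
    for w in words:
        x, y, bw, bh = w.box
        if x <= tx < x + bw and y <= ty < y + bh:
            return w
    return None

def speak(text: str) -> str:
    """Stub TTS: a real version would call an on-device speech engine."""
    return f"speaking: {text}"

def on_double_tap(frame, tap: Tuple[int, int]) -> Optional[str]:
    """Glue logic: detect words, find the tapped one, read it aloud."""
    word = word_under_finger(detect_words(frame), tap)
    return speak(word.text) if word else None

# Example: two words on a "page", with a tap landing inside the second box.
page = [WordBox("hello", (10, 10, 60, 20)), WordBox("world", (80, 10, 60, 20))]
print(on_double_tap(page, (95, 15)))  # -> speaking: world
```

The point of the sketch is just that the three models stay decoupled: each one can be prototyped and benchmarked separately before any of them is ported to the edge device.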

Note: I have never embarked on a complex computer vision project and am hoping to get feedback on the idea as well as any pointers on where to start.

My first goals are proof-of-concept implementations of scene text detection and action detection, each tackled separately (using the library).

Any and all comments are welcome, as are collaborators of any skill level; I’d consider myself an intermediate coder and a novice-to-intermediate machine learning practitioner.


You can use the camera on your phone to do something similar with Google Translate.

Hey @baz, thanks for that. There are a few programs/apps that can do something like this, which tells me it’s in the realm of possibility. The difference with what I want to create is that you wouldn’t have to use a phone or computer screen at all: just a book light with a camera pointing down at the pages and an edge device to handle the magic… at least that’s the idea.

I’m working on a POC for action detection now; I’m hoping to publish a blog post about it by the weekend.