I am thinking of creating an American Sign Language (ASL) detection system using videos. I have two options at the moment.
One is using 3D convolutions and classifying each sign as its corresponding word. My concern with this approach is that the vocabulary will be quite large, meaning I will need a softmax layer with thousands of categories (one for each word). That would in turn mean more resources to train (because I'll need a deeper network).
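To make the 3D-convolution option concrete, here is a minimal sketch, assuming clips of 16 RGB frames at 112x112 and a hypothetical vocabulary of 2000 signs; all layer sizes are illustrative, not tuned:

```python
import torch
import torch.nn as nn

class Sign3DCNN(nn.Module):
    """Illustrative 3D-CNN classifier: one clip in, one word (logit vector) out."""

    def __init__(self, num_classes=2000):  # num_classes = vocabulary size (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),   # convolve over time AND space
            nn.ReLU(),
            nn.MaxPool3d(2),                              # halve T, H, W
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # (B, 64, 1, 1, 1)
        )
        # This is the layer the concern is about: one output per vocabulary word.
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        f = self.features(x).flatten(1)
        return self.classifier(f)  # logits; softmax/cross-entropy applied outside

model = Sign3DCNN(num_classes=2000)
logits = model(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 2000])
```

Note that the classifier head itself is cheap (a 64x2000 linear layer); the training cost concern comes mainly from needing enough data and depth to separate thousands of visually similar classes.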
A CNN-LSTM may be a decent choice too. I was planning on taking a sequence of frames, passing each frame through a CNN to get a time-distributed sequence of features, and then predicting the class of that sequence with an LSTM.
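That pipeline (per-frame CNN features, then an LSTM over the feature sequence) can be sketched roughly as follows; the frame size, feature dimension, and hidden size below are placeholder assumptions:

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Illustrative CNN-LSTM: a shared 2D CNN is applied to every frame
    (the "time-distributed" part), and an LSTM consumes the feature sequence."""

    def __init__(self, num_classes=2000, feat_dim=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(          # per-frame feature extractor
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (batch, time, channels, height, width)
        B, T = x.shape[:2]
        # Fold time into the batch dim so the same CNN runs on every frame,
        # then unfold back into a (B, T, feat_dim) sequence for the LSTM.
        feats = self.cnn(x.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])       # classify from the last time step

model = CNNLSTM(num_classes=2000)
out = model(torch.randn(2, 10, 3, 64, 64))
print(out.shape)  # torch.Size([2, 2000])
```

Folding time into the batch dimension is the PyTorch equivalent of Keras's `TimeDistributed` wrapper.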
I was thinking of doing this in real time, so I am not sure how to segment the live video feed so that each clip contains exactly one sign. My intuition is that if multiple signs land in the same sequence of frames, the CNN will just pick one over the other.
With LSTMs, I am not sure whether it is even possible to get multiple words out of a single sequence.
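For what it's worth, one standard way to let a recurrent model emit several words from one unsegmented frame sequence (borrowed from speech recognition) is CTC loss, which learns the alignment itself instead of requiring pre-segmented clips. A tiny sketch with made-up sizes:

```python
import torch
import torch.nn as nn

# Illustrative shapes only: 20 time steps, batch of 2, vocabulary of 100 words
# plus a reserved "blank" symbol at index 0 (CTC's no-sign-yet token).
T, B, V = 20, 2, 100
log_probs = torch.randn(T, B, V + 1).log_softmax(2)   # stand-in for LSTM outputs

# Two target "sentences" of word ids (padded to the same length; the trailing
# 0 in the second row is padding beyond its stated length, not a real label).
targets = torch.tensor([[3, 7, 1],
                        [4, 2, 0]])
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.tensor([3, 2])

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # a finite scalar; the network would be trained to minimize it
```

At inference time, greedy or beam-search decoding over the per-frame outputs (collapsing repeats and blanks) yields a variable-length word sequence, so no explicit segmentation of the live feed is needed.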
What do you think a strong architecture for this type of problem would be? Thanks in advance!