I have an interesting anti-fraud use case that I'm trying to get some guidance on.
The task is to recognize a user by his keyboard activity patterns while he is filling in online forms or writing free text such as emails.
Factors that are important:
- how fast he types
- delays between keystrokes
- delays between key-down and key-up events
- usage of control keys (DEL, Backspace, Arrows, etc…)
- words and sequences of words
So here we want to consider user’s “favorite” words + timing elements.
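To make the timing factors above concrete, here is a rough sketch of how they could be computed from a keystroke log. The event format (key, key-down time, key-up time in milliseconds), the set of control key names and the function name are all assumptions made for illustration, not an existing API.

```python
from statistics import mean

# Hypothetical event format: (key, key_down_ms, key_up_ms), ordered by key-down time.
CONTROL_KEYS = {"Backspace", "Delete", "ArrowLeft", "ArrowRight", "ArrowUp", "ArrowDown"}


def timing_features(events):
    """Summarise one typing session as a small dict of features."""
    if len(events) < 2:
        raise ValueError("need at least two keystrokes")
    # Hold time: delay between key-down and key-up of the same key.
    holds = [up - down for _, down, up in events]
    # Gap: delay between one key-down and the next key-down.
    gaps = [b[1] - a[1] for a, b in zip(events, events[1:])]
    duration_s = (events[-1][2] - events[0][1]) / 1000.0
    controls = sum(1 for key, _, _ in events if key in CONTROL_KEYS)
    return {
        "keys_per_second": len(events) / duration_s,
        "mean_hold_ms": mean(holds),
        "mean_gap_ms": mean(gaps),
        "control_key_ratio": controls / len(events),
    }


session = [("h", 0, 80), ("i", 150, 240), ("Backspace", 400, 460), ("o", 600, 700)]
features = timing_features(session)
```

Means are only the simplest summary; per-key or per-key-pair statistics would probably be more distinctive, at the cost of needing more data per user.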
Any pointers on how to tackle such task?
I’d love to come up with something simple that works and then build from there.
I wonder if you could start with something simple: the errors each user makes while typing. Record the raw input, errors included, and compare it to the submitted text.
For example, I made numerous errors while typing this, but I backspaced and retyped - so you could compare all of the keystrokes I made to the final submitted keystrokes.
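A minimal sketch of that comparison, assuming a simplified log that contains only printable characters plus "Backspace", with the cursor always at the end of the text (no arrow keys or selections). The function name and log format are made up for illustration.

```python
def replay(keys):
    """Rebuild the final text from raw keystrokes and count the corrections."""
    text = []
    backspaces = 0
    for key in keys:
        if key == "Backspace":
            backspaces += 1
            if text:
                text.pop()  # remove the last character, as Backspace would
        else:
            text.append(key)
    return "".join(text), backspaces


# "helo", one Backspace, then "lo" ends up as "hello".
raw = list("helo") + ["Backspace"] + list("lo")
final_text, corrections = replay(raw)
correction_ratio = corrections / len(raw)
```

The ratio of corrections to total keystrokes could then be one more per-user feature, alongside which characters tend to get deleted.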
There are at least two companies in my country (Israel) that do this kind of stuff. I don't know if they use the keyboard, but they do use phone and mouse input. The names are bioCatch and securedTouch.
The field is called behavioral biometrics, afaik.
I would proceed in this way:
- I like the idea of starting with a simple working model and building on it to capture more complexity, like Jeremy does it. So use logistic regression with the input features: speed, delays (between keystrokes and between key-down and key-up), and usage of control keys. Let's call these features1. This model will probably not do a satisfactory job on its own.
- Next, to capture the word-related features (specifically favorite words and sequences), get their embeddings (GloVe, word2vec). Add the embeddings for all the words and sequences into one single embedding. What I mean is, if for a particular user "home", "food" and "that is amazing" are the favorite words/phrases, add the embeddings for "food", "home", "that", "is" and "amazing" into one single embedding (as Jeremy did in today's lecture for Memory Networks to encode a sentence). Include this as a new feature and train a logistic regressor, which might not work great. Next, try a neural network with dense layers/BN blocks.
- Now you can include the top n (a fixed value, say 10) favorite words/phrases for a user. Concatenate all n embeddings to get n*size_of_embedding features, which are included along with features1. Train the neural net. As for favorite phrases, you can add the embeddings of all the words in the phrase into one single embedding. A phrase can then be included in the features if it falls in the top n favorites.
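The embedding steps above could look roughly like this. The 3-dimensional vectors are made-up toy values standing in for real GloVe/word2vec embeddings, and the helper names and the zero-padding for users with fewer than n favorites are assumptions for this sketch.

```python
# Toy 3-dimensional vectors standing in for GloVe/word2vec; the values are made up.
EMBEDDINGS = {
    "home": [1.0, 0.0, 0.0],
    "food": [0.0, 1.0, 0.0],
    "that": [0.5, 0.5, 0.0],
    "is": [0.0, 0.0, 1.0],
    "amazing": [0.5, 0.0, 0.5],
}
DIM = 3


def phrase_vector(phrase):
    """Sum the word vectors of a phrase; unknown words contribute zeros."""
    total = [0.0] * DIM
    for word in phrase.lower().split():
        vec = EMBEDDINGS.get(word, [0.0] * DIM)
        total = [t + v for t, v in zip(total, vec)]
    return total


def user_features(features1, favorites, n=2):
    """Concatenate features1 with the vectors of the top-n favorites, zero-padded."""
    out = list(features1)
    for i in range(n):
        out += phrase_vector(favorites[i]) if i < len(favorites) else [0.0] * DIM
    return out


# features1 here is (speed, mean hold, mean gap); the favorites are ranked by frequency.
row = user_features([5.7, 82.5, 200.0], ["home", "that is amazing"], n=2)
```

Each row then has len(features1) + n*DIM entries, which is the fixed-width input a logistic regression or dense network needs.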
I plan to build this and see how it performs. I am planning to use this dataset: http://www.cs.cmu.edu/~keystroke/DSL-StrongPasswordData.csv
There is another interesting related problem: if you can listen to the sound of a keyboard, can you deduce which keys were typed? It seems like you can: https://people.eecs.berkeley.edu/~tygar/papers/Keyboard_Acoustic_Emanations_Revisited/preprint.pdf
That’s a clever idea! That would be an interesting feature to include - or at least ensure that the basic data is there to allow a model to build such a feature.
I actually wonder if such a task could be addressed by a CNN.
The input is essentially 1D data but with 2 channels:
channel 1: keys typed
channel 2: delay between n and n+1 keystrokes
main_input = Input(shape=(2, 140), dtype='int32', name='keystrokes')
That assumes a maximum of 140 characters per input.
Can we then just apply Convolution1D() to that sequence?
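Before any convolution, the raw keystrokes have to be turned into that fixed 2 x 140 shape. A minimal sketch of one way to do it, assuming a hypothetical vocab dict that maps key names to positive integer ids, with 0 reserved for padding and unknown keys:

```python
MAX_LEN = 140


def encode(keys, delays_ms, vocab):
    """Return [key_ids, delays], each padded or truncated to MAX_LEN.

    delays_ms[i] is the delay between keystroke i and keystroke i + 1,
    so it normally has one entry fewer than keys.
    """
    ids = [vocab.get(k, 0) for k in keys][:MAX_LEN]
    gaps = [int(d) for d in delays_ms][:MAX_LEN]
    ids += [0] * (MAX_LEN - len(ids))      # channel 1: keys typed
    gaps += [0] * (MAX_LEN - len(gaps))    # channel 2: delays between keystrokes
    return [ids, gaps]


sample = encode(list("hi"), [120], {"h": 1, "i": 2})
```

One caveat: as far as I know, Keras 1D convolutions expect the input as (steps, channels), so this would likely need transposing to (140, 2) before being fed to the layer.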
Wow, that just might work!
The intuition here is that CNNs are great at catching patterns, and that's exactly what we're looking for here:
patterns in keys associated with certain delays, patterns in sequences of keys, excessive use of control keys (arrows, backspace, delete), patterns of consistent delays or at least consistent delays associated with certain keys, etc…
The beauty is that we don't need to tell the CNN what to do - it may make all the discoveries for us!
If you watch the NIPS-2016 GAN Tutorial, Goodfellow remarks (I'm paraphrasing): I'm in the school of Deep Learning so I can't do any manual feature engineering.
@gesman, I'm curious about your experience on this project… have you considered trying an RNN model, given that the order of keystrokes might matter and your data sounds like a time series?
My feeling is that this is not a task for an RNN, where sequence and memory are important, but rather a task for a CNN, where patterns and relationships between neighboring pieces of data are important.
I'm getting closer to running the first tests here.