Classifying users by the way they use the keyboard

gesman · March 21, 2017, 1:25am

I have an interesting anti-fraud use case that I’m trying to get some guidance on.

The task is to recognize a user by his keyboard activity patterns while he is filling in online forms or writing free text such as emails.

Factors that are important:

how fast he types
delays between keystrokes
delays between key-down and key-up events.
usage of control keys (DEL, Backspace, Arrows, etc…)
words and sequences of words
etc…

So here we want to consider user’s “favorite” words + timing elements.
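
The timing factors in the list above could be computed roughly as follows. This is only a sketch: the event format (key name, key-down time, key-up time in milliseconds), the set of control-key names, and the function name `timing_features` are all assumptions made for illustration, not something defined in this thread.

```python
from statistics import mean

# Assumed control-key names (hypothetical; adapt to whatever your logger emits).
CONTROL_KEYS = {"Backspace", "Delete", "ArrowLeft", "ArrowRight", "ArrowUp", "ArrowDown"}


def timing_features(events):
    """events: list of (key, down_ms, up_ms) tuples ordered by key-down time."""
    if len(events) < 2:
        raise ValueError("need at least two keystrokes")
    # Dwell: how long each key is held (key-down to key-up).
    dwell = [up - down for _, down, up in events]
    # Flight: delay between consecutive key-down events.
    flight = [b[1] - a[1] for a, b in zip(events, events[1:])]
    duration_s = (events[-1][2] - events[0][1]) / 1000.0
    control = sum(1 for key, _, _ in events if key in CONTROL_KEYS)
    return {
        "keys_per_second": len(events) / duration_s,
        "mean_dwell_ms": mean(dwell),
        "mean_flight_ms": mean(flight),
        "control_key_ratio": control / len(events),
    }


# Made-up sample: "h", "i", a Backspace, then "o".
sample = [("h", 0, 80), ("i", 150, 220), ("Backspace", 400, 460), ("o", 600, 700)]
features = timing_features(sample)
print(features)
```

Means are the simplest summary; per-key or per-key-pair statistics would probably separate users better, at the cost of needing more data per user.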

Any pointers on how to tackle such a task?
I’d love to come up with something simple that works and then build from there.

Gleb

ostegm · March 21, 2017, 3:46am

I wonder if you could start with something simple: the errors each user makes while typing. Record the raw input, errors included, and compare it to the submitted text.

For example, I made numerous errors while typing this, but I backspaced and retyped - so you could compare all of the keystrokes I made to the final submitted keystrokes.
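
A rough sketch of that comparison, assuming the keystroke log is a list of single characters plus the string "Backspace", and that the cursor always sits at the end of the text (no arrow keys or selections). The names `replay` and `correction_rate` are hypothetical.

```python
def replay(keystrokes):
    """Replay a keystroke log; return (final_text, number_of_deleted_chars)."""
    buffer = []
    deleted = 0
    for key in keystrokes:
        if key == "Backspace":
            if buffer:  # a Backspace on empty input deletes nothing
                buffer.pop()
                deleted += 1
        else:
            buffer.append(key)
    return "".join(buffer), deleted


def correction_rate(keystrokes):
    """Fraction of typed characters that were later deleted."""
    final_text, deleted = replay(keystrokes)
    typed = sum(1 for key in keystrokes if key != "Backspace")
    return final_text, (deleted / typed if typed else 0.0)


# Typing "teh", backspacing twice, then typing "he" yields "the".
log = list("teh") + ["Backspace", "Backspace"] + list("he")
final_text, rate = correction_rate(log)
print(final_text, rate)
```

The correction rate could then be one more per-user feature next to the timing ones.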

shgidi · March 21, 2017, 5:58am

There are at least two companies in my country (Israel) that do this kind of stuff. I don’t know if they use the keyboard, but they do use phone and mouse input. The names are bioCatch and securedTouch.
The field is called behavioral biometrics, afaik.

kishore_p_v · March 21, 2017, 6:06am

I would proceed in this way:

I like the idea of starting with a simple working model and building on it to capture more complexity, the way Jeremy does. So use logistic regression with these input features: speed, delays (between keystrokes and between key-down and key-up), and usage of control keys. Let’s call these features1. This model alone would probably not do a satisfactory job.
Next, to capture the word-related features (specifically favorite words and sequences), get their embeddings (GloVe, word2vec). Add the embeddings for all the words and sequences into one single embedding. What I mean is: if a particular user’s favorite words/phrases are “home”, “food” and “that is amazing”, add the embeddings for “food”, “home”, “that”, “is” and “amazing” into one single embedding (as Jeremy did in today’s lecture for Memory Networks to encode a sentence). Include this as a new feature and train a logistic regressor, which might not work great. Next, try a neural network with dense layers/BN blocks.
Now you can include the top n favorite words/phrases for a user (n is a fixed value, say 10). Concatenate all n embeddings to get n*size_of_embedding features, and include them along with features1. Train the neural net. For favorite phrases, add the embeddings of all the words in the phrase into one single embedding. A phrase can then be included in the features if it falls in the top n favorites.
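
The embedding steps above might look like this in code. It is a minimal sketch with made-up 3-dimensional toy embeddings standing in for GloVe or word2vec vectors; `TOY_EMBEDDINGS`, `phrase_embedding` and `user_vector` are hypothetical names, and zero-padding users with fewer than n favorites is my own assumption.

```python
# Toy embeddings, invented for illustration; real ones would come from GloVe or word2vec.
TOY_EMBEDDINGS = {
    "home": [1.0, 0.0, 0.0],
    "food": [0.0, 1.0, 0.0],
    "that": [0.5, 0.5, 0.0],
    "is": [0.0, 0.0, 1.0],
    "amazing": [0.0, 0.5, 0.5],
}
DIM = 3


def phrase_embedding(phrase):
    """Sum the embeddings of all words in a phrase; unknown words count as zeros."""
    vec = [0.0] * DIM
    for word in phrase.split():
        for i, value in enumerate(TOY_EMBEDDINGS.get(word, [0.0] * DIM)):
            vec[i] += value
    return vec


def user_vector(features1, favorites, n):
    """Concatenate features1 with the embeddings of the top-n favorites."""
    vec = list(features1)
    for phrase in favorites[:n]:
        vec.extend(phrase_embedding(phrase))
    # Zero-pad so every user gets len(features1) + n * DIM features.
    missing = n - min(n, len(favorites))
    vec.extend([0.0] * (missing * DIM))
    return vec


vector = user_vector([5.7, 77.5], ["home", "that is amazing"], n=3)
print(len(vector), vector)
```

The fixed length matters because both logistic regression and a dense network expect the same number of inputs for every user.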

I plan to construct the same and see how it performs. I am planning to use this dataset: http://www.cs.cmu.edu/~keystroke/DSL-StrongPasswordData.csv
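
A small sketch of reading a file in that format, assuming it has a `subject` column plus per-key timing columns prefixed `H.` (hold), `DD.` (key-down to key-down) and `UD.` (key-up to key-down). The rows below are made up to show the shape only; they are not taken from the real file, and `mean_hold_by_subject` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows in the assumed column layout; NOT real data from the dataset.
SAMPLE = """subject,sessionIndex,rep,H.period,DD.period.t,UD.period.t,H.t
s002,1,1,0.10,0.40,0.30,0.09
s002,1,2,0.12,0.36,0.24,0.11
s003,1,1,0.08,0.20,0.12,0.07
"""


def mean_hold_by_subject(csv_text):
    """Average key hold time per subject, over all columns starting with 'H.'."""
    holds = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        hold_columns = [name for name in row if name.startswith("H.")]
        holds[row["subject"]].append(mean(float(row[name]) for name in hold_columns))
    return {subject: mean(values) for subject, values in holds.items()}


result = mean_hold_by_subject(SAMPLE)
print(result)
```

For the real file you would pass the downloaded text instead of `SAMPLE`, and most likely keep every timing column as a separate feature rather than averaging them.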

thunderingtyphoons · March 21, 2017, 5:38pm

There is another interesting related problem: if you can listen to the sound of a keyboard, can you deduce which keys were typed? It seems like you can: https://people.eecs.berkeley.edu/~tygar/papers/Keyboard_Acoustic_Emanations_Revisited/preprint.pdf

jeremy · March 21, 2017, 5:42pm

That’s a clever idea! That would be an interesting feature to include - or at least ensure that the basic data is there to allow a model to build such a feature.

gesman · March 21, 2017, 7:47pm

I actually wonder if such a task could be addressed by a CNN.
The input is essentially 1D data but with 2 channels:
channel 1: keys typed
channel 2: delay between n and n+1 keystrokes

main_input = Input(shape=(2, 140), dtype='int32', name='keystrokes')
That is considering 140 chars max per input.

Can we then just apply Convolution1D() to that sequence?
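
To make the two-channel idea concrete, here is a plain-Python sketch of the encoding and of what a single 1D filter would compute. One caveat: Keras 1D convolution layers take input as (steps, channels), so the shape for this layout would likely need to be (140, 2) rather than (2, 140). The integer key codes, the zero padding, and the names `encode` and `conv1d` are assumptions for illustration; in a real model the key channel would more likely be one-hot or embedded than fed in as raw integers.

```python
MAX_LEN = 140  # max keystrokes per input, as in the post


def encode(keystrokes, max_len=MAX_LEN):
    """keystrokes: list of (key_code, delay_ms_to_next_key).

    Returns max_len rows of [key_code, delay], truncated or zero-padded.
    """
    rows = [[code, delay] for code, delay in keystrokes[:max_len]]
    rows += [[0, 0] for _ in range(max_len - len(rows))]
    return rows


def conv1d(rows, kernel):
    """Valid 1D convolution (cross-correlation) over steps, summing across channels.

    kernel: one list of per-channel weights for each step in the window.
    """
    width = len(kernel)
    out = []
    for start in range(len(rows) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel, weight in enumerate(kernel[offset]):
                total += rows[start + offset][channel] * weight
        out.append(total)
    return out


# Hypothetical encoding: character codes for letters, 8 standing in for Backspace.
typed = [(ord("h"), 120), (ord("i"), 95), (8, 300)]
encoded = encode(typed)

# A hand-picked filter that ignores the key channel and takes the
# difference between one delay and the next.
delay_change = [[0, 1], [0, -1]]
out = conv1d(encoded, delay_change)
print(out[:4])
```

A trained convolution layer would learn many such filters instead of having them written by hand, which is the appeal of the approach.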

jeremy · March 21, 2017, 8:04pm

Wow that just might work!..

gesman · March 21, 2017, 8:08pm

The intuition here is that CNNs are great at catching patterns, and that’s exactly what we’re looking for here.
Patterns in keys associated with certain delays, patterns in sequences of keys, excessive control key usage (arrows, backspace, delete), patterns of consistent delays or at least consistent delays associated with certain keys, etc…

The beauty is that we don’t need to tell the CNN what to do - it may make all the discoveries for us!

kelvin · March 21, 2017, 8:20pm

If you watch the NIPS 2016 GAN Tutorial, Goodfellow remarks (I’m paraphrasing): I’m in the school of Deep Learning so I can’t do any manual feature engineering.

xinxin.li.seattle · April 13, 2017, 7:08pm

@gesman, I’m curious about your experience on this project… have you considered trying an RNN model, given that the order of keystrokes might matter and your data sounds like a time series?

gesman · April 13, 2017, 7:39pm

My feeling is that this is not a task for an RNN, where sequence and memory are important, but rather a task for a CNN, where patterns and relationships between neighboring pieces of data are important.

I’m getting closer to running the first tests here.

Gleb