Anomaly(ish) Detection

I’m working with web request data where each row is a “hit”, for example someone accessing http://example.com/.

The columns of the dataset are almost all categorical, such as:

  • ID: unique ID for the request
  • URL (e.g. http://example.com/blog): website accessed
  • Referrer (e.g. http://example.com/): accessed from where
  • IP Address (e.g. 127.0.0.1): user IP
  • HTTP Status Code (e.g. 200): status code could be 200, 400, 404, etc.
  • User agent (e.g. Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0): info about user’s browser, operating system
  • Timestamp (e.g. 2017-10-13 17:34:25.209047): when the request happened

My end goal is to implement some kind of model that can identify patterns in the web requests and flag the ones that are abnormal. My intuition is that the frequency of requests (treating the timestamps as time series data) can indicate abnormal activity when compared with the normal baseline. Headless browsers can also show up in the user agent, and that would not be considered normal. Status codes can indicate that someone is probing around, for example a large number of 404s.

  1. Is it actually possible to do this using ML/DL? If so:
  2. How would one preprocess this kind of data? I’m familiar with one-hot encoding for ordinary categorical variables, but it seems odd for this kind of data.
  3. What kind of model would be suitable if this is possible? PCA, Autoencoders?

You’ll need manual feature engineering to extract features from the URLs, the IP subnets, and the user agents. Then you can use a neural net on those features.
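
As a rough illustration of that kind of feature extraction, here is a minimal sketch using only the Python standard library. The field names (`url`, `referrer`, `ip`, `status`, `user_agent`), the `/24` subnet size, the `HEADLESS_MARKERS` list, and the `extract_features` helper itself are all assumptions made for the example, not something taken from the dataset; it also assumes IPv4 addresses.

```python
from ipaddress import ip_network
from urllib.parse import urlparse

# Hypothetical substrings that suggest a headless browser; extend as needed.
HEADLESS_MARKERS = ("headlesschrome", "phantomjs")


def extract_features(hit: dict) -> dict:
    """Turn one raw request row into a flat dict of simple features."""
    url = urlparse(hit["url"])
    referrer = urlparse(hit.get("referrer") or "")
    user_agent = hit["user_agent"].lower()
    # Collapse the address to its /24 network (assumes IPv4).
    subnet = ip_network(f"{hit['ip']}/24", strict=False)
    return {
        "host": url.netloc,
        # Number of non-empty path segments, e.g. "/blog" -> 1.
        "path_depth": len([part for part in url.path.split("/") if part]),
        "has_query": bool(url.query),
        "same_site_referrer": referrer.netloc == url.netloc,
        "subnet_24": str(subnet),
        # 2 for 2xx, 4 for 4xx, and so on.
        "status_class": hit["status"] // 100,
        "is_headless": any(marker in user_agent for marker in HEADLESS_MARKERS),
    }
```

The low-cardinality outputs (status class, booleans) can be fed to a model more or less directly, while high-cardinality ones such as the host or subnet would still need some encoding, for instance one-hot or an embedding.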

Thanks for the tip! I figured I’d need to do something like this to extract the useful info. The part I’m struggling with is using a neural net on an unlabelled dataset. I’ve read about autoencoders, but since they are trained to reconstruct their input, I’m not sure where the anomaly prediction would come in.

Any recommendations on the unsupervised route?

You’ll need to create some kind of labels to tell the model which kinds of anomalies count as interesting.
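
One possible way to bootstrap such labels, purely as a sketch, is to write a few hand-made rules based on the signals mentioned earlier in the thread (lots of 404s, headless user agents) and flag IPs that trip them. The `weak_labels` helper, the field names, and the thresholds `max_404_ratio` and `min_hits` are all hypothetical and would need tuning against real traffic.

```python
from collections import defaultdict


def weak_labels(hits, max_404_ratio=0.5, min_hits=5):
    """Return {ip: bool}, where True means the IP tripped a hand-written rule."""
    per_ip = defaultdict(lambda: {"total": 0, "not_found": 0, "headless": False})
    for hit in hits:
        stats = per_ip[hit["ip"]]
        stats["total"] += 1
        stats["not_found"] += hit["status"] == 404
        stats["headless"] |= "headless" in hit["user_agent"].lower()

    labels = {}
    for ip, stats in per_ip.items():
        # Rule 1 (assumed threshold): enough traffic, and mostly 404s.
        probing = (
            stats["total"] >= min_hits
            and stats["not_found"] / stats["total"] > max_404_ratio
        )
        # Rule 2: any request with a headless-looking user agent.
        labels[ip] = bool(probing or stats["headless"])
    return labels
```

Labels produced this way are noisy and only encode what the rules already know, so they are best treated as a starting point to review by hand rather than as ground truth.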