I’m working with web request data where each row is a “hit”, for example someone accessing http://example.com/.
The columns of the dataset are almost all categorical, such as:
- ID: unique ID for the request
- URL (e.g. http://example.com/blog): website accessed
- Referrer (e.g. http://example.com/): the page the request came from
- IP Address (e.g. 127.0.0.1): user IP
- HTTP Status Code (e.g. 200): status code could be 200, 400, 404, etc.
- User agent (e.g. Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0): info about user’s browser, operating system
- Timestamp (e.g. 2017-10-13 17:34:25.209047): when the request happened
My end goal is to build a model that can identify patterns in the web requests and predict which ones are abnormal. My intuition is that the frequency of requests (treating the timestamps as time series data) can indicate abnormal activity when compared with otherwise normal traffic. Headless browsers can also show up in the user agent, and that would not be considered normal. Status codes can indicate that someone is fishing around too, for example through a large number of 404s.
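To make that intuition concrete, here is a rough sketch of the kind of numeric features I have in mind: hits are grouped per IP and per time window, then summarised as a request count, a share of 404s, and a headless-browser flag. The sample rows, the 60-second window, the `HEADLESS_MARKERS` list and the `window_features` helper are all made up for illustration and are not part of my real data or pipeline.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample hits: (ip, status, user_agent, timestamp).
HITS = [
    ("127.0.0.1", 200,
     "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
     "2017-10-13 17:34:25.209047"),
    ("10.0.0.5", 404, "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/61.0",
     "2017-10-13 17:34:30.000000"),
    ("10.0.0.5", 404, "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/61.0",
     "2017-10-13 17:34:31.000000"),
    ("10.0.0.5", 200, "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/61.0",
     "2017-10-13 17:34:32.000000"),
]

# Assumption: substrings that suggest a headless browser; a real list would be longer.
HEADLESS_MARKERS = ("headlesschrome", "phantomjs")

# Fixed reference point so window boundaries do not depend on the local timezone.
EPOCH = datetime(1970, 1, 1)


def window_features(hits, window_seconds=60):
    """Aggregate hits into numeric features per (IP, time window)."""
    buckets = defaultdict(list)
    for ip, status, user_agent, timestamp in hits:
        t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f")
        window = int((t - EPOCH).total_seconds()) // window_seconds
        buckets[(ip, window)].append((status, user_agent))

    features = {}
    for key, rows in buckets.items():
        n = len(rows)
        features[key] = {
            # Request frequency within the window.
            "n_requests": n,
            # Fraction of requests in the window that returned 404.
            "share_404": sum(1 for status, _ in rows if status == 404) / n,
            # 1 if any user agent in the window matches a headless marker.
            "headless": int(any(marker in ua.lower()
                                for _, ua in rows
                                for marker in HEADLESS_MARKERS)),
        }
    return features


for (ip, window), feats in window_features(HITS).items():
    print(ip, window, feats)
```

Each (IP, window) pair becomes one numeric row, which sidesteps one-hot encoding the raw IP, URL and user agent strings, but I am not sure whether this kind of aggregation is the right starting point.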
- Is it actually possible to do this using ML/DL? If so:
- How would one preprocess this kind of data? I’m familiar with one-hot encoding for ordinary categorical variables, but that seems odd for this kind of data
- What kind of model would be suitable if this is possible? PCA, Autoencoders?