Predicting Rare Events from Sequences (System Outage from Alarms)

Hello,

I have a problem I am interested in, and I am thinking about it in the context of a neural network sequence model (any appropriate variation of RNN):

Data:
I have a lot of (categorical) alarm data with a few fields and many unique values, and a small amount of (categorical) ‘incident’ data, which really only shares the timestamp field with the alarm data. The data is messy and overly precise (unique ID names etc.), BUT let’s assume my data is perfect and has all the information I might need at the right granularity.

Task:
I want to predict the likelihood of an ‘incident’ occurring within a future time-frame (say, within one hour from now), given the sequence of alarm events over a specific preceding duration.

Evaluation/Target-Metric:
The BIG POINT is that incidents are extremely rare relative to alarms, say roughly 1000 alarms per incident, so what I want to predict almost never occurs. I am stumped as to how a pattern might be learnt for it: typical sequence predictors are trained on the basis of accuracy, and a model that always predicts ‘no incident’ gets the next step right more than 99% of the time, so that is the tack it takes. What kind of metric is it possible to use? How can I have one which doesn’t bother too much about True Negatives (i.e. it is not an incident and the model predicts it is not), and instead focuses its efforts on True Positives (i.e. it is an incident and the model predicts it is)?
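To make the concern concrete, here is a minimal sketch of metrics that ignore True Negatives (precision, recall, F-beta), plus a class-weighted cross-entropy that penalises missed incidents more heavily. This is one common option rather than a definitive answer. The toy labels, `beta=2.0` and the `pos_weight` argument are illustrative assumptions, not values taken from my data.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = incident)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_fbeta(y_true, y_pred, beta=2.0):
    """Precision, recall and F-beta; none of them uses the TN count.
    beta > 1 weights recall above precision (assumed choice)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta

def weighted_bce(y_true, p_pred, pos_weight):
    """Binary cross-entropy where positive examples count pos_weight times."""
    eps = 1e-12
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(pos_weight * t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Toy data (assumed): 1 incident in 1000 windows.
y_true = [0] * 999 + [1]
always_negative = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print("accuracy of always-negative model:", accuracy)
print("precision, recall, F2:", precision_recall_fbeta(y_true, always_negative))
```

The always-negative model scores very high accuracy but zero recall, which is exactly the failure described above. A threshold-free summary such as the area under the precision-recall curve would follow the same idea.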

Embedding:
How do I go about creating an appropriate embedding to pass into a sequence network so that it captures the feature values? How do I turn my categorical data into something numeric? Or do I not, and instead just randomly initialise values with an arbitrary number of features? But then how does the network learn anything about a particular feature value for prediction? Is it just that I make a matrix whose rows are the unique feature values, and whenever a value appears its row is combined with the weights, so that the vector for that feature value is updated in backpropagation? That would seem to mean I am essentially training on only one real feature, since the remaining columns hold randomly initialised values that have not encoded any information from the actual categorical values.
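For reference, this is how I currently picture the lookup-table idea, as a minimal pure-Python sketch. The field names (`device`, `severity`), the embedding size of 4, and the helpers `build_vocab`, `init_embedding` and `embed_alarm` are all hypothetical. In a real framework each table would be a trainable parameter, so every column of a looked-up row would receive gradients, not just one.

```python
import random

def build_vocab(values):
    """Map each distinct category to an integer row index; 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab)
    return vocab

def init_embedding(num_rows, dim, rng):
    """Randomly initialised table: one row of `dim` numbers per category."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def embed_alarm(alarm, vocabs, tables):
    """Look up one row per field and concatenate them into a single input vector."""
    vec = []
    for field in sorted(vocabs):
        idx = vocabs[field].get(alarm.get(field), 0)  # unseen value -> row 0
        vec.extend(tables[field][idx])
    return vec

# Hypothetical alarms with two categorical fields.
alarms = [
    {"device": "router-a", "severity": "major"},
    {"device": "router-b", "severity": "minor"},
    {"device": "router-a", "severity": "minor"},
]
rng = random.Random(0)
vocabs = {f: build_vocab(a[f] for a in alarms) for f in ("device", "severity")}
tables = {f: init_embedding(len(vocabs[f]), 4, rng) for f in vocabs}

# One vector per alarm; this sequence is what would be fed to the RNN.
sequence = [embed_alarm(a, vocabs, tables) for a in alarms]
print(len(sequence), len(sequence[0]))
```

The same category always maps to the same row, so whatever training does to that row is shared by every occurrence of that value in the data.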

Modelling:
How do I actually model the data in terms of the predictors and the response variable? Do I just add a 1 after any alarm which has an incident at the next time-step (say 1 sec after it, or whatever the relevant time-step is)? That would seem to overweight that alarm as causing or correlating with the incident, when it could in fact have nothing to do with it. Or do I add a 1 to all alarms within a particular window of time before the incident? The important thing is the entire sequence pattern over a particular window of time, so how do I set up my target vectors to achieve that? How do I choose the number of alarms (the context-window size of events) to pass in at each time-step through the network? In addition, what are the possible mechanisms for maintaining a lot of historical strength (i.e. recent events are not overweighted and the earlier sequence, going back a long way, still holds sway in the prediction)?
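One framing I am considering, sketched below under assumptions, is to label whole windows rather than individual alarms: at each prediction time `t`, the input is every alarm in `[t - lookback, t)` and the target is 1 if any incident falls in `[t, t + horizon)`. The function name `make_windows`, the parameter names and the toy timestamps are hypothetical, and this may not be the right framing.

```python
from bisect import bisect_left

def make_windows(alarm_times, incident_times, lookback, horizon, step):
    """Build (t, (lo, hi), label) samples.

    (lo, hi) is the half-open index range of alarms with timestamps in
    [t - lookback, t); label is 1 if an incident occurs in [t, t + horizon).
    """
    alarm_times = sorted(alarm_times)
    incident_times = sorted(incident_times)
    samples = []
    if not alarm_times:
        return samples
    t = alarm_times[0] + lookback  # first time with a full lookback window
    end = alarm_times[-1]
    while t <= end:
        lo = bisect_left(alarm_times, t - lookback)
        hi = bisect_left(alarm_times, t)
        i = bisect_left(incident_times, t)  # first incident at or after t
        label = 1 if i < len(incident_times) and incident_times[i] < t + horizon else 0
        samples.append((t, (lo, hi), label))
        t += step
    return samples

# Toy timestamps in seconds (assumed): one alarm every 10 s, one incident at 45 s.
alarm_times = [0, 10, 20, 30, 40, 50, 60]
incident_times = [45]
samples = make_windows(alarm_times, incident_times, lookback=20, horizon=15, step=10)
for t, (lo, hi), label in samples:
    print(t, alarm_times[lo:hi], label)
```

With this setup no single alarm is blamed for the incident; the whole preceding window carries the label, and `lookback` and `horizon` become hyperparameters to tune.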

Any advice on how to think about and model this problem would be much appreciated.
Many thanks,
V

If you haven’t already, look into the field of “anomaly detection”.