Machine Learning applied to Distributed Systems

Hello all,

I am reaching out to the community for help with a problem I am facing. Apart from suggestions about relevant literature and similar work, I would be interested in knowing whether my current thinking is correct.

I am currently working on a problem of predicting resource utilisation for Hadoop-based services and jobs. For example, by looking at certain metrics, can you predict what resource utilisation will be this weekend? Is any form of scaling required to mitigate a possible shortage of resources? I do not have any labels, only metrics about CPU, I/O, memory, network, and other Hadoop internals.

Problem Statement: Given these metrics, predict/forecast CPU, network, I/O, etc. at certain future instants in time and inform the relevant users.

Probable Solution Approach: Flatten each of the metrics (CPU, memory, network, I/O, etc.) and, using the points from t-n to t (where t is the current time instant and t-n is an instant in the past), predict the points from t+1 to t+n.
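To make the windowing idea concrete, here is a minimal sketch (not the author's code; `make_windows`, the window sizes, and the synthetic CPU trace are illustrative assumptions). Note that in this framing the "labels" are simply future values of the same series, carved out of the data itself:

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Slide a window over a 1-D metric series: each sample uses the
    last n_in observations (..., t-1, t) as inputs and the next
    n_out observations (t+1, ..., t+n_out) as targets."""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i : i + n_in])
        y.append(series[i + n_in : i + n_in + n_out])
    return np.array(X), np.array(y)

# Toy example: a synthetic CPU-utilisation trace (hypothetical data).
cpu = np.sin(np.linspace(0, 20, 200)) * 30 + 50
X, y = make_windows(cpu, n_in=12, n_out=3)
print(X.shape, y.shape)  # (186, 12) (186, 3)
```

Any regressor (linear model, gradient-boosted trees, an LSTM, etc.) could then be fit on `X` against `y`.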

However, the problem with the above approach is that I do not have labels to train the algorithm against. I would be glad if anyone could suggest possible ways of approaching the problem.

Below are some features:
CPU (idle, iowait), MEM (free, total, swap), DISKIO (read_time, write_time, read_bytes, write_bytes), etc. I have flattened the respective features into a single feature vector. There are other Hadoop-based metrics that I cannot share here, but their overall nature is similar.
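A minimal sketch of the flattening step, assuming the metric names above (the sample values are made up). The key point is fixing the key order once, so every timestep yields a vector with consistent positions:

```python
import numpy as np

# Hypothetical per-timestep readings, mirroring the features above.
sample = {
    "cpu_idle": 62.0, "cpu_iowait": 1.5,
    "mem_free": 4.2e9, "mem_total": 1.6e10, "mem_swap": 0.0,
    "diskio_read_time": 120.0, "diskio_write_time": 340.0,
    "diskio_read_bytes": 5.1e8, "diskio_write_bytes": 9.7e8,
}

# Fix a canonical key order so that the same position always means
# the same metric, then flatten into a single feature vector.
feature_order = sorted(sample)
vector = np.array([sample[k] for k in feature_order])
print(vector.shape)  # (9,)
```

Since the metrics live on very different scales (bytes vs. percentages), normalising each feature before modelling is usually worthwhile.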

I am relatively new to the domain, so please overlook any naivety in the question or the proposed approach.

Finally, can anyone suggest good Python resources (courses, papers, blogs, etc.) for anomaly detection and machine learning applied to systems? Also, let me know if more detail is required, as I am aware I may not have explained everything clearly.