I am a software developer from Australia. I've been learning from and watching the fastai lessons recently, and I really liked them.
Just recently I have been tasked with optimizing resource allocation for services running on AWS Elastic Container Service (CPU utilization vs. actually allocated CPU cores), and I figured ML would be a good fit for the job.
I have been following the material in Deep Learning for Coders, but I keep getting a negative R² and a high error rate. I have tried adjusting the data, but it doesn't seem to help. I vaguely remember someone in the course running into the same problem: a negative R² meaning the model predicts worse than just using the mean?
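To illustrate what I mean, here is a minimal sketch with made-up numbers using sklearn's r2_score: always predicting the mean of the targets gives exactly R² = 0, and anything worse than that baseline goes negative.

from sklearn.metrics import r2_score
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean gives an R^2 of exactly 0
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))  # 0.0

# Predictions worse than the mean baseline give a negative R^2
print(r2_score(y_true, np.array([4.0, 1.0, 4.0, 1.0])))  # -3.0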
Any help would be much appreciated; I have attached the GitHub repo and a screenshot below.
There is a possibility that there just isn't a strong correlation.
It seems you only have a few columns, of which only one is likely to have a relation to the prediction? I'm guessing the model just can't work it out. What is the dataset?
The dataset is composed of the CPU usage for all the services in a cluster and the number of healthy nodes behind each service. However, the services in the cluster don't necessarily communicate with or depend on each other, so that's probably problem number 1. Problem number 2 is the lack of columns and of enough detail about the CPU usage. So I'm going to add a few more things, such as the number of requests at that time and the application workloads, and hopefully that will improve it.
Your model:
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1)
From the docs:
n_estimators : integer, optional (default=10)
The number of trees in the forest.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a percentage and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
It seems like your parameters are severely limiting the analysis: a single tree (n_estimators=1) capped at depth 3 can't capture much.
Also set bootstrap=True to use the full potential of the random forest. After adding more features, I would also experiment with different values for the max_features argument; look in the sklearn docs for some potential values ;-).
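As a sketch of a less constrained configuration (the values are illustrative starting points, not tuned, and X_train/y_train/X_valid/y_valid are assumed to come from your notebook):

from sklearn.ensemble import RandomForestRegressor

# More trees, no depth cap, bootstrapping on (illustrative, not tuned)
m = RandomForestRegressor(
    n_estimators=100,   # many trees average out single-tree noise
    max_depth=None,     # let trees grow until leaves are (nearly) pure
    bootstrap=True,     # each tree sees a bootstrap sample of the rows
    max_features=0.5,   # fraction of features per split; try 'sqrt' or 'log2' too
    n_jobs=-1,
)
m.fit(X_train, y_train)           # X_train/y_train assumed from your repo
print(m.score(X_valid, y_valid))  # R^2 on a held-out validation set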
I also agree with the previous posters: I think there is only a slight correlation between the descriptive features and the target variable. For a quick check, use the Spearman or Pearson correlation, both of which are implemented in the pandas library.
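For example (a sketch with a placeholder DataFrame; the column names 'healthy_nodes', 'requests', and 'cpu_usage' just stand in for your real data):

import pandas as pd
import numpy as np

# Placeholder frame; substitute your real dataset here
df = pd.DataFrame({
    'healthy_nodes': np.random.randint(1, 10, 100),
    'requests': np.random.rand(100) * 1000,
    'cpu_usage': np.random.rand(100) * 100,  # target
})

# Correlation of each numeric column with the target
print(df.corr(method='pearson')['cpu_usage'].sort_values())
print(df.corr(method='spearman')['cpu_usage'].sort_values())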
Because you are working with time series, I would advise you to create more features using aggregations over the last seconds, minutes, hours, etc., after adding the features you mentioned. For these you can use the mean, median, std, and so on, as in the sketch below.
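Something like this (it assumes df has a sorted DatetimeIndex and a 'cpu_usage' column; the window sizes and statistics are just examples):

# Rolling aggregates over time windows as extra features
# (assumes df is indexed by a sorted DatetimeIndex; names are placeholders)
for window in ['1min', '5min', '1h']:
    roll = df['cpu_usage'].rolling(window)
    df[f'cpu_mean_{window}'] = roll.mean()
    df[f'cpu_median_{window}'] = roll.median()
    df[f'cpu_std_{window}'] = roll.std()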