Algorithm suggestion for magic cake optimization

I’ve greatly enjoyed building projects with CNNs, RNNs and GANs; however, there’s a simpler-sounding problem I’m not sure how to approach with machine learning (or deep learning/data science more broadly).

Let’s say we have data from the Magic Cake factory where we have 10,000 rows of info on their cake ingredients, and how each batch of cake turned out over the last year. They’re using flour, coconut, maple syrup and their own secret magic mix, which reduces the amount of calories in the cake so everyone can indulge to their hearts’ content.

The magic mix is expensive, so ideally they want to use as little as necessary to reduce the calories to the goal level of around just 50 calories per cake (we all need some calories, after all). The data we have shows sometimes they add too much magic mix, resulting in 45 calories per cake, and sometimes they add too little, and end up with 55-60 calories per cake. Other times, they get it just right, and end up with 49-51 calories per cake.

Now, with this data, how would we build a model that predicts the best amount of magic mix to use? The obvious approach to me was linear regression, and this is fine up to a point: we are, after all, fitting a line to what the humans at the cake factory have done, so the model will reflect their biases without necessarily reaching the point where we can enter the ingredients and cake size and have it tell us the best amount of magic mix to use. Logistic regression seems a good way to gauge the importance of the other variables (perhaps by creating classes based on whether the calorie count was on target (1) or not (0)), but it may not produce the best suggestions for the amount of magic mix.
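One way to make the linear-regression idea concrete is to fit a forward model (calories as a function of all ingredients) and then algebraically invert the fitted line for the magic-mix amount that hits the target. Everything below — column names, units, ranges and the "true" relationship — is a made-up assumption standing in for the factory data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the factory log; all names, units and the
# underlying relationship are assumptions for illustration only.
rng = np.random.default_rng(0)
n = 10_000
flour = rng.uniform(80, 120, n)      # grams
coconut = rng.uniform(10, 30, n)     # grams
syrup = rng.uniform(20, 40, n)       # grams
magic_mix = rng.uniform(1, 10, n)    # grams
# Assumed ground truth: magic mix reduces calories linearly, plus batch noise.
calories = (30 + 0.2 * flour + 0.5 * coconut + 0.4 * syrup
            - 3.0 * magic_mix + rng.normal(0, 1.0, n))

# Forward model: predict calories from all ingredients, including magic mix.
X = np.column_stack([flour, coconut, syrup, magic_mix])
model = LinearRegression().fit(X, calories)

# Invert the fitted line: solve for the magic mix that hits 50 calories
# for one fixed recipe of the other ingredients.
b0 = model.intercept_
b_flour, b_coconut, b_syrup, b_mix = model.coef_
target = 50.0
recipe = {"flour": 100.0, "coconut": 20.0, "syrup": 30.0}
mix_needed = (target - b0
              - b_flour * recipe["flour"]
              - b_coconut * recipe["coconut"]
              - b_syrup * recipe["syrup"]) / b_mix
print(f"suggested magic mix: {mix_needed:.2f} g")
```

The inversion only works because the model is linear in magic mix; with a non-linear model you’d need a numerical search over the magic-mix amount instead.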

I would love any suggestions.

I think a linear regression model trained on this data would let you predict exactly the amount of magic mix required for a fixed calorie count, cake size and amounts of the other ingredients, wouldn’t it?

That’s what I thought, but the inputs include human inaccuracy: for example, if employees generally tend to add too much magic mix, the model will treat those inaccuracies as normal and inherit the same biases. Perhaps one idea is to consider only the data points where they hit the target calories exactly?
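That filtering idea could be a one-liner in pandas; the column names and toy values here are my assumptions about the dataset:

```python
import pandas as pd

# Toy batch log (column names are assumed, not from the real data).
df = pd.DataFrame({
    "magic_mix": [4.0, 6.5, 5.2, 5.0, 7.1],
    "calories":  [58.0, 45.0, 50.4, 49.6, 44.2],
})

# Keep only the batches that landed in the 49-51 "just right" band.
on_target = df[df["calories"].between(49, 51)]
print(on_target)
```

One caveat: the off-target batches are also informative — they show the model how calories change as the mix amount changes — so throwing them away may cost more than the bias it removes.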

Is the calorie count one of the features in your dataset? If yes, then the model should learn the correlation between “too much magic mix” and “fewer calories”, and when you supply the target calorie count at inference time, it should suggest the right amount of magic mix.
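A minimal sketch of this setup, with the calorie count as an input feature and the magic mix as the regression target — the data, names and relationship below are all synthetic assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the factory log (assumptions throughout).
rng = np.random.default_rng(1)
n = 10_000
flour = rng.uniform(80, 120, n)
coconut = rng.uniform(10, 30, n)
syrup = rng.uniform(20, 40, n)
magic_mix = rng.uniform(1, 10, n)
calories = (30 + 0.2 * flour + 0.5 * coconut + 0.4 * syrup
            - 3.0 * magic_mix + rng.normal(0, 1.0, n))

# Calorie count goes IN as a feature; magic mix comes OUT as the target.
X = np.column_stack([flour, coconut, syrup, calories])
model = LinearRegression().fit(X, magic_mix)

# At inference time, plug in the recipe plus the desired 50-calorie target.
suggested = model.predict(np.array([[100.0, 20.0, 30.0, 50.0]]))[0]
print(f"suggested magic mix: {suggested:.2f} g")
```

The learned coefficient on the calorie feature should be negative: asking for fewer calories means the model suggests more magic mix.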

Thanks Darek. Yes, the calorie count is in there. I’ve tried this on a few different problems and generally the model tends to inherit the flaws of the data, but I’ll keep trying. It’s good to hear the linear regression approach is the way to go.

This is how I imagine it would look, assuming all other features are held constant:
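A rough matplotlib version of that picture — fitted calories against magic mix with everything else held fixed. The slope, noise level and target band are made-up assumptions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic batches with every ingredient except magic mix held constant.
rng = np.random.default_rng(2)
magic_mix = rng.uniform(1, 10, 500)
calories = 72 - 3.0 * magic_mix + rng.normal(0, 1.0, 500)

model = LinearRegression().fit(magic_mix.reshape(-1, 1), calories)
grid = np.linspace(1, 10, 100).reshape(-1, 1)
pred = model.predict(grid)

plt.scatter(magic_mix, calories, s=8, alpha=0.4, label="batches")
plt.plot(grid, pred, color="red", label="fitted line")
plt.axhline(50, linestyle="--", color="gray", label="50-calorie target")
plt.xlabel("magic mix (g)")
plt.ylabel("calories per cake")
plt.legend()
plt.savefig("magic_mix_vs_calories.png", dpi=120)
```

Where the red line crosses the 50-calorie dashed line is the suggested magic-mix amount.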

Love it! Thank you very much for your advice and this awesome visualization!