Research project using tabular. Need help

Hello, all. I am undertaking a research project using tabular data. I am planning to gather a lot of information about cities around the world and put it in an Excel sheet, labelling each city with a single happiness score. Is it possible to use a model trained on this data to predict the happiness score of another place? And more specifically, can I use it to identify the most significant factors in happiness in a country? I’m new to this, but I would really, REALLY appreciate help with this. Thank you so much!

-Samay

This is where careful design of your validation and test sets is important. To illustrate, let’s pretend I have 3 cities: A, B, and C. In general I would use the following layout:

  • Training dataset of A
  • Validation dataset of B
  • Test dataset of C

So now, while I trained on A and could kinda see how we did on B, I then further validate/check the accuracy on dataset C, a city that our model has never seen during training. That will tell you just how well your model can generalize to a new city.
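
In code, that split could look something like this (just a sketch with pandas; the column names and toy numbers are made up, not your data):

```python
import pandas as pd

# Toy data: a couple of rows per city plus a happiness score.
# (Column names here are just placeholders.)
df = pd.DataFrame({
    "city":       ["A", "A", "B", "B", "C", "C"],
    "avg_income": [52_000, 55_000, 31_000, 33_000, 47_000, 45_000],
    "happiness":  [80, 78, 65, 63, 74, 72],
})

# Split by city rather than by random rows, so the model never sees
# anything about the validation/test cities during training.
train_df = df[df["city"] == "A"]
valid_df = df[df["city"] == "B"]
test_df  = df[df["city"] == "C"]
```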

Does this help some? :slight_smile:

Thank you so much, I appreciate it more than you know.

A few other things: I am planning to use only around 100 cities, because I’m not sure where to find a complete dataset and will enter some data manually. So I’ll split it up 70/20/10. Is that okay?

Next, I want to input changes to a city. For example, if New York has a happiness rating of 80/100 (or whatever scale/rating), and I then change the average income to a higher number, I can see how the rating changes. I am planning to do this for many other categories to find which category has the greatest effect, as well as how beneficial a change would be.
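
Roughly what I have in mind, as a sketch (the features, numbers, and the random-forest model here are made up just to illustrate, not my actual data or model):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up dataset just to show the "what if" idea.
df = pd.DataFrame({
    "avg_income":  [55_000, 31_000, 47_000, 60_000, 28_000],
    "green_space": [12, 30, 22, 18, 8],
    "happiness":   [80, 65, 74, 82, 58],
})
model = RandomForestRegressor(random_state=0).fit(
    df[["avg_income", "green_space"]], df["happiness"]
)

# Copy one city's row, bump a single feature, and re-predict.
city = df[["avg_income", "green_space"]].iloc[[0]].copy()
baseline = model.predict(city)[0]

city["avg_income"] *= 1.10            # e.g. raise average income by 10%
what_if = model.predict(city)[0]

print(f"before: {baseline:.1f}  after income change: {what_if:.1f}")
```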

Also, if I want to set a range within which the model’s rating counts as correct, how would I go about doing that? Say it guesses and is 1 point off from the real score (on a scale of 100); how can I set an interval like that? And should I go with a small scale like 0 to 5, or stay with a larger one?

I know it’s weird, but I hope you can answer some of these. Thank you so much.

I’d split it up by cities, as I said. Figure out roughly what 20% of your data is and then pick some cities that roughly sum to that amount. Same for the 10%. This way you can get a fair measure of how well your model forecasts on cities it has never seen.
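
Concretely, something like this (a sketch; it assumes your data is in a pandas DataFrame `df` with a `city` column, as in the earlier example, and that each city's rows stay together):

```python
import numpy as np

# Shuffle the list of cities, then carve it into roughly 70/20/10.
rng = np.random.default_rng(42)
cities = df["city"].unique()
rng.shuffle(cities)

n = len(cities)
train_cities = cities[: int(0.7 * n)]
valid_cities = cities[int(0.7 * n): int(0.9 * n)]
test_cities  = cities[int(0.9 * n):]

train_df = df[df["city"].isin(train_cities)]
valid_df = df[df["city"].isin(valid_cities)]
test_df  = df[df["city"].isin(test_cities)]
```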

You’d want to look into permutation importance. This will tell you how much each input affected the output. It works by randomly shuffling one column at a time and seeing how much that hurts the model’s predictions; the columns that hurt the most when shuffled are the variables the model relied on most.
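
scikit-learn has this built in, if your model is sklearn-compatible (a sketch; `model`, `valid_X`, and `valid_y` stand in for your own fitted model and validation data):

```python
from sklearn.inspection import permutation_importance

# Shuffle each column several times and measure how much the validation
# score drops; the bigger the drop, the more the model relied on it.
result = permutation_importance(model, valid_X, valid_y,
                                n_repeats=10, random_state=0)

ranked = sorted(zip(valid_X.columns, result.importances_mean),
                key=lambda pair: pair[1], reverse=True)
for name, drop in ranked:
    print(f"{name}: {drop:.3f}")
```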

You’d want this as a metric, probably not in the loss function. Also, experiment :wink:
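
For example, a “within N points” accuracy you report alongside your usual error metric (just a sketch; the tolerance of 5 and the numbers are made up):

```python
import numpy as np

def within_tolerance(preds, targets, tol=5):
    """Fraction of predictions within `tol` points of the true score."""
    preds, targets = np.asarray(preds), np.asarray(targets)
    return float(np.mean(np.abs(preds - targets) <= tol))

# e.g. three predicted vs. true happiness scores on a 0-100 scale
print(within_tolerance([79, 62, 70], [80, 65, 50]))  # prints 0.666..., 2 of 3 within 5 points
```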