World Inequality Dataset - Entity Embedding/Data Cleaning Practice Problem?

Hi all! The World Inequality Database has just published its 2018 World Inequality Report, summarizing the research of 100 academics around the world who investigate and document capital flows from 1980 onward. (Nice description here: https://boingboing.net/2017/12/14/oligarchy-on-ice.html). The implication of this research is that inequality is widespread, somewhat inevitable, getting worse, and likely to keep worsening. In addition to publishing a sobering report, the research team has published the full dataset from which their findings came.

So I thought, since we’ve learned here about this great new technique for deep learning on structured datasets (entity embedding of categorical variables), why don’t we see what such a model can learn from this data? The dataset is here, and contains structured csv files with general macroeconomic and inequality data. We can go through all the steps: data cleaning, picking continuous and categorical variables, and choosing an appropriate train/test split.
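
For anyone who hasn’t seen the technique yet, here’s a minimal sketch of what entity embedding looks like in plain PyTorch (the column names, embedding sizes, and layer widths below are just placeholders I made up; the course notebooks do this via the fastai library instead):

```python
import torch
import torch.nn as nn

class TabularNet(nn.Module):
    """One embedding per categorical column; the embeddings are concatenated
    with the continuous features and fed through a small MLP."""
    def __init__(self, cardinalities, n_cont, emb_dim=8, hidden=64):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(card, emb_dim) for card in cardinalities])
        n_in = emb_dim * len(cardinalities) + n_cont
        self.mlp = nn.Sequential(nn.Linear(n_in, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_categorical) integer codes; x_cont: (batch, n_cont) floats
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embs)]
        return self.mlp(torch.cat(embedded + [x_cont], dim=1))

# e.g. two categorical columns (country, dataset type) plus five numeric columns
model = TabularNet(cardinalities=[313, 2], n_cont=5)
```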

Some questions I’ve been thinking about:

  1. Which variable is worth predicting here, considering how much categorization there is in this dataset, especially in the income concentration columns? Should it be the amount of wealth concentration in the highest percentiles of income?
  2. A strategy for combining the datasets could be to create a column in each dataframe for the country code, and another for ‘dataset type’ (‘MacroData’ vs. ‘InequalityData’), then stacking all the files into one large dataframe (rough sketch below). Also, should the train/test split be done as a percentage of each country’s data, or cut on whole countries?
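
Here’s roughly what I had in mind for question 2, as a sketch only. The folder name `wid_all_data` and the semicolon separator are assumptions on my part, so adjust them to whatever the actual download uses (readme.txt should say):

```python
import glob
import os

import numpy as np
import pandas as pd

def load_wid(pattern, dataset_type):
    """Read every per-country WID csv matching `pattern`, tagging each row
    with its country code and dataset type."""
    frames = []
    for path in glob.glob(pattern):
        country = os.path.basename(path).split('_')[1]  # 'JP' from WID_JP_InequalityData.csv
        df = pd.read_csv(path, sep=';')                 # change sep if the files use commas
        df['country'] = country
        df['dataset_type'] = dataset_type
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

combined = pd.concat([
    load_wid('wid_all_data/WID_*_InequalityData.csv', 'InequalityData'),
    load_wid('wid_all_data/WID_*_MacroData.csv', 'MacroData'),
], ignore_index=True)

# One option for the split question: hold out whole countries, so validation
# measures how well the model generalizes to places it has never seen.
rng = np.random.RandomState(42)
countries = combined['country'].unique()
val_countries = set(rng.choice(countries, size=int(0.2 * len(countries)), replace=False))
train_df = combined[~combined['country'].isin(val_countries)]
valid_df = combined[combined['country'].isin(val_countries)]
```

The alternative would be a within-country split (e.g. holding out the most recent years in each country), which tests interpolation over time rather than generalization to unseen countries. Curious which people think is the more honest test here.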

Some things I’ve noticed about the raw data set so far:

  1. Each of the 313 countries has a country code.
  2. Each of the 313 countries has two files, e.g. WID_JP_InequalityData.csv and WID_JP_MacroData.csv for country code JP (Japan).
  3. There’s a large disparity in the size of each country’s data. For example, Japan’s InequalityData has ~1460 rows whereas my country’s (Kenya) has ~190.
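
A quick way to see that disparity across the whole download (again assuming the same folder layout as in the sketch above):

```python
import glob
import os

# Rows per country's InequalityData file (minus the header line)
counts = {
    os.path.basename(p).split('_')[1]: sum(1 for _ in open(p)) - 1
    for p in glob.glob('wid_all_data/WID_*_InequalityData.csv')
}
for code, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(code, n)  # the ten countries with the most inequality rows
```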

There’s more in a readme.txt in the full data set. Let’s study this thing and flex this new muscle!

I’m so happy to have participated in this class. Thank you @jeremy, @rachel, @yinterian for all your hard work and excellent technique.


It looks like the financial data is normalized to 2015 local currency for each country.