Converting relational data to tabular data

aayres · February 1, 2019, 12:44pm

I may have the opportunity to use fast.ai’s tabular.data to train a model for a niche insurance underwriting proof of concept, for which I have to put together a proposal.

The training and validation data will come from historical online questionnaire answers, currently stored in a SQL database, for which the dependent variable is known.

I’ve read https://docs.fast.ai/tabular.data.html, Rachel’s blog post https://www.fast.ai/2018/04/29/categorical-embeddings/ - and watched Rachel’s workshop https://www.youtube.com/watch?v=25nC0n9ERq4 and the course-v3 (now course 2019) lessons, and have the following remaining questions;

is the general approach of converting relational data (multiple RDMS tables, using foreign keys to link tables) to tabular data (ie. one ‘flat’ table of data, where each column represents either a category or continuous variable) the correct approach?
should I handle the one-to-many, many-to-many relations by expanding them out into a single table, with a limit to the upper-bound of ‘many’? For example, the questionnaire may have asked what countries the applicant visited in the last 5 years, when, and for how long. An applicant may have visited between 0 and 10 countries, for example (10 being a limit imposed on the original questionnaire). Should I convert this into 30 columns containing 30 categories? ie. VisitedCountry1Name, VisitedCountry1When, VisitedCountry1HowLong, VisitedCountry2Name, VisitedCountry2When, VisitedCountry2HowLong…etc. (where the category value is None/null if the applicant didn’t visit a country for the Nth time). Let’s say there is a significant relation between visiting country X and the value of the dependent variable. Could it weaken the predictive power of the model that ‘country X’ could appear in any one of these categories? (VisitedCountry1Name, VisitedCountry2Name, etc.). Should I have a single CountriesVisited category, which contains a value representing all countries and whether they were visited or not? (maybe a vector/embedding were 0=NotVisited and 1=Visited, [0,1,1,0,…]), or where countryvisited?/date(yyyy-mm-dd)/duration(days) are each embedded, [[1,2015,14], [0,0,0], …], so a single CountriesVisited column contains all this data?
this may be similar or even identical to 2. above, but where a concept like ‘countries visited’ could be embedded into a single vector, since all countries are known, there are cases where the collection of ‘entities’ isn’t of fixed size. For example, the questionnaire asks about family members, and details about each family member. Whereas the relational data is stored in tables with foreign keys, should I convert this to categories for each family member, up to the number of family members allowed by the original questionnaire? e.g. say the limit is 5 family members, should the categories be things like FamilyMember1Name, FamilyMember1Age, FamilyMember2Name, FamilyMember2Age…etc. - so it can all be contained in a single table of data? Or maybe there’s a way to embed ‘FamilyMembers’ data into a single column, where each value is a list of family member entity embeddings, or have a single column for each family member, where each value is a single family member embedding.
as a general point, I’m concerned about whether there’s any value to be obtained from analysing historical questionnaire answers (hence the current ‘proof of concept’ status this project has), and whether there’s any bias in using such data (maybe a trickier issue, as it would require analysing potential bias in the original questionnaire, and how it was carried out). Obviously, the data will have to be anonymised before it is used (e.g. removal of any identifying data, such as names and addresses). Postal codes could, at least in the UK, also be used for identification (since they can narrow down to just a few houses), but they can be mapped to “output areas” (codes used by the UK’s Office for National Statistics) which are still useful, but not able to be used to identify an individual. I haven’t looked into this yet, but maybe encrypting the dataset is another possibility.

If anyone has any advice or guidance on 1-4 above, or regarding working with tabular data in general, I welcome it. I’m also happy to work with anyone else working on something similar, in case we can help each other. In the meantime, I’m going to research this subject and run some tests, to see if I can find the answers.

Ralph · February 2, 2019, 12:57pm

I would build a single denormalized table, and treat the various visits/events as separate line items. Treat each policy as a categorical variable and build embeddings for that category to distill all the inputs.

I’ve got transactional data in a relational database. I built a bunch of identity tables for all categorical variables and then built a view to join all the source tables with the identity tables so the output was a big fat denormalized set with integer values for all the categorical values. Then you can just select * from the view into your dataframe.

aayres · February 3, 2019, 9:00am

Thanks Ralph for replying and sharing how you handle this kind of scenario, much appreciated .