How do you currently find datasets for your projects?

ebd · October 9, 2020, 4:08am

Hi all. A friend and I have been working on a tool to help people pull down the relevant parts of tabular datasets. We're curious how people usually find their datasets, and we'd like feedback on what could be improved about that process.

We’ve written the beginnings of a Python library that searches the datasets we’ve indexed for ones that overlap your dataset on lat/lng and date/time. For example, inspired by Jeremy’s Rossmann example from one of the notebooks, you can find data that’s relevant to the times and locations in your tabular data like so:

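import pandas as pd
import our_library  # placeholder name for the library

my_df = pd.read_csv('rossman.csv')

# Map each column type the library understands to the matching column in my_df.
# (`type` here presumably stands for the library's column types, not Python's builtin.)
column_type_mapping = {
    type.latitude: 'MYLATCOLUMN',
    type.longitude: 'MYLNGCOLUMN',
    type.datetime: 'MYDATECOLUMN'
}

# Search the indexed datasets for rows that overlap my_df in space and time.
my_df.our_library.discover(on=column_type_mapping)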

[4/4] 'noaa.global_summary_of_day' (1.75M of 151M rows relevant (1.16%))
[4/4] 'epa_aqs.ozone_daily_summary' (750K of 12.3M rows relevant (6.12%))
[4/4] 'epa_aqs.so2_daily_summary' (1.50M of 17.5M rows relevant (8.59%))
[4/4] 'epa_aqs.co_daily_summary' (750K of 11.6M rows relevant (6.46%))
[4/4] 'epa_aqs.no2_daily_summary' (500K of 5.08M rows relevant (9.84%))
[4/4] 'usgs_comcat.summary' (250K of 3.29M rows relevant (7.61%))
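To make "rows relevant" a bit more concrete, here is a rough sketch of one way such an overlap could be estimated with plain pandas. This is not how the library works internally, just an illustration under simple assumptions: a padded bounding box plus a date range. The function name `relevant_fraction`, the column names, and the `pad_deg` parameter are all made up for the example.

```python
import pandas as pd

def relevant_fraction(mine, candidate, lat="latitude", lng="longitude",
                      date="date", pad_deg=0.5):
    """Fraction of `candidate` rows inside the padded bounding box and
    date range spanned by `mine`. Hypothetical helper, for illustration only."""
    # Bounding box of my data, padded by a few tenths of a degree
    lat_lo, lat_hi = mine[lat].min() - pad_deg, mine[lat].max() + pad_deg
    lng_lo, lng_hi = mine[lng].min() - pad_deg, mine[lng].max() + pad_deg
    # Date range of my data (both frames must use the same datetime dtype)
    date_lo, date_hi = mine[date].min(), mine[date].max()

    mask = (
        candidate[lat].between(lat_lo, lat_hi)
        & candidate[lng].between(lng_lo, lng_hi)
        & candidate[date].between(date_lo, date_hi)
    )
    # Mean of a boolean mask is the share of rows that matched
    return mask.mean()
```

A bounding box overcounts when your locations are spread out (everything between two distant stores counts as relevant), so a real implementation would likely match against individual points or grid cells instead.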

If you then decide you’d like to add weather data (as in the Rossmann notebook) to give your model some context for why attendance figures were what they were on a given day, the library can download just the parts of the NOAA weather data that cover the store locations and the dates in your data, like so:

my_df.our_library.fetch_chunks_df('noaa.global_summary_of_day', on=column_type_mapping).head()

       station        date   latitude  longitude  elevation  ... has_rain_or_drizzle  has_snow_or_ice_pellets  has_hail  has_thunder  has_tornado_or_funnel_cloud
0  42111099999  2019-03-01  30.316667  78.033333      682.0  ...               False                    False     False        False                        False
1  42111099999  2019-03-02  30.316667  78.033333      682.0  ...                True                    False     False        False                        False
2  42111099999  2019-03-03  30.316667  78.033333      682.0  ...               False                    False     False        False                        False
3  42111099999  2019-03-04  30.316667  78.033333      682.0  ...                True                    False     False         True                        False
4  42111099999  2019-03-05  30.316667  78.033333      682.0  ...               False                    False     False        False                        False
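A fetched chunk like this still has to be joined back onto your own rows. Below is a minimal sketch of one way to do that with plain pandas and NumPy: assign each of your rows to its nearest weather station, then merge on station and date. It is not part of the library. The `station`, `latitude`, `longitude`, and `date` columns follow the output above; the `lat`, `lng`, and `date` columns on your own frame, and both function names, are assumptions for the example.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km; inputs in degrees, broadcasts over arrays."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def attach_nearest_station(stores, weather):
    """Left-join weather rows onto `stores` via the nearest station and the date.
    Hypothetical helper; assumes `date` has the same dtype in both frames."""
    # One row per station with its coordinates
    stations = (weather.drop_duplicates("station")
                [["station", "latitude", "longitude"]]
                .reset_index(drop=True))
    # Distance matrix of shape (n_store_rows, n_stations)
    dist = haversine_km(
        stores["lat"].to_numpy()[:, None], stores["lng"].to_numpy()[:, None],
        stations["latitude"].to_numpy()[None, :],
        stations["longitude"].to_numpy()[None, :],
    )
    out = stores.copy()
    out["station"] = stations["station"].to_numpy()[dist.argmin(axis=1)]
    # Drop station coordinates so they don't clash with the store columns
    weather_cols = weather.drop(columns=["latitude", "longitude"])
    return out.merge(weather_cols, on=["station", "date"], how="left")
```

The full distance matrix is fine for a few thousand rows and stations; for anything larger, a spatial index would be the more sensible choice.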

We’ve been hacking on the fun, crunchy challenges this involves for a while, but we realized we never checked whether anyone else actually wants something like this. Does it sound like something you might be interested in?

If so, are there any public datasets you’d find useful that we should try to pull in? So far we just have a few from NOAA, EPA, Census, USGS, and some other public orgs.

Could you see yourself exploring datasets using a Python library in an interpreter or notebook, or would you rather do that mostly on a website? Could you see yourself ever uploading a dataset to a public, searchable repository of datasets? Is there anything you’d be particularly concerned about when doing so, such as licenses?

Thanks very much for any feedback, positive or negative.