Including datasets in nbdev generated python library

JonnoB · November 11, 2022, 6:46pm

Hi, I am new to making Python libraries and to nbdev. Over the last couple of weeks I have been making a library and have loved how easy it is with nbdev. Previously I made R packages using R Markdown, but nbdev is absolutely fantastic. However, I do have an issue: I want to include some small CSV files as datasets that can be loaded from the library. I have not been able to find out how to do this from the documentation, and I can't see how it is done when looking at other nbdev projects.

I have found some approaches on Stack Overflow, but these do not seem to work for nbdev-style libraries.

My folder structure looks like this

The answer is probably very obvious, so I apologise in advance if it is, but I have no idea what to do and it is the final piece needed to complete the MVP.

Thanks for your help

Jonno

kaiguender · November 11, 2022, 7:46pm

I happened to face the same problem today and managed to find a solution that worked for me. I created a folder called 'datasets' inside the folder where all the .py files of my package are stored. To be clear: in my case the package is named 'dddex', so I put the 'datasets' folder inside the 'dddex' folder.

In order to deploy your package with your .csv files, you have to add

include_package_data=True,
package_data={'': ['datasets/*.csv']}

to your setup.py file inside the

setuptools.setup()

call. That is all I had to do. If you intend to include code that opens one of your .csv files, you have to make sure that the path used to open the file is built dynamically, so that it works independently of where the package is installed. In my case,

dataPath = pkg_resources.resource_stream(__name__, 'datasets/dataYaz.csv')

the package pkg_resources turned out to be very useful for this task. You can also have a look at the GitHub repo of my package, maybe that helps as well:

I hope that helps.
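
For reference, the setup.py change described above can be sketched roughly as follows. This is only an illustration, not the actual setup.py that nbdev generates: the package name "mypkg" and the `setup_kwargs` dictionary are placeholders introduced here. The empty-string key in `package_data` applies the pattern to every package, and the pattern is resolved relative to each package's own directory.

```python
# Hypothetical excerpt of a setup.py; "mypkg" is a placeholder package name.
# The keyword arguments are collected in a dict so the relevant entries are
# easy to see; a real setup.py would pass them to setuptools.setup().
setup_kwargs = dict(
    name="mypkg",
    packages=["mypkg"],
    # Also ship data files that are declared via MANIFEST.in.
    include_package_data=True,
    # "" means "all packages"; the glob is relative to the package folder,
    # so this matches mypkg/datasets/*.csv.
    package_data={"": ["datasets/*.csv"]},
)

# In a real setup.py the file would end with:
#     setuptools.setup(**setup_kwargs)
```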

JonnoB · November 11, 2022, 10:08pm

Thank you Kai,

What a coincidence! Looking at your approach, it seems I had the data folder too high up in the structure. I have moved it and added the bits you mentioned. However, I then get a bunch of errors caused by the pkg_resources library

example

“ERROR: Could not find a version that satisfies the requirement pkg_resources (from pyseatrials[dev]) (from versions: none)”

Did you have issues with this?

JonnoB · November 13, 2022, 3:03pm

I resolved the pkg_resources issue by replacing it with pkgutil, which is apparently a more modern way of loading data from packages, along with importlib.resources (see this SO post).
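
As a rough illustration of the two standard-library options mentioned above, here is a minimal sketch that reads a packaged CSV with `pkgutil.get_data` and with `importlib.resources.files` (Python 3.9+). So that it runs anywhere, it first builds a throwaway package in a temporary directory; the package name `demo_datasets_pkg`, the file `demo.csv`, its contents, and both helper functions are made up for this example and do not come from any project in this thread.

```python
import csv
import io
import pkgutil
import sys
import tempfile
from importlib.resources import files  # available from Python 3.9
from pathlib import Path

# Build a throwaway package so the sketch is self-contained. In a real
# project this would simply be the installed library (all names hypothetical).
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_datasets_pkg"
(pkg_dir / "datasets").mkdir(parents=True)
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "datasets" / "demo.csv").write_text("speed,power\n10,200\n12,310\n")
sys.path.insert(0, str(root))


def load_rows_pkgutil(package, resource):
    """Read a packaged CSV via pkgutil.get_data, which returns raw bytes."""
    raw = pkgutil.get_data(package, resource)
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


def load_rows_importlib(package, *parts):
    """Read the same CSV via importlib.resources.files."""
    resource = files(package)
    for part in parts:
        resource = resource / part
    text = resource.read_text(encoding="utf-8")
    return list(csv.DictReader(io.StringIO(text)))


rows_a = load_rows_pkgutil("demo_datasets_pkg", "datasets/demo.csv")
rows_b = load_rows_importlib("demo_datasets_pkg", "datasets", "demo.csv")
print(rows_a == rows_b)
```

Both helpers locate the file relative to the package itself, so the lookup does not depend on the current working directory or on where the package was installed.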

Once that was fixed I found that the dataset was not being installed along with the rest of the package.
@kaiguender had put the following


include_package_data=True,
package_data={'': ['datasets/*.csv']}

in his setup.py whereas I had put mine in the settings.ini. When the settings.ini is used the datasets are not installed, whereas doing it Kai’s way they are. So thank you Kai!

I thought that setup.py should not be changed and that only settings.ini should be used. Is this not the case? Is it a bug or a feature?

jeremy · November 13, 2022, 6:50pm

This is not the case.