How to use nbdev for non-library ML pipelines?

saneshashank · December 15, 2020, 7:21am

nbdev seems great for package development, but 90% of the time in production we need ML pipelines that do not explicitly involve developing new packages. For example, I might want to generate train.py (or predict.py) from my notebook for production. Since the notebook sits at a different directory level from the generated scripts, relative paths will not work in both. For example, if I have a "data" folder that contains my data, the relative path to it from the notebook differs from the relative path from the script. Does anyone have any thoughts on this?

butchland · December 15, 2020, 10:11am

Hi @saneshashank,

My suggestion is to still use nbdev to build reusable routines as packages.
You can have your “tests” inside the notebook where your routines reside.
You probably wouldn’t want to hardcode your paths into your routines anyway.

Then using fastcore’s script utility, you can turn these routines into commands
with arguments and help messages.
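
As a rough illustration of that idea (a routine that receives its paths as arguments, exposed as a command with help messages), here is a minimal sketch. It uses the standard library's argparse as a stand-in so that it runs without extra dependencies; it is not fastcore's script API, which gets a similar result with less boilerplate. The names train, build_parser and main, and the --data_dir and --epochs arguments, are hypothetical.

```python
import argparse
from pathlib import Path


def train(data_dir: Path, epochs: int = 1) -> str:
    """Hypothetical routine: the data location is passed in, never hardcoded."""
    return f"training on {data_dir} for {epochs} epoch(s)"


def build_parser() -> argparse.ArgumentParser:
    """Describe the command-line arguments and their help messages."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative only).")
    parser.add_argument("--data_dir", type=Path, default=Path("data"),
                        help="Folder containing the training data")
    parser.add_argument("--epochs", type=int, default=1,
                        help="Number of training epochs")
    return parser


def main(argv=None) -> str:
    """Parse `argv` (or sys.argv[1:] when argv is None) and call the routine."""
    args = build_parser().parse_args(argv)
    return train(args.data_dir, args.epochs)
```

In a script you would call main() under an `if __name__ == "__main__":` guard. The same train function stays importable from the notebook, where the "tests" can call it directly with a notebook-appropriate path.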

See this link for more info.

Your ML pipeline can then just be bash scripts in conjunction with config files or
whatever you want.

As suggested, you can do editable installs so you don’t have to keep reinstalling the
package as you update your notebook (just run nbdev_build_lib to sync them).

HTH.

Best regards,
Butch

akhilvempali · September 20, 2022, 8:21pm

To add to @butchland’s answer, you can also explore the pkg_resources library, which lets your notebooks use the same references to files and folders as the exported “script”. I found it very helpful while developing CLI applications.

import os
import pkg_resources

# Resolve the file relative to the module named by __name__, not the working directory
filepath = pkg_resources.resource_filename(__name__, '../data/sample.json')
print(os.path.abspath(filepath))

/Users/akhilvempali/Documents/personal/<repo_name>/data/sample.json

seem · September 21, 2022, 1:19am

Great question, and really good answers above. I’m excited about nbdev’s potential for ML pipelines. Some tips that might also be helpful:

Repo-relative paths in notebooks and modules

nbdev’s config system has built-in support for this:

>>> from nbdev.config import get_config
>>> get_config().config_path
Path('/Users/seem/code/nbdev')  # depends on your repo location

Config.config_path is always an absolute path to your repo root.

To avoid repeating yourself you can also add keys to your settings.ini, for example:

data_dir = data

…then use get_config().path('data_dir'), which returns an absolute path to {your_repo}/data.
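
To show why a config-relative lookup gives the same answer from a notebook and from an exported script, here is a minimal standard-library sketch of the general technique. It is not nbdev's implementation. The helpers find_repo_root and repo_path are hypothetical, and the sketch assumes that settings.ini keeps its keys in a [DEFAULT] section at the repo root.

```python
from configparser import ConfigParser
from pathlib import Path


def find_repo_root(start: Path, cfg_name: str = "settings.ini") -> Path:
    """Walk upwards from `start` until a folder containing `cfg_name` is found."""
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if (folder / cfg_name).is_file():
            return folder
    raise FileNotFoundError(f"No {cfg_name} found at or above {start}")


def repo_path(key: str, start: Path = Path("."), cfg_name: str = "settings.ini") -> Path:
    """Return an absolute path for a config key such as `data_dir`."""
    root = find_repo_root(start, cfg_name)
    parser = ConfigParser()
    parser.read(root / cfg_name, encoding="utf-8")
    # Assumption: the key lives in the [DEFAULT] section of the config file
    return root / parser["DEFAULT"][key]
```

Because the path is anchored at the folder that holds the config file, calling repo_path("data_dir") from a notebook in a subfolder and from a script elsewhere in the repo should resolve to the same data folder.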

Exporting scripts & running them

Notebooks can export to scripts (using fastcore.script if you like, or by exporting script code as is).

You can then run them like so:

python -m your_package.train