Notebook orchestration

Hi!
First of all, thanks for developing nbdev. It makes our life as data scientists so much easier.
I have numerous notebooks describing different workflows.

Is there a way in nbdev to orchestrate the execution of a predefined sequence of notebooks? Something like make, but with proper documentation and the possibility to define dependencies?

Thanks again!

2 Likes

Just FYI: we have been using Taskfile so far and are quite happy: https://taskfile.dev/#/
It allows you to define dependencies and run full Jupyter notebook workflows from the command line. An alternative would be e.g. Airflow.
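For a flavor of it, here's a minimal Taskfile sketch (untested, and the notebook names are made up); running "task train" executes the prepare notebook first because of the deps entry:

# Taskfile.yml (hypothetical notebook names)
version: '3'

tasks:
  prepare:
    cmds:
      - jupyter nbconvert --to notebook --execute 00_prepare_data.ipynb
  train:
    deps: [prepare]
    cmds:
      - jupyter nbconvert --to notebook --execute 01_train_model.ipynb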

2 Likes

Thanks for starting this topic @fabsta. I’m interested as well.

Has anyone had any luck developing production Airflow DAGs using nbdev? I’d love to hear any suggestions or experiences you’ve had.

1 Like

I failed at my first attempt at using nbdev for an Airflow deployment. I’ll try again at some point, because I don’t think my issues were fundamental to Airflow or nbdev. Basically I didn’t find any good reason to keep writing Airflow DAGs in .py files. But I did find a lot of hurdles to doing it in Jupyter notebooks. :smiley:

First, a big shoutout to lib2nbdev, without which I wouldn’t have even attempted converting the 60+ DAGs we have into notebooks. If you haven’t seen it, it’s very simple to use, and I hope that it can drive a bit of adoption of literate programming in the corporate software engineering space.

Currently nbdev seems focused on a specific use case: creating delightful python libraries using Jupyter Notebooks. In that I think it’s succeeded and moved the software engineering world forward. It’s shown what a good literate programming environment can be and it’s been used to produce some awesome libraries as a result. And I think that’s exactly the focus it needs to have to gain adoption and inspire future work.

But I just haven’t been able to easily get it deployed into a mature production system. I can see how to get an nbdev-generated library into a traditional web application via its requirements.txt, but not how to develop that application in nbdev. I’m sure it’s just a matter of my lack of experience with nbdev and the fact that it’s still early days for this library.

Without going into mundane technical details and Airflow-specific issues, I hit a lot of speed bumps. Suffice it to say, Airflow’s many quirks certainly don’t help. But I think there might be some things to improve about nbdev’s documentation as well. Just wish I knew what :slight_smile:!

My overall impression is that it’s hard to tell from the nbdev documentation what parts of it are fundamental to doing literate programming in notebooks and what parts can be swapped out. It’s gluing a lot of technologies together, so that’s understandable.

But I still have a lot of unresolved questions, mostly around which parts of settings.ini I can avoid maintaining manually. I’m sure it’s possible, but I couldn’t figure out how to hack nbdev to my needs. I was trying to get it to generate the docs and the code, but I needed to use a custom requirements.txt (with a --constraint flag).

Anyways, hope this isn’t taken the wrong way. I’m a big fan of the project and am following from the sidelines.

1 Like

After a couple more attempts, I was able to get nbdev set up for developing DAGs :tada:. If you have the patience for twiddling with imports and setting up deployment pipelines, I definitely recommend it. The software development lifecycle is much nicer via nbdev. Even for DAGs.

The main tricks involved not installing the DAGs as a library, getting the directory structures just right, and making the imports match those directory structures.

This was tricky because we need to write code that works when:

  1. building the docs,
  2. generating the library/source code,
  3. developing within Jupyter Lab,
  4. running in your local Airflow setup, which is probably a fairly opinionated Docker setup, and
  5. running from the production Airflow DAG deployment folder.

It turned into a bit of a balancing act. The nice part is that nbdev fails very gracefully, being a fairly modular system. Basically, your docs don’t need to build for the library .py code to be generated or for tests to run. And the library generation doesn’t have to work right away either. You can just start developing directly in the notebook until you have something worth testing and generating code from.
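Concretely, each step can be run (and fail) on its own. With the v1 command-line tools, that looks roughly like:

# each command stands alone; a broken docs build doesn't block the others
nbdev_build_lib    # notebooks -> .py modules
nbdev_test_nbs     # run the notebooks as tests
nbdev_build_docs   # generate the documentation site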

a) Do not rely exclusively on settings.ini.

Since we’re not installing or publishing the code as a library, we can put our build instructions directly into a Makefile (and/or a GitHub Actions YAML workflow, or however you usually do it). This lets us customize as necessary depending on what command or environment we’re interested in.
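As a rough sketch (the target names and the deploy step are placeholders for whatever your environment actually needs):

# Makefile (hypothetical targets)
.PHONY: lib test docs deploy

lib:
	nbdev_build_lib

test: lib
	nbdev_test_nbs

docs:
	nbdev_build_docs

deploy: test
	# upload the generated mylib/dags and requirements.txt to wherever
	# production reads them from, e.g. the MWAA S3 bucket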

Here’s one example of the complications: the production Airflow deployment (in AWS’s MWAA) needs a custom requirements.txt file uploaded, with an --extra-index-url that embeds private credentials. That way it can pull private dependencies from a private artifact repository. Locally, though, the credentials have to come from somewhere else. A --constraint flag also needs to be passed to pip install so the versions match what Airflow provides in production.

I don’t think this would be possible with the settings.ini requirements-related fields. Those are mostly intended for generating a clean setup.py file for building libraries, not for configuring messy application details. At least it wasn’t easy to figure out. But by stepping away from the nbdev settings.ini file, we can make a more sophisticated build that’s very similar to any other deployment: we use GitHub Actions CI when deploying to production, which sources those credentials from GitHub Secrets, while locally we can get the credentials from another source and leverage a requirements-dev file as well.
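To make that concrete, here’s roughly what the two sides look like (the index URL, package name, and versions are all invented placeholders):

# requirements.txt (production)
--extra-index-url https://user:${ARTIFACT_TOKEN}@artifacts.example.com/simple
our-private-helpers==1.2.3

# local install, pinned to the constraint file Airflow publishes per version
pip install -r requirements.txt \
  -c https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.7.txt

(pip expands ${ARTIFACT_TOKEN} from the environment in requirements files, which is one way to keep the credentials out of the repo.)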

b) Use full paths in all imports.

Let’s say you want to import the MyUtil class from the utilities module. You’ll need to import it like this:

from mylib.dags.utilities.my_util import MyUtil
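For contrast, a relative import like this tends to break once Airflow picks the file up, because Airflow loads DAG files directly rather than as part of a package:

from .utilities.my_util import MyUtil  # raises "attempted relative import" under Airflow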

Here’s roughly how it’s set up now:

# settings.ini (folder structure related values)
lib_name = mylib
nbs_path = .
doc_path = docs
lib_path = mylib

# folder structure
00_dags.utilities.my_util.ipynb
01_dags.my_dag.ipynb
/mylib
  /dags
    my_dag.py
    /utilities
      my_util.py

My first successful attempt actually put the notebooks living alongside the generated library code. That worked really nicely, until I ran into issues getting the docs and tests to pass at the same time. I think nbdev still has a few quirks to work out there. Will have to keep thinking through that - maybe there’s a way to build that support, but I don’t have any good ideas yet.

1 Like