After a couple more attempts, I was able to get nbdev set up for developing DAGs. If you have the patience for twiddling with imports and setting up deployment pipelines, I definitely recommend it. The software development lifecycle is much nicer via nbdev. Even for DAGs.
The main tricks involved not installing the DAGs as a library, getting the directory structure just right, and matching the imports to that structure.
This was tricky because we need to write code that works when:
- building the docs,
- generating the library/source code,
- developing within Jupyter Lab,
- running in your local Airflow setup, which is probably a fairly opinionated Docker setup,
- deploying into the production Airflow DAG folder.
It turned into a bit of a balancing act. The nice part is that nbdev fails very gracefully, being a fairly modular system. Basically, your docs don't need to build for the library .py code to be generated or for tests to run. And the library generation doesn't have to work right away either. You can just start developing directly in the notebook until you have something worth testing and generating code from.
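For instance, each stage has its own nbdev command, and each one can succeed or fail on its own (the ordering here is just one way to work):

```bash
nbdev_export   # generate the library .py code from the notebooks
nbdev_test     # run the notebook tests
nbdev_docs     # build the docs (fine if this one fails early on)
```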
a) Do not rely exclusively on settings.ini
Since we’re not installing or publishing the code as a library, we can put our build instructions directly into a Makefile (and/or GitHub Actions YML workflow or however you usually do it). This lets us customize as necessary depending on what command or environment we’re interested in.
Here's one example of the complications: the production Airflow deployment (in AWS' MWAA) needs a custom requirements.txt file uploaded, with an --extra-index-url entry containing private credentials. That way it can pull dependencies from a private artifact repository. Locally, though, the credentials have to come from somewhere else. A --constraint flag also needs to be passed to pip install so the versions match what's provided by Airflow in production.
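Here's a sketch of how the local side can look in the Makefile. The variable names, private index URL, and constraint URL values are illustrative, not my actual setup; Airflow does publish constraint files following this URL pattern:

```make
# Makefile (sketch; names and URLs are illustrative)
AIRFLOW_VERSION ?= 2.7.2
PYTHON_VERSION  ?= 3.11
# Airflow publishes constraint files pinning the versions it ships with
CONSTRAINT_URL = https://raw.githubusercontent.com/apache/airflow/constraints-$(AIRFLOW_VERSION)/constraints-$(PYTHON_VERSION).txt

install-local:
	# Locally, credentials come from the environment, not from requirements.txt
	pip install -r requirements.txt -r requirements-dev.txt \
		--extra-index-url "https://user:$$ARTIFACT_TOKEN@artifacts.example.com/simple" \
		--constraint "$(CONSTRAINT_URL)"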
I don't think this would be possible with the settings.ini requirements-related fields. Those are mostly intended for generating a clean setup.py for building libraries, not for configuring messy application details. At least it wasn't easy to figure out. But by stepping away from the nbdev settings.ini file, we can make a more sophisticated build that's very similar to any other deployment. We use GitHub Actions CI when deploying to production, which sources those credentials from its GitHub Secrets fields. And locally, we can use another source for the credentials and leverage a requirements-dev file as well.
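A rough sketch of that workflow. The secret names, bucket, and region are hypothetical, but the overall shape (render requirements.txt with credentials, upload to the S3 bucket MWAA reads from) matches the setup described above:

```yaml
# .github/workflows/deploy.yml (sketch; secret and bucket names are hypothetical)
name: deploy-dags
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Render requirements.txt with private index credentials
        run: |
          echo "--extra-index-url https://user:${{ secrets.ARTIFACT_TOKEN }}@artifacts.example.com/simple" > requirements.rendered.txt
          cat requirements.txt >> requirements.rendered.txt
      - name: Upload requirements and DAGs to the MWAA bucket
        run: |
          aws s3 cp requirements.rendered.txt s3://my-mwaa-bucket/requirements.txt
          aws s3 sync mylib/dags s3://my-mwaa-bucket/dags/
```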
b) Use full paths in all imports
Let's say you want to import the MyUtil class from the utilities module. You'll need to import it like this:

```python
from mylib.dags.utilities.my_util import MyUtil
```
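For that full path to exist in the generated library, the exporting notebook's nbdev directives have to spell out the same module path. A minimal sketch (the class body is illustrative; the directives themselves are standard nbdev):

```python
# In the notebook that defines MyUtil: the first cell names the export
# target, relative to lib_name ("mylib")
#| default_exp dags.utilities.my_util

#| export
class MyUtil:
    "Illustrative placeholder; exported to mylib/dags/utilities/my_util.py."
    def greet(self) -> str:
        return "hello from MyUtil"
```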
Here's roughly how it's set up now:

```ini
# settings.ini (folder structure related values)
lib_name = mylib
nbs_path = .
doc_path = docs
lib_path = mylib
```
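And the folder structure those settings produce looks roughly like this (my reconstruction from the settings.ini values and the import example above; the notebook and Makefile names are illustrative):

```text
# folder structure
.
├── settings.ini
├── Makefile
├── my_util.ipynb          # notebooks live at the repo root (nbs_path = .)
├── docs/                  # rendered docs (doc_path)
└── mylib/                 # generated library code (lib_path / lib_name)
    └── dags/
        └── utilities/
            └── my_util.py # exported from the notebook
```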
My first successful attempt actually put the notebooks alongside the generated library code. That worked really nicely, until I ran into issues getting the docs and tests to pass at the same time. I think nbdev just still has a few quirks to work out. I'll have to keep thinking through that; maybe there's a way to build that support, but I don't have any good ideas yet.