Jupyter lab is great, but it isn’t suited to everything

In a data science workflow, you might have cleaned your data and be quite happy with it, leaving a section of your notebook that stays relatively unchanged while you work on other parts of the problem. That section can grow large, and I find huge notebooks overwhelming

You may also want to share portions of that code with other projects/notebooks

Another issue with .ipynb files is that they produce unmanageable diffs from version control software like git (although there are other tools which produce nice visual diffs for notebooks - more on that later!)

Finally, it’s easy to accidentally rely on global variables, which is fine until it’s terribly confusing, as in the following made-up but all too common example:

import pandas as pd

df = pd.DataFrame(my_data)

def make_predictions(threshold: float):
    return df.sum() > threshold

make_predictions(2.0)

“hmmm - it’s not looking too good. Let’s improve the data”

df = pd.DataFrame(my_data)
df_cleaned = clean_data(df)

def make_predictions(threshold: float):
    # whoops - missed the fact that df here is a global variable
    return df.sum() > threshold

make_predictions(2.0)  # strange, cleaning the data didn't seem to help
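
The fix is to pass the data in explicitly, so the function can’t silently pick up notebook state - a minimal sketch, reusing the df_cleaned from above:

def make_predictions(df: pd.DataFrame, threshold: float):
    # df is now an explicit parameter rather than a global
    return df.sum() > threshold

make_predictions(df_cleaned, 2.0)  # cleaning the data now actually helps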

Either way, it can be helpful to slowly build a library of useful functions as you explore your problem, first writing them in your notebooks before copy-pasting them into a .py file once they’ve been debugged and stood the test of time

This is how I balance the two extremes!

There is a note at the end of this post about a totally different fix to this problem - you may prefer that approach

Also, this all works well with my remote notebooks setup, which explains how to run your high performance number-crunching code on a powerful computer, which you access from a lightweight laptop

First, though, let’s go through some more basic Python stuff. You can skip to Combining jupyter lab and VS code if you already have a Python setup and know how to structure a basic Python library

Python and Pylance extensions

If you exclusively use notebooks, you may not have a more typical IDE set up for editing .py files. I like VS code, with the Python and Pylance extensions. They make Python development a lot more enjoyable

Creating a library

Now, creating a Python library is very useful, but a bit of a rabbit hole. You’ll learn a lot more about Python while doing so, and will make your life a lot easier later on, but it’ll take some time before you’re properly comfortable with it. But it’s worth it!

So, in your current work folder (eg. my-project), create a folder with the name of whatever you want your library to be named. Let’s go with mytoolbox. Inside it, create files called __init__.py and utils.py and fill them with some content. I use the following commands, but you can copy the function definitions in manually with a text editor if you don’t run Linux:

alex@laptop $ mkdir mytoolbox
alex@laptop $ echo "
def hello(name):
    return 'hello ' + name

def goodbye(name):
    return 'goodbye ' + name" > mytoolbox/utils.py
alex@laptop $ echo "
from . import utils
from .utils import hello" > mytoolbox/__init__.py

Your folder structure should look something like:

my-project/
    notebook-on-the-go.ipynb
    a-good-notebook-name.ipynb
    
    mytoolbox/
        __init__.py
        utils.py

Now, from your work folder (the one which contains the mytoolbox folder) launch a Python session and import your library. Let’s also alias it to make it easier to use:

>>> import mytoolbox as mtb
>>> mtb.hello('alex')
'hello alex'
>>> mtb.utils.hello('alex')
'hello alex'
>>> mtb.utils.goodbye('alex')
'goodbye alex'
>>> mtb.goodbye('alex')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'mytoolbox' has no attribute 'goodbye'

In particular, notice how the hello function was available via both mtb.hello and mtb.utils.hello - that’s because we imported it in __init__.py, making it accessible in mytoolbox directly, as if there was a mytoolbox.py file. goodbye wasn’t imported into __init__.py, so it is only accessible in the mytoolbox.utils namespace

That’s just a tiny intro to namespace/file/folder layout in Python - enough to get you started on a mini library, and so that the examples that follow make sense

Typing

Pylance adds some type checking capabilities to Python, which I highly recommend using to get the most out of an IDE. The tiny summary is that you can add “type hints” to Python to make your code more understandable, and to more easily catch certain kinds of bugs. As an example, let’s adjust the hello function from before:

def hello(name: str) -> str:
    return 'hello ' + name

Now it’s clear that the function takes a string argument and returns another string. Instant documentation! And Pylance will use those annotations to help you catch bugs. Check out mypy and the Pylance documentation for more info on how to add type hints to your code
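
For example, with those annotations in place, the type checker can flag broken calls before the code ever runs:

hello(42)          # flagged: an int is not a str
hello('alex') + 5  # flagged: adding a str and an int is a type error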

Confusingly, Pylance comes with a type checker but it’s deactivated by default - enable it by navigating to “Settings” > “Extensions” > “Pylance” > “Python > Analysis: Type checking mode” and changing it to “basic”

Combining jupyter lab and VS code

At this point, you have some library code in your mytoolbox folder, and some notebooks where you’re doing your work. Your folder structure is the same as what I wrote above. If you open a jupyter notebook, you should be able to import your library/toolbox using the same commands you used to access it in the REPL

Now, code you use a lot but don’t often change should almost always go in the toolbox. As you copy functions over, you’ll find parts of your code where you eg. accidentally relied on global variables, hopefully resulting in more robust code/fewer bugs. Code which you’re actively modifying or uncertain of should stay in notebooks

My workflow is generally something like:

  1. I test out a code snippet, check the documentation, and generally explore the problem in a notebook:

    import numpy as np
    help(np.convolve)

    import matplotlib.pyplot as plt

    # (a and v are arrays defined in earlier cells)
    plt.plot(np.convolve(a, v))
    # great, it works!
    
  2. before actually using the new code, I generally refactor it into a function (still defined in a notebook cell):

    def run_the_algorithm(v: list):
        # whoops, accidentally relied on global variable `a` without realising!
        # that's okay for now, since I'm just prototyping and still figuring
        # things out
        plt.plot(np.convolve(a, v))
    
    run_the_algorithm(v)
    

    At this point it stays in my notebook for a while

  3. once the function/algorithm/whatever is understood and doesn’t change too much, I move it into my library (and possibly get a NameError: name 'a' is not defined, or have my type checker find it for me). Finally, I call it from my notebook (there’s a sketch of the moved function just below this list):

    import mytoolbox as mtb
    mtb.visualization.run_the_algorithm(a, v)
    

In short, use each IDE (jupyter lab/VS code) for the kind of coding it’s best suited to!
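
For concreteness, here’s roughly what the moved function might look like in a hypothetical mytoolbox/visualization.py, with the accidental global turned into an explicit, type-hinted parameter (you’d also add from . import visualization to __init__.py, just like we did for utils):

import numpy as np
import matplotlib.pyplot as plt

def run_the_algorithm(a: np.ndarray, v: np.ndarray) -> None:
    # `a` is now an explicit argument rather than a notebook global
    plt.plot(np.convolve(a, v))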

Autoreloading changed code

Unfortunately, if you modify your library/toolbox after you’ve imported it, you’ll find that your modifications don’t register until you shut down your Python kernel, restart it, and then re-import the library. It’s incredibly confusing if there’s a difference between the code you see in your library and the code loaded in the current Python instance
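
You can reload a module by hand, but only the module you pass is re-executed, so it’s easy to end up in an inconsistent state:

import importlib
import mytoolbox as mtb

# re-executes mytoolbox/__init__.py, but not submodules it already imported,
# and objects built from the old code keep their old behaviour
importlib.reload(mtb)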

Luckily, there’s a better way: the autoreload ipython magic command. The documentation is pretty clear, so go ahead and give it a quick read

In each notebook, you’d enter something like the following in the top cell:

# load the autoreload extension, and change to setting number 2
%load_ext autoreload
%autoreload 2

# we want this toolbox to reload every time we change a file inside it
import mytoolbox as mtb

# exclude any libraries you don't want to automatically reload
%aimport -sympy

Continuing with the toolbox we’re writing in this tutorial, type print(mtb.hello('Alex')) to check that it works. Now, using VS code (or any other text editor), modify hello to take a second parameter as follows:

def hello(name: str, age: int) -> str:
    return f'hello {name}, you are {age} years old'

Back in jupyter, write mtb.hello('Alex', 24) and the updated function should auto-magically run!

Another possible workflow

Some people disagree with the sentiment I’ve laid out in this post! Jeremy Howard makes a strong argument, in his post introducing nbdev, for sticking with a workflow which involves only notebooks. I really do recommend reading through that post - it addresses the shortcomings of notebooks in a somewhat different way to this one. Both approaches keep notebooks centre stage for interactive development, with the difference mainly being that Jeremy prefers to keep code in .ipynb files (and generate .py files from them) while I prefer to manually factor out code when it makes sense

My (hopefully not too subjective) comparison of the autoreload + refactor approach outlined in this blog vs. the nbdev workflow is as follows:

  • Same

    • The central similarity: exploratory development happens in jupyter notebooks
  • Pros of autoreload + refactor

    • Use IDE features for established code. I really like python’s type hinting, so this works out really well for me! Refactoring is made a lot easier
    • The workflow is easily understood, even without looking at this blog post
    • You gain experience with both notebooks and IDEs, which may be important for a job
    • You use each development environment where it works best
  • Cons of autoreload + refactor

    • If you do need to update old code (which happens a lot with research code - it’s usually not clear how your initial assumptions will pan out), you’ll have lost out on “explaining through examples by recording a session of interaction”. You can try to recreate this, but I find it simply isn’t the same
    • You need to update function call sites for refactored code
    • The IDE features aren’t nearly as good if you don’t have type hints, whereas the nbdev approach will always give you auto-completion. If your research involves a lot of work with an untyped 3rd-party library, you’ll either need to write a type stub for it or face a worse development experience

It’s of course up to you to decide which approach you prefer, and which makes sense for a particular project

Sharing your library

One great result of doing this is that you can easily share your library with others. If at some point you want to be able to pip install your library, I highly recommend checking out flit. The trajectory optimization library I developed as part of my MSc has quite a simple structure, so you may find it useful as a template

Closing thoughts

I find that this workflow works particularly well when you’re focussing on one project/domain for a long time. It’s not hard to set up, although creating a “proper Python package” involves a surprisingly high learning curve. The skills you learn are also all quite useful for a range of things. Good luck, and let me know if there’s anything I can improve!