The Jupyter Cinematic Universe

A constellation of somewhat-compatible technologies from which we can extract a compromise between ease of 1) actually doing data science and 2) seeming to laypeople to be doing data science

February 9, 2017 — September 5, 2023

faster pussycat

premature optimization

python

The python-derived entrant in the scientific workbook field is called jupyter.

Interactive “notebook” computing for various languages; python/julia/R/whatever plugs into the “kernel” interface. Jupyter allows easy(ish) online-friendly worksheets, which are both interactive and easy to export for static online use. This is handy. Handy enough that it’s sometimes worth the many rough spots, and so I conquer my discomfort and use it.

1 Why jupyter?

What does jupyter buy us? Is it worth the set-up time configuring this contraption?

It took me a long time to realise that part of the answer to the first question is that jupyter is a de facto standard for running remote computation jobs interactively. The browser-based, network-friendly jupyter notebook is a natural, easy way to execute tedious computations on some other computer somewhere else, with some kind of a paper trail. In particular, it is much better over unreliable networks than are remote terminals or remote desktops, because the client/server architecture doesn’t need to do so many round-trips to get the code output. A good feature of jupyter (maybe its best) is that it is is a kind of re-designed network terminal. Certainly, if we need to execute a job that can be executed over remote desktop or jupyter, jupyter is going to be less awful if our connection has any lag at all, when every mouse click and keystroke involves waiting and twiddling your fingers.

What else? People make UX arguments, e.g. that jupyter is friendly and supports interactive plots and so on. I am personally ambivalent about those arguments. Jupyter can do some things better than the console. But that is an artificially restricted comparison. It can also do some things better than pencil and paper. On the other hand, most things that jupyter does, it does worse than a proper IDE or decent code editor. This comparison is more pertinent when you need to run the code in a However, sometimes those other things are not available on, say, your HPC cluster or cloud compute environment, and then this becomes a relevant advantage. Usually though, we can install VS Code Remote, unless we have angered the sysadmins.

But for now the main takeaway, I think, is that if, like me, you are confused by jupyter enthusiasts claiming it is “easy” or “fun”, it will make more sense if you append “in comparison some other horrible thing which I was forced to use by ignorance or cicumstance.”

There are other comparisons to make — some like it as a documentation format/literate coding environmennt. Once again, sure, it is better than text files. But then, RMarkdown is better.

We can now answer “Is it worth it?” with “Depends on the alternatives. Jupyter is adequate, and commonly available.”

2 Python-specific

See IPython.

3 Engineering pains

tl;dr Not to besmirch the efforts of the jupyter developers who are doing a difficult thing, in many cases for free, but I will complain about jupyter notebook with the jutification that it is best to go into these things with your eyes open. Jupyter is often touted as a wonderful solution for data science which makes stuff generally easier, but seems to me to merely offer a different selection of pain points to traditional methods, where the pain points are often surprising and novel, which does not make them better.

I’m an equivocal advocate of the jupyter notebook interface, which some days seems to counteract every plus with a minus, but at least is a pretty easy starting point. This is partly due to the particulars of jupyter’s design decisions, and partly because of the problems of notebook interfaces generally (Chattopadhyay et al. 2020). As with so many computer interfaces, my luke-warm endorsement is, in relative terms, fannish enthusiasm because often, as presaged, the alternatives are abysmal.

Jupyter: It’s friendly to use, but hard to install. It’s easy to graphically explore your data, but hard to keep that exploration in version control. It makes it easy to explore my code’s graphical output, but clashes with the fancy debugger that would make it easy to explore my code bugs. It is open source, and written in an easy scripting language, python, so it seems it should be easy to tweak to taste. In practice it’s an ill-explained spaghetti of python, javascript, compiled libraries and browsers that relate to one another in obscure ways that few people with a day job have time to understand or contribute to. There have been so many reboots, rewrite, re-architectures, and re-organisations that it’s hard to know what is going on. Each line of development takes place in a separate timeline in an extended cinematic multiverse, wherein the writer’s rooms are occasionally merged, but the timelines are never reconciled. Things regularly break either at the server or client side and I might need to upgrade either or both to fix it. I might have many different installs of each and need to upgrade a half-dozen different installs to keep them all working, because jupyter is deeply intertwined with the python packaging hell. It claims to be extensible but if I use any extensions, it is a constant struggle to keep jupyter finding the many intricate dependencies that are needing to keep the entire contraption running/ The sum total is IMO no more easy to run than most of the other UI development messes that we tolerate in academic software, let alone tweak. Case study: look a dependency of a dependency of the autocomplete function broke something and thus spawned a multi-month confusion of cascading problems and cost me several hours to fix across the few dozen different python environments I manage across several computers. This kind of tedious intermittent breakage is the much the cost of doing business with jupyter, and has been so for as long as I have been using the project, which is as long as it has existed.

These pain points are perhaps not so intrusive for projects of small-to-intermediate complexity and/or longevity. Indeed, jupyter seems good at making quick data science projects look smooth, shiny, and inviting. That is, at the crucial moment when I need to make my data science project look sophisticated-yet-friendly, it lures colleagues into my web(-based IDE). Then it is too late mwhahahahah you have fallen into my trap now you are committed you had better fund budget to maintain this mess. This entrapment might be a feature not a bug, as far as the realities of team dynamics and their relation to software development and organisational support. We want to lure people in until our problems become their problems, because a problem shared is a problem halved, right? Also shared trauma is a bonding experience.

Some argue that the weird / irritating constraints of jupyter can even lead to good architecture. See Guillaume Chevallier and Jeremy Howard. This sounds like an interactive twist on the old test-driven-development rhetoric. I could be persuaded of its merits, if I found time in between all the debugging.

I think of the famous adage “The fastest code is code that doesn’t need to run, and the best code is code you don’t need to write”. The uncharitable corollary might be “Thus, let’s make writing code horrible so that you write less of it”. That is not even necessarily a crazy position, and if that is what Guillame and Jeremy are saying, I guess I’ll take it.

Here is some verbiage by Will Crichton which explores some of these themes, The Future of Notebooks: Lessons from JupyterCon.

4 Terminology

Pain point: The lexicon of jupyter is confusing. Terminology tarpit alert.

A notebook is on one hand a style of interface, to which jupyter conforms to one interpretation of. Other applications with a notebook style of interface are Mathematica and MATLAB.

Jupyter interfaces communicate with a computational backed, which is called a kernel¹

These are software packages in which a unit of development is a type of notebook file on your disk, containing both code and output of that code. (In the case of jupyter this file format is marked by file extension .ipynb, which is short for “ipython notebook”, for fraught historical reasons.) One implementation of a notebook frontend interface over a notebook protocol for jupyter is called the jupyter notebook, launched by the jupyter notebook command which will open up a javascript-backed notebook interface in a web browser. This is the one that is usually assumed. Another common notebook-style interface implementation is called jupyter lab, which additionally uses much of the same jupyter notebook infrastructure but is distinct and only sometimes interoperable in ways which I do not pretend to know in depth. But there are multiple ‘frontends’ besides which interact over the jupyter notebook protocol to talk to a kernel.

Which sense of notebook is intended you have to work out from context, e.g. the following sentence is not at all tautological:

You dawg, I heard you like kernels and notebooks, so I started up your kernel jupyter notebook kernel in jupyter notebook

5 Jupyter as UI

See jupyter UI.

6 Front ends

See jupyter front ends.

7 jupyter kernels

Jupyter kernels now come in (at least) 2 flavours. I do no know to what extent they are interchangeable. Classic flavour is ipykernel which is shambolic but works. xeus is a new entrant.

7.1 ipykernel

7.1.1 Custom ipykernels

jupyter looks for kernel specs in a kernel spec directory, depending on my platform.

Say my kernel is dan; then the definition can be found in the following location:

Unixey: ~/.local/share/jupyter/kernels/dan/kernel.json
macOS: ~/Library/Jupyter/kernels/dan/kernel.json
Win: %APPDATA%\jupyter\kernels\dan\kernel.json

See the manual for details.

How to set up jupyter to use a virtualenv (or other) kernel? tl;dr Do this from inside the virtualenv to bootstrap it:

pip install ipykernel
python -m ipykernel install --user --name=my-virtualenv-name

Addendum: for Anaconda, we can auto-install all discoverable conda envs, which worked for me, whereas the ipykernel method did not.

conda install nb_conda_kernels

7.1.2 Custom kernel lite

e.g. if I wish to run a kernel with different parameters, for example with a GPU-enabled launcher. See here for a worked example for GPU-enabled kernels:

For computers on Linux with optimus, you have to make a kernel that will be called with optirun to be able to use GPU acceleration.

I made a kernel in ~/.local/share/jupyter/kernels/dan/kernel.json and modified it thus:

{
    "display_name": "dan-gpu",
    "language": "python",
    "argv": [
        "/usr/bin/optirun",
        "--no-xorg",
        "/home/me/.virtualenvs/dan/bin/python",
        "-m",
        "ipykernel_launcher",
        "-f",
        "{connection_file}"
    ]
}

Any script called can be set up to use CUDA but not the actual GPU, by setting an environment variable in the script, which is handy for kernels. So this could be in a script called noprimusrun:

CUDA_VISIBLE_DEVICES= $*

7.2 Xeus

A new Python kernel for Jupyter.

Long story short:

xeus-python is a lot lighter than ipykernel, which makes it a lot easier to implement new features on top of it.

xeus-python already works with the Jupyter Lab debugger

xeus-based kernels are more versatile in that one can overload e.g. the concurrency model. This is something that Kitware’s SlicerJupyter project takes advantage of to integrate with the Qt event loop of their Qt-based desktop application.

8 Jupyterlite

JupyterLite

JupyterLite is a JupyterLab distribution that runs entirely in the browser built from the ground-up using JupyterLab components and extensions.

9 Hosting static jupyter notebooks on the web

Various options. For one, github will attempt to render jupyter notebooks in github repos.; I have had various glitches and inconsistencies with images and equations rendering in such notebooks. Perhaps it is better in…

The fastest way to share your notebooks - announcing NotebookSharing.space - Yuvi Panda

You can upload your notebook easily via the web interface at notebooksharing.space:

Once uploaded, the web interface will just redirect you to the beautifully rendered notebook, and you can copy the link to the page and share it!

Or you can directly use the nbss-upload commandline tool: …

When uploading, you can opt-in to have collaborative annotations enabled on your notebook via the open source, web standards based hypothes.is service. You can thus directly annotate the notebook, instead of having to email back and forth about ‘that cell where you are importing matplotlib’ or ‘that graph with the blue border’. This is one of the coolest features of notebooksharing.space.

10 Hosting live jupyter notebooks on the web

Jupyter can host online notebooks, even multi-user notebook servers — if you are brave enough to let people execute weird code on your machine. I’m not going to go into the security implications. tl;dr encrypt and password-protect that connection. Here, see this jupyterhub tutorial.

10.1 Commercial notebook hosts

NB: This section is outdated. 🏗; I should probably mention the ill-explained Kaggle kernels and google cloud ML execution of same, etc.

Base level, you can run one using a standard a standard cloud option like buying compute time as a virtual machine or container, and using a jupyter notebook for their choice of data science workflow.

Kaggle Kernels are somehow also kaggle notebooks now or something? Anyway, it seems to execute code.
Paperspace - Gradient Notebooks

Gradient Notebooks is a web-based Jupyter IDE with free GPUs & IPUs.
sagemath runs notebooks online as part of their cocal service, with fancy features. There is a free tier and unspecified other pricing. Messy design but tidy open-source ideals.
Anaconda.org appears to be a python package development service, but they also have a sideline in hosting notebooks. ($7/month) Requires you to use their anaconda python distribution tools to work, which is… a plus and a minus. The anaconda python distro is simple for scientific computing, but if your hard disk is as full of python distros as mine is you tend not to want more confusing things and wasting disk space.
Microsoft’s Azure notebooks

Azure Notebooks is a free hosted service to develop and run Jupyter notebooks in the cloud with no installation. Jupyter (formerly IPython) is an open source project that lets you easily combine markdown text, executable code (Python, R, and F#), persistent data, graphics, and visualizations onto a single, sharable canvas called a notebook.
Google’s Colaboratory is hip now

Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.

Here is an intro and here is another

11 Pro tips and gotchas

11.1 Meta tips

Anne Bonner’s Tips, Tricks, Hacks, and Magic: How to Effortlessly Optimize Your Jupyter Notebook is actually full of useful stuff. So much stuff that upon reading it, I nearly forget I my past traumas with jupyter notebooks. If you must use jupyter, read her article and it will make stuff seem better. Many tips on this page I gleaned from her work

11.2 boilerplate

%%writefile basic_imports.py
%load basic_imports.py

11.3 Run a notebook without the jupyter server

See jupyter command line.

11.4 Offline MathJax in jupyter

e.g. for latex free mathematics.

python -m IPython.external.MathJax /path/to/source/MathJax.zip

11.5 I can’t see part of the cell!

Sometime, you can’t see the whole code cell; part of it overflows into some weird hidden alternate dimension. This is a known issue to do with vanishing scrollbars. The workaround is simple enough:

zooming out to 90% and zooming back in to 100%, Ctrl + - / +

11.6 IOPub data rate exceeded

You got this error and you weren’t doing anything that bandwidth intensive? Say, you were just viewing a big image, not a zillion images? It’s jupyter being conservative in version 5.0:

jupyter notebook --generate-config
atom ~/.jupyter/jupyter_notebook_config.py

update the c.ServerApp.iopub_data_rate_limit to be big, e.g. c.ServerApp.iopub_data_rate_limit = 10000000.

This is fixed after 5.0.

12 Securing

Modern jupyter is suspicious of connections per default and will ask you either for a magic token or a password and thereafter, I think, encrypts the connection (?), which is sometimes what I want. Not always.

When I am in HPC hell, accessing jupyter notebooks through a double SSH tunnel, the last thing I need is to put a hat on a hat by triply securing the connection. This is now adding more points of failure without any additional benefit. Also, sometimes the tokens do not work over SSH tunnels for me and I cannot work out why. I think it is something about some particular jupyter version mangling tokens, or possibly failing to report that it has not claimed a port used by someone else (although it happens more often than is plausible for the latter case). CodingMatters notes that the following invocation will disable all jupyter-side security measures:

$ jupyter notebook --port 5000 --no-browser --ip='*' --ServerApp.token='' --ServerApp.password=''

Obviously never do this unless you believe that everyone sharing a network with that machine has your best interests at heart. The best way to ensure that is to be accessing a machine through a firewall to a locked-down port.

There are various other useful settings which one could use to reduce security. In config file format for ~/.jupyter/jupyter_notebook_config.py:

c.ServerApp.disable_check_xsrf = True #irritates ssh tunnel for me that one time
c.ServerApp.open_browser = False # consumes a 1 time token and is pointless from a headless HPC
c.ServerApp.use_redirect_file = False # forces display of token rather than writing it to some file that gets lost in the containerisation and is useless in headless HPC
c.ServerApp.allow_password_change = True # Allow password setup somewhere sensible.
c.ServerApp.token = '' # no auth needed
c.ServerApp.password = password # actually needs to be hashed - see below

Eric Hodgins recommends this hack for a simple password without messing about trying to be clever with their browser infrastructure (which TBH does seem to break pretty often for me):

c = get_config()
c.ServerApp.ip = '*'
c.ServerApp.open_browser = False
c.ServerApp.port = 5000

# setting up the password
from IPython.lib import passwd
password = passwd("your_secret_password")
c.ServerApp.password = password

13 As a proxy

jupyter-server-proxy:

Jupyter Server Proxy lets you run arbitrary external processes (such as RStudio, Shiny Server, syncthing, PostgreSQL, etc) alongside your notebook, and provide authenticated web access to them.

Note

This project used to be called nbserverproxy. if you have an older version of nbserverproxy installed, remember to uninstall it before installing jupyter-server-proxy - otherwise they may conflict

The primary use cases are:

Use with JupyterHub / Binder to allow launching users into web interfaces that have nothing to do with Jupyter - such as RStudio, Shiny, or OpenRefine.

Allow access from frontend javascript (in classic notebook or JupyterLab extensions) to access web APIs of other processes running locally in a safe manner. This is used by the JupyterLab extension for dask.

14 References

Chattopadhyay, Prasad, Henley, et al. 2020. “What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities.”

Granger, and Pérez. 2021. “Jupyter: Thinking and Storytelling With Code and Data.” Computing in Science Engineering.

Himmelstein, Rubinetti, Slochower, et al. 2019. “Open Collaborative Writing with Manubot.” Edited by Dina Schneidman-Duhovny. PLOS Computational Biology.

Millman, and Pérez. 2014. “Developing Open Source Scientific Practice.”

Otasek, Morris, Bouças, et al. 2019. “Cytoscape Automation: Empowering Workflow-Based Network Analysis.” Genome Biology.

Sokol, and Flach. 2021. “You Only Write Thrice: Creating Documents, Computational Notebooks and Presentations From a Single Source.” In.

Footnotes

because in mathematics and computers science if you don’t know what to call something you call it a kernel. This confusing explosion of definitions is very much on-message for the notebook development.↩︎