Jacquard lab notebook

Version control and provenance for empirical research

🧪

Jacquard is a research project about better authoring environments for scientists.

In this lab notebook, we’ll share snippets of our findings as we explore the problem space and prototype potential solutions.

The entries start from the beginning, but you can jump to the most recent post: 02 · Tracking provenance

01 · Versioning and provenance for empirical research

2024 Aug 6

How do researchers write papers that contain data analysis and charts?

A figure from an empirical astronomy paper

Consider a hypothetical astronomy paper. Measurements are collected from a telescope and written into a raw data file, which is reduced into a smaller CSV, which is loaded into a Python analysis notebook, which outputs further intermediate data files, as well as several visualizations as PNGs, which are then dragged into a web-based collaborative LaTeX editor where the team edits the paper and refines the charts.

This cobbling-together of tools and intermediate data files can cause serious problems for teams of researchers.

Through conversations with researchers in fields ranging from astrophysics to oceanography, we’ve learned that these kinds of problems cause daily friction, stealing focus from important work and even causing mistakes. Some researchers use software engineering tools like Git and Makefiles to help with these problems, but those tools are an awkward fit for exploratory research programming, and aren’t easily accessible to scientists who are less familiar with the command line.

On this project, we’re prototyping Jacquard: a collaborative environment for writing empirical research papers. The goal is to free up researchers to focus more on their core work of science and communication, and less on tedious bookkeeping. (The name “Jacquard” comes from the automated loom that was an important step in the history of computing.)

Jacquard builds on years of work at Ink & Switch, including most recently Patchwork: a browser-based local-first collaboration environment with powerful version control utilities like branching and history views.

Patchwork explored features like diff views for reviewing text edits

We’re starting out by extending Patchwork to support the kinds of data needed by empirical research papers, like LaTeX files and data visualization scripts. From there, we aim to add on powerful capabilities like tracking provenance of derived artifacts or making suggestions on a branch. Throughout this process, the prototype should remain a simple web-based collaboration interface that’s accessible to researchers who don’t have prior experience with version control or build systems.

A couple of notes about our process:

First, while we’re aiming to invent new collaboration workflows, supporting real science work requires integrating with existing tools as well. So we’ll be taking a pragmatic approach that meets scientists where they are, building bridges to existing desktop workflows and programs.

Second, while this work is related to efforts in scientific reproducibility, that’s not our top priority. Reproducibility efforts often focus on packaging up results once they’re completed; we’re more interested in supporting and accelerating the messy process of getting to the results in the first place.

We’ll be posting updates as we go on this blog. If you’d like to follow along, feel free to sign up for the Ink & Switch email newsletter to receive periodic emails about our progress. And if you’re an empirical researcher (in any discipline) and would like to talk with us about your experiences with these problems, we’d love to chat—please reach out at geoffrey@inkandswitch.com.

02 · Tracking provenance

2024 Aug 14

Scientific papers straddle two worlds. They’re thoughtfully crafted prose documents, but they’re also computational documents containing data analyses and visualizations. Today, the prose and computational parts of a paper often live in different environments and tools, which causes friction for teams of scientists.

For instance, one scientist told us that he keeps chart-generation code in a git repo, but collaborates with coauthors in the web-based Overleaf tool. Every time a chart changes, he needs to manually drag files from his computer into the browser. Meanwhile, his collaborators have little visibility into the code and data that generate charts.

Other scientists have told us about difficulties staying oriented as they move between a paper and the data pipelines that feed into it. A paper’s LaTeX source refers to a chart PNG file… but wait, which data file and Python script were used to generate that PNG?

We wondered: what if one collaboration environment could host both the text of the paper and the data visualization code, making it seamless to edit them together? Here’s a prototype:

The demo shows a web-based collaboration environment with provenance — information about how computed artifacts were generated from source material. By keeping track of provenance, we know when an output file needs to be rebuilt, and we know how to do it. We can also use provenance to create a map of the project:

The build graph shows how files affect one another.
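To make the rebuild check concrete, here is a minimal Python sketch of how a provenance record could reveal that an output is out of date. The `ProvenanceRecord` shape, the use of content hashes, and the file names are assumptions for illustration, not a description of how the prototype stores provenance.

```python
import hashlib
from dataclasses import dataclass


def content_hash(data: bytes) -> str:
    """Fingerprint file contents so that edits can be detected."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    # Hypothetical record shape, not Jacquard's actual data model.
    output: str                   # file the command produced
    command: str                  # how to regenerate it
    input_hashes: dict[str, str]  # input path -> content hash at build time


def is_stale(record: ProvenanceRecord, files: dict[str, bytes]) -> bool:
    """An output is stale if any input is missing or has changed since the build."""
    for path, old_hash in record.input_hashes.items():
        if path not in files or content_hash(files[path]) != old_hash:
            return True
    return False


# An in-memory stand-in for the project's files.
files = {"reduced.csv": b"t,flux\n0,1.0\n", "plot.py": b"print('plot')\n"}
record = ProvenanceRecord(
    output="lightcurve.png",
    command="python plot.py reduced.csv",
    input_hashes={path: content_hash(data) for path, data in files.items()},
)

fresh_before_edit = not is_stale(record, files)
files["reduced.csv"] = b"t,flux\n0,1.0\n1,0.9\n"  # someone edits the data
stale_after_edit = is_stale(record, files)
```

Because the record also stores the command, the same structure answers both questions from the paragraph above: whether the chart needs rebuilding, and how to rebuild it.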

A core challenge we’re exploring in this prototype is how to integrate a collaborative web-based editor with the full power of running arbitrary computations on a Unix shell.

Behind the scenes of this demo, there is a watcher process which can be hosted on any computer, like a scientist’s own laptop or a cloud server. When the user requests a rebuild from the web interface, the watcher process detects the request, re-runs commands like the Python script or the LaTeX compiler, and syncs the results as well as provenance information back to the web view.
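The request, run, and sync loop can be sketched roughly as follows. This is an illustration under stated assumptions: a plain Python dict stands in for the synced document, and its keys (`requests`, `files`, `provenance`) and the request format are invented here rather than taken from the prototype.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def handle_rebuild_requests(doc: dict, workdir: Path) -> None:
    """Process pending rebuild requests found in the shared document.

    `doc` is a plain dict standing in for synced state; its layout is
    hypothetical and exists only for this sketch.
    """
    while doc["requests"]:
        req = doc["requests"].pop(0)
        # Materialize the inputs on disk so an ordinary command can read them.
        for name in req["inputs"]:
            (workdir / name).write_bytes(doc["files"][name])
        result = subprocess.run(
            req["argv"], cwd=workdir, capture_output=True, text=True
        )
        if result.returncode != 0:
            doc["provenance"][req["output"]] = {"error": result.stderr}
            continue
        # Copy the result, and a note of how it was made, back into shared state.
        doc["files"][req["output"]] = (workdir / req["output"]).read_bytes()
        doc["provenance"][req["output"]] = {
            "argv": req["argv"],
            "inputs": req["inputs"],
        }


# A tiny analysis script: count the data rows in a CSV.
script = (
    b"from pathlib import Path\n"
    b"rows = Path('data.csv').read_text().splitlines()\n"
    b"Path('summary.txt').write_text(str(len(rows) - 1))\n"
)
doc = {
    "files": {"data.csv": b"t,flux\n0,1.0\n1,0.9\n", "count.py": script},
    "requests": [
        {
            "argv": [sys.executable, "count.py"],
            "inputs": ["data.csv", "count.py"],
            "output": "summary.txt",
        }
    ],
    "provenance": {},
}
with tempfile.TemporaryDirectory() as tmp:
    handle_rebuild_requests(doc, Path(tmp))
```

A real watcher would wait for requests to arrive over a sync connection instead of draining a list once, but the shape is the same: inputs go out to the filesystem, an ordinary command runs, and outputs plus provenance come back.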

We think the ability to run the watcher process on any computer provides useful flexibility. At first, a scientist can try out computations without needing to make them portable or run them in a Docker container. Later, it’s straightforward to introduce a more reproducible environment whenever it’s needed.

Currently the provenance tracking is fairly manual and relies on user annotations when running commands, though we have developed some automation helpers for specific cases like a LaTeX build. In the long run, we’d like to explore tracing filesystem access to automatically determine which files are used by a command.
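As one guess at what an automation helper for a LaTeX build might involve, the Python sketch below scans LaTeX source for the files it names, so that they could be recorded as inputs without manual annotation. The function name `latex_inputs` and the choice of commands to scan for are assumptions, and this is not the helper used in the prototype.

```python
import re

# Matches \includegraphics[opts]{path}, \input{path}, and \include{path}.
_DEPENDENCY = re.compile(
    r"\\(includegraphics|input|include)\s*(?:\[[^\]]*\])?\s*\{([^}]+)\}"
)


def latex_inputs(source: str) -> list[str]:
    """Return the files a LaTeX source names, in first-seen order.

    Paths are returned as written; LaTeX itself may append ".tex"
    to the argument of \\input or \\include.
    """
    found: list[str] = []
    for line in source.splitlines():
        # Drop everything after an unescaped % (a LaTeX comment).
        line = re.sub(r"(?<!\\)%.*", "", line)
        for _, path in _DEPENDENCY.findall(line):
            path = path.strip()
            if path not in found:
                found.append(path)
    return found


paper = r"""
\documentclass{article}
\begin{document}
\input{sections/intro}
% \includegraphics{old.png}
\includegraphics[width=\linewidth]{figures/lightcurve.png}
\end{document}
"""
inputs = latex_inputs(paper)  # ['sections/intro', 'figures/lightcurve.png']
```

Static scanning like this only covers dependencies that are spelled out in the source, which is one reason tracing filesystem access is an appealing longer-term direction.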

Takeaways

Two surprises so far from our experience with this prototype:

  1. Implicit build spec: “Build systems” like this usually have a “makefile” – a specification file that defines how outputs should be generated from inputs. But we realized during our design phase that the provenance information we were tracking made a makefile unnecessary. Once you’ve run a command to generate an output file, the implicit trace of provenance records the information that would ordinarily be explicitly written in a makefile. We’ve gotten feedback from scientists that this approach serves their needs well.

  2. Provenance as map: We created the build graph as a quick way to see which files are out of date, but we’ve also found it surprisingly useful as a “map” for navigating a project’s contents. It’s easy to find the final PDF at the bottom of the graph. Pipelines of input into the paper are automatically organized above. Although the simple graph we’re showing now might become a tangle in larger projects, we plan to explore more ways to let authors map out their projects through provenance relationships.
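Both takeaways can be illustrated with a small Python sketch: given only a log of commands that were run, with their inputs and outputs, it is possible to derive the build graph and a rebuild plan with no makefile. The record format, file names, and the `rebuild_plan` function are hypothetical, chosen to mirror the astronomy example rather than the prototype's internals.

```python
from graphlib import TopologicalSorter

# Hypothetical provenance log: one entry is recorded each time a command runs.
records = [
    {
        "output": "paper.pdf",
        "command": "latexmk -pdf paper.tex",
        "inputs": ["paper.tex", "lightcurve.png"],
    },
    {
        "output": "lightcurve.png",
        "command": "python plot.py",
        "inputs": ["plot.py", "reduced.csv"],
    },
    {
        "output": "reduced.csv",
        "command": "python reduce.py",
        "inputs": ["reduce.py", "raw.dat"],
    },
]


def rebuild_plan(records: list[dict], changed: str) -> list[str]:
    """Commands to re-run, in dependency order, after `changed` is edited."""
    by_output = {r["output"]: r for r in records}
    # Edges of the build graph: each output maps to the files it was derived from.
    graph = {out: set(r["inputs"]) for out, r in by_output.items()}
    plan = []
    dirty = {changed}
    # static_order() yields every file after the files it depends on.
    for node in TopologicalSorter(graph).static_order():
        record = by_output.get(node)
        if record and dirty & set(record["inputs"]):
            dirty.add(node)
            plan.append(record["command"])
    return plan


plan = rebuild_plan(records, "reduced.csv")
# ['python plot.py', 'latexmk -pdf paper.tex']
```

The same `graph` dictionary is what a build graph view would draw: source files at the top, the final PDF at the bottom.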

Prior art

Our thinking on provenance tracking has been inspired by prior work, in particular two papers:

FileWeaver: Flexible File Management with Automatic Dependency Tracking, by Julien Gori, Han Han, and Michel Beaudouin-Lafon, shows a dependency graph view of build tasks over a Unix filesystem:

FileWeaver shows a dependency graph between files.

Burrito, by Philip Guo and Margo Seltzer, tracks research programming activities on a Linux machine, including showing provenance relationships between inputs and outputs.

Burrito shows relationships between input and output files.

Our prototype shares the general idea of this prior work: that tracking and visualizing provenance can help people manage projects with complex file structures. But we’re working in a collaborative local-first editor built on Automerge rather than a desktop Unix environment, which opens up new technical opportunities to reimagine features like versioning and provenance.

Thanks to Will Golay for providing the code for the sample paper in the demo.

