01 · Versioning and provenance for empirical research

How do researchers write papers that contain data analysis and charts?

A figure from an empirical astronomy paper

Consider a hypothetical astronomy paper. Measurements are collected from a telescope and written into a raw data file, which is reduced into a smaller CSV, which is loaded into a Python analysis notebook, which outputs further intermediate data files, as well as several visualizations as PNGs, which are then dragged into a web-based collaborative LaTeX editor where the team edits the paper and refines the charts.

This cobbling-together of tools and intermediate data files can cause serious problems for teams of researchers:

Through conversations with researchers in fields ranging from astrophysics to oceanography, we’ve learned that these kinds of problems cause daily friction, stealing focus from important work and even causing mistakes. Some researchers use software engineering tools like Git and Makefiles to help with these problems, but those tools are an awkward fit for exploratory research programming, and aren’t easily accessible to scientists who are less familiar with the command line.

On this project, we’re prototyping Jacquard: a collaborative environment for writing empirical research papers. The goal is to free up researchers to focus more on their core work of science and communication, and less on tedious bookkeeping. (The name “Jacquard” comes from the automated loom that was an important step in the history of computing.)

Jacquard builds on years of work at Ink & Switch, including most recently Patchwork: a browser-based local-first collaboration environment with powerful version control utilities like branching and history views.

Patchwork explored features like diff views for reviewing text edits

We’re starting out by extending Patchwork to support the kinds of data needed by empirical research papers, like LaTeX files and data visualization scripts. From there, we aim to add on powerful capabilities like tracking provenance of derived artifacts or making suggestions on a branch. Throughout this process, the prototype should remain a simple web-based collaboration interface that’s accessible to researchers who don’t have prior experience with version control or build systems.

A couple notes about our process:

First, while we’re aiming to invent new collaboration workflows, supporting real science work requires integrating with existing tools as well. So we’ll be taking a pragmatic approach that meets scientists where they are, building bridges to existing desktop workflows and programs.

Second, while this work is related to efforts in scientific reproducibility, that’s not our top priority. Reproducibility efforts often focus on packaging up results once they’re completed; we’re more interested in supporting and accelerating the messy process of getting to the results in the first place.

We’ll be posting updates as we go on this blog. If you’d like to follow along, feel free to sign up for the Ink & Switch email newsletter to receive periodic emails about our progress. And if you’re an empirical researcher (in any discipline) and would like to talk with us about your experiences with these problems, we’d love to chat—please reach out at geoffrey@inkandswitch.com.

Next entry: 02 · Tracking provenance


The Ink & Switch Dispatch

Keep up-to-date with the lab's latest findings, appearances, and happenings by subscribing to our newsletter. For a sneak peek, browse the archive.