Provenance for science papers, local-first access control

September 2024

It’s been a busy summer at Ink & Switch! In this Dispatch, we’ll introduce you to two new projects at the lab: exploring writing environments for science papers and local-first access control. We also have some updates on WASM packaging for Automerge, and a new researcher-in-residence.

Jacquard: Version control and provenance for empirical research

We believe one of the most important capabilities in creative tools is version control: helping people collaborate, review suggestions, and see what’s changed, in both synchronous and asynchronous editing situations. Last year we published Upwelling, a prototype of “draft layers” for asynchronous collaborative writing. Next, Patchwork built on that work to explore “universal version control”: powerful diffing and branching tools built not just for writing, but also drawings, spreadsheets, and other kinds of media.

This summer, Paul Sonnentag and Geoffrey Litt from the Patchwork team have teamed up with Josh Horowitz to explore universal version control in a new domain: scientific research papers.

We’ve heard from scientists in a variety of fields that their digital tools make it cumbersome to collaborate on data analyses and writing papers. One problem is that limited version control makes it difficult to review collaborators’ edits. Another issue is that writing and data analysis are managed in separate environments, which leads to tedious manual work stitching together data across tools.

A figure from an empirical astronomy paper

On this project, we’re prototyping Jacquard: a collaborative environment for writing empirical research papers. The goal is to free up researchers to focus more on their core work of science and communication, and less on tedious bookkeeping. (The name “Jacquard” comes from the automated loom that was an important step in the history of computing.)

Our first demo of Jacquard shows a collaborative editor for LaTeX and Python files. It tracks a full provenance chain across all the source files needed to build an astronomy paper.

A provenance graph showing the steps involved in compiling an astronomy paper using Python and LaTeX
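
To make the idea of a provenance chain concrete, here is a minimal sketch in Python. It is not Jacquard's implementation: the file names, the `PROVENANCE` mapping, and the helper functions are all hypothetical. It only shows how a graph of "derived from" edges can yield a build order and tell you which outputs go stale after an edit.

```python
from graphlib import TopologicalSorter

# Hypothetical provenance graph: each artifact maps to the artifacts it is
# derived from. File names are illustrative, not taken from the Jacquard demo.
PROVENANCE = {
    "paper.pdf": {"paper.tex", "figure1.pdf"},
    "figure1.pdf": {"plot.py", "observations.csv"},
    "paper.tex": set(),
    "plot.py": set(),
    "observations.csv": set(),
}


def build_order(graph):
    """Return artifacts in an order where every input precedes its outputs."""
    return list(TopologicalSorter(graph).static_order())


def upstream(graph, artifact):
    """Return every artifact that `artifact` transitively depends on."""
    seen = set()
    stack = [artifact]
    while stack:
        for dep in graph.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def stale_after_edit(graph, edited):
    """Return the artifacts that need rebuilding once `edited` changes."""
    return {a for a in graph if edited in upstream(graph, a)}
```

With this graph, `stale_after_edit(PROVENANCE, "observations.csv")` returns the figure and the paper, while the LaTeX source and the plotting script are left alone.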

For more details on the demo and our goals for the project, check out the Jacquard lab notebook, where we’ll post further updates. And if you’re a scientist who works with data and struggles with collaboration, we’d love to hear from you—email geoffrey@inkandswitch.com.

Beehive: Local-first access control

Cloud-based services provide excellent access control features, giving users fine-grained control over who has read and write access to a document, as well as features like user groups and folders, which implicitly grant access to their members and contents respectively. For local-first software to be successful, we’ll need to provide similar features without relying on a central server to enforce access control at the network boundary. In fact, we want servers to become simple, interchangeable relays that only operate over encrypted data. The goal of the Beehive project is to design and build a production-ready instance of such a system that is general enough for most applications.

To date, the local-first ecosystem has mostly used a purely pull-based model, which is often sufficient for personal projects: each user can manually decide which peers to connect to and which changes should be applied. On the other hand, many production contexts (e.g. corporations, journalists, or even planning a surprise party) are lower trust, require higher alignment, and are ideally low-touch enough that it’s not up to each person in a large organization to separately and manually infer who to trust.

This naturally leads to questions like:

Who should the group members accept new edits from?
What is a specific user able to do to this document?
How to only share documents with some people but not others?
What to do if a previously trusted peer starts behaving badly?
What if an admin’s device is lost or stolen?

These are especially challenging in a local-first setting since there is no network boundary or central server to guard reads and writes. By its nature, local-first requires that any access control mechanism must travel with the data itself and work without a central guard. There are also some tricky edge cases due to causal consistency. What should happen with content that’s later discovered to be malicious when honest ops causally depend on it? What is the best strategy to handle ops from an agent that was revoked concurrently (especially given that “backdating” operations is possible)? If a document has exactly two admins (and many non-admin users), what should happen if the admins concurrently revoke each other (for instance, because one is malicious)?

Recently, Brooklyn Zelenka and Alex Good (with significant input from Martin Kleppmann) have been hard at work building Beehive: a local-first access control system that seeks to address the above concerns. At a very high level, the current approach in Beehive is composed of three layers:

End-to-End Encryption (E2EE): With post-compromise security (PCS) and key management
A Group Management CRDT: Self-certifying, concurrent group management complete with coordination-free revocation
Convergent Capabilities: A new capability model appropriate for CRDTs that sits between object- and certificate-capabilities

A Beehive document in isolation, with a simplified view of its stateful delegation graph
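
As a rough illustration of what a delegation graph can express, the Python sketch below models users, groups, and documents as nodes, with each edge granting a level of access. Everything here is an assumption for illustration: the `Access` levels, the `DelegationGraph` class, and the rule that a chain is capped by its weakest edge are not taken from Beehive. It is also a plain in-memory structure, so it ignores the concurrency and encryption concerns that the real design has to handle.

```python
from enum import IntEnum


class Access(IntEnum):
    """Hypothetical access levels, ordered from weakest to strongest."""
    NONE = 0
    READ = 1
    WRITE = 2
    ADMIN = 3


class DelegationGraph:
    """Toy delegation graph: an edge grants a subject some access to a target."""

    def __init__(self):
        # (subject, target) -> Access granted by that edge
        self.edges = {}

    def delegate(self, subject, target, access):
        self.edges[(subject, target)] = access

    def revoke(self, subject, target):
        self.edges.pop((subject, target), None)

    def access(self, agent, document):
        """Best access `agent` has to `document` over any delegation chain.

        Assumption: access along a chain is capped by its weakest edge.
        """
        best = {agent: Access.ADMIN}  # an agent has full authority over itself
        stack = [agent]
        while stack:
            node = stack.pop()
            for (subject, target), granted in self.edges.items():
                if subject != node:
                    continue
                level = min(best[node], granted)
                if level > best.get(target, Access.NONE):
                    best[target] = level
                    stack.append(target)
        return best.get(document, Access.NONE)
```

For example, if "alice" is a full member of an "editors" group that holds `WRITE` on a document, `access("alice", ...)` yields `WRITE`; revoking her membership edge drops her to `NONE` without touching the group's own grant.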

We’ve made substantial progress in designing the core data structures and algorithms, though a few open questions remain. We are currently refining our approach to address revocation edge cases, ensure causal stability under E2EE, balance forward security in operation-based CRDTs, and minimize trust in sync servers. As always, usability, space and performance are also top-of-mind.

Causal key management: a strategy for managing E2EE keys based on the causal structure of a document. Similar to a Cryptree, having the key to some encrypted chunk lets you iteratively discover the rest of the keys for that chunk's causal history, but not its parents or siblings.
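
The key-discovery behaviour described in this caption can be modelled in a few lines of Python. This is a sketch under stated assumptions, not Beehive's design: there is no real encryption (a `Chunk` simply refuses to open without the matching key), the `Chunk` class and `readable_history` function are hypothetical names, and "causal history" is taken to mean the chunks a given chunk depends on.

```python
import secrets


class Chunk:
    """A sealed chunk. Sealing is modelled with a key check, not real encryption."""

    def __init__(self, chunk_id, payload, predecessors=()):
        self.chunk_id = chunk_id
        self.key = secrets.token_hex(16)  # held by whoever may read this chunk
        self._payload = payload
        # The keys of direct causal predecessors live inside the sealed body.
        self._predecessor_keys = {p.chunk_id: p.key for p in predecessors}

    def open(self, key):
        """Return (payload, predecessor keys) if `key` matches, else raise."""
        if not secrets.compare_digest(key, self.key):
            raise PermissionError(f"wrong key for chunk {self.chunk_id}")
        return self._payload, dict(self._predecessor_keys)


def readable_history(store, chunk_id, key):
    """Open one chunk, then walk back through everything it causally depends on."""
    payloads = {}
    pending = [(chunk_id, key)]
    while pending:
        cid, k = pending.pop()
        if cid in payloads:
            continue
        payload, predecessor_keys = store[cid].open(k)
        payloads[cid] = payload
        pending.extend(predecessor_keys.items())
    return payloads
```

In this model, holding the key for one chunk is enough to read that chunk and every chunk it builds on, while chunks outside that history (here, concurrent or later ones) stay sealed.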

It’s also worth mentioning another ongoing project at the lab, headed up by Alex Good, focused on data synchronization both peer-to-peer and via sync servers. We’ve realized that sync and secrecy strongly interact. Broadly speaking, sync protocols benefit from more metadata (to efficiently calculate deltas), whereas cryptographic protocols aim to minimize metadata exposure. This tension extends across related systems, including merging E2EE compressed chunks and determining whether a peer has already received specific operations when a sync server cannot access them in cleartext.

Fortunately, combining these systems can sometimes result in more than the sum of their parts. For instance, convergent capabilities help calculate which documents are of interest to a particular agent, so the sync system knows which documents to send deltas for. For these reasons, we’re treating synchronization and authorization as part of a larger, unified project, even though each will yield distinct artifacts.
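
One way to picture how authorization data could narrow down sync work is the Python sketch below. It is an illustration only: the `delegations` mapping, the opaque version strings, and both function names are hypothetical, and a real sync server working over encrypted data would have far less information than this toy assumes.

```python
def documents_of_interest(delegations, documents, agent):
    """Return the documents reachable from `agent` through delegation edges.

    `delegations` maps a node (user or group) to the nodes it has access to.
    """
    reachable = {agent}
    frontier = [agent]
    while frontier:
        node = frontier.pop()
        for target in delegations.get(node, ()):
            if target not in reachable:
                reachable.add(target)
                frontier.append(target)
    return reachable & documents


def documents_needing_sync(delegations, server_versions, peer_versions, agent):
    """List documents the agent can access and whose version differs on the peer."""
    interesting = documents_of_interest(delegations, set(server_versions), agent)
    return sorted(
        doc for doc in interesting
        if peer_versions.get(doc) != server_versions[doc]
    )
```

Documents the agent cannot reach never enter the comparison, so the sync side only computes deltas for the subset that matters to that agent.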

Automerge Anywhere

The Automerge team has made some big improvements to the WASM packaging setup for the library, which makes it usable in more contexts, including vanilla JS applications with no bundler, React Native applications on mobile devices, cloud services like Cloudflare Workers or Val.town, and more.

For more details, see the full writeup on the Automerge blog.

Researchers-in-residence

Elliot joined as a researcher-in-residence this month. Elliot is working on researching tools that build bridges between diverse ways of specifying programs, from the esoteric to the mundane. While in residence, Elliot will be working on programming language prototypes to explore variations on concepts like destructuring and evaluation direction. Elliot is here to strengthen his research fundamentals, collaborate, and get feedback.

Lu Wilson published their essay about Arroost, a music-making tool. In the essay, they make the case for tackling emotional blockers when building creative tools.

What’s a few more open tabs?

From lab researcher Ivan Reese: 2222, a mysterious programming game
We wrote on the Patchwork notebook about Universal Comments: a general abstraction for working with annotations on any kind of document.
You might enjoy this excellent study on file system design by Dominic Giampaolo, lead creator of the Be File System, who now works on file systems at Apple.

That’s all for now, until next time.