Phase 2 Gestures
Fall 2023
Our prototype is built on the web platform — written in TypeScript, rendered with SVG. It runs inside a tiny Swift iOS app, which loads our compiled JS into a webview. The Swift app captures all incoming touch and pencil events and forwards them to the JS context. This arrangement allows for simultaneous capture of pencil and touch events, which (last we checked) is not possible in the browser.
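To make that arrangement concrete, here's a minimal sketch of the JS side, with an assumed bridge function name and event shape (the real bridge may differ): the Swift app pushes batches of events into a queue, and we drain the queue once per rendered frame.

```typescript
// Hypothetical event shape and bridge entry point; names are assumptions for illustration.
type TouchData = {
  id: number;                        // one id per finger or pencil contact
  device: "pencil" | "finger";
  phase: "began" | "moved" | "ended";
  x: number;
  y: number;
};

const queue: TouchData[] = [];

// The Swift app calls this via the webview bridge with every event since its last call.
(window as any).nativeEvents = (events: TouchData[]) => {
  queue.push(...events);
};

function frame() {
  // Process everything that arrived since the last frame, then render.
  for (const event of queue.splice(0)) {
    processEvent(event); // hand each event to the gesture system (described below)
  }
  render();
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);

// Stubs so the sketch stands alone.
function processEvent(event: TouchData) {}
function render() {}
```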
Every time we render a frame, we process the queue of new input events since the last frame. Each event is passed through a gestural input system, which was designed to satisfy the following goals:
- Gestures need to be easy to code. We want to quickly whip up new ones or try variations.
- The default behaviour should be the best-feeling behaviour.
Traditional approaches to input handling make it hard to satisfy these goals, so we rejected them:
- The browser’s element event APIs (i.e. capture/bubble) require carefully managing a lot of state. You need to manually add and remove listeners. You need to store any state that changes over time outside the listeners. (We’d also need to do some extra busywork to use these APIs, since our events are dispatched from Swift, not from the DOM.)
- “Delegate” event handling (popularized by jQuery and game engines) makes the above more flexible by not requiring you to add/remove listeners on individual elements. Instead, you can bring your own event routing logic. Most common approaches to event routing for realtime continuous graphics apps (such as immediate-mode in games) lead to conditional-heavy code with high cyclomatic complexity. These approaches benefit from being very explicit, and it’s easy to accrete new gestures — just add another `if` (a rough sketch of this pattern follows this list). But each additional conditional makes the whole system harder to reason about, which makes it hard to modify existing logic. Also, you still have to find a home for any state that changes over time.
- Dataflow or FRP/Elm architectures improve on the above by introducing some indirection. For instance, instead of a big chain of conditionals, you turn input events (`"mousedown"`) into domain events ("create circle at (x, y)") and then write handlers for these finer-grained events. But this approach suffers from visibility problems. The relationship between gestures becomes implicit: it’s easy to add new gestures and modify existing ones, but hard to tell when gestures will conflict (e.g. when elements on the canvas overlap, which one gets clicked, and where is that controlled?).
- Alternatively, state machines / statecharts meaningfully address the complexity of the “delegate” approach by introducing structure that maps well to the problem domain, and (ideally) by visualizing that structure. But there’s no support for state machines at the language level in TypeScript, and off-the-shelf libraries (like XState) require buying in to an ecosystem that carries a lot of bad design influence from React, solving the wrong problems.
Ivan designed an approach to gesture handling that borrows from the above, but with a few key differences. This approach also takes careful advantage of JavaScript closures to dramatically simplify state maintenance.
Briefly, each gesture supported by the system is listed in an array, so that we can tell at a glance which gestures exist and in what order they’ll be evaluated (and thus, which gesture will win any conflicts). Gestures are implemented using functions that check the state of the world plus the current event, and return a new `Gesture` instance when conditions are right. The `Gesture` instance is created with closures for each relevant event phase, and then the instance is cleaned up automatically when the current touch ends.
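Here is a rough sketch of that shape, with invented names and types rather than our actual code:

```typescript
// Invented types for illustration.
interface Handle { moveTo(x: number, y: number): void }
interface World { findHandleNear(x: number, y: number): Handle | null }
type TouchData = { id: number; device: "pencil" | "finger"; phase: string; x: number; y: number };

class Gesture {
  constructor(
    public name: string, // can be rendered to the canvas to aid debugging
    public handlers: {
      moved?: (event: TouchData) => void;
      ended?: (event: TouchData) => void;
    }
  ) {}
}

// Every gesture, listed in one place, in the order they'll be evaluated
// (so whichever matches first wins any conflict).
const gestures = [
  (world: World, event: TouchData): Gesture | null => {
    const handle = world.findHandleNear(event.x, event.y);
    if (!handle) return null;
    // `handle` is captured by the closures below: no separate state object,
    // and nothing to clean up by hand when the touch ends.
    return new Gesture("Drag Handle", {
      moved: (e) => handle.moveTo(e.x, e.y),
      ended: () => { /* e.g. snap the handle, commit the drag, etc. */ },
    });
  },
  // ...more gestures...
];
```

How a `Gesture` instance actually gets tied to a touch is sketched further down, under Best-feeling defaults.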
![](Untitled-2.png)
Our previous “delegate”-esque approach. Notice the boilerplate needed to handle state — the `objects.touchedMetaToggle` object needs to be created when the gesture begins, and cleaned up when the gesture ends. Notice the need for nested conditionals (and note that this `handleMetaToggleFingerEvent` function is itself called from a nested conditional).
![](Untitled-3.png)
The new approach. Notice that the `metaToggle` variable is used both for the `if (metaToggle)` check, and for effects inside the gesture. And thanks to the use of closures, no manual state cleanup is needed. Also note that the `Gesture` instance is given the name `"Touching Meta Toggle"`, which can be rendered to the canvas to aid debugging. And of course, note that even with comments, this code is dramatically shorter but totally equivalent to the previous approach. You mostly see the effects of the gesture, as opposed to boilerplate machinery.
Easy to code
Creating a new gesture should require very little boilerplate. You want to be able to just write “when the person does this little move with their finger meat to that weird little blob of graphics, do this cool computer stuff.” You should not need to say “when the person begins touching this object, add a new touchmove listener and create a variable to store the original position so that we can diff new positions against it, and then when the distance delta exceeds a certain threshold switch to a different listener on a different object, but if at any point the touch ends clean up these listeners and the position variable”. In other words:
- You should be able to write gesture handling code that creates state without having to worry about cleaning up that state later.
- There’s a common set of data that is frequently used by gestures, like position deltas, which should be tracked and exposed conveniently and automatically to gesture handling code (a hypothetical shape for this data is sketched just after this list).
- Writing a gesture should feel like writing immediate-mode GUI code, rather than retained-mode. In other words, you shouldn’t have to add and remove listeners.
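For the second point, the kind of automatically-tracked, per-touch data in question might look something like this hypothetical shape (illustrative only, not the actual API):

```typescript
// Hypothetical per-touch data the gesture system could track and expose automatically,
// so individual gestures don't have to bookkeep it themselves.
interface TrackedTouch {
  position: { x: number; y: number };       // current position
  startPosition: { x: number; y: number };  // where the touch began
  delta: { x: number; y: number };          // movement since the previous frame
  distanceMoved: number;                    // total distance travelled since the touch began
  elapsed: number;                          // seconds since the touch began
}
```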
As demonstrated in the screenshots above, the new gesture system satisfies the goal of being easy to code.
Best-feeling defaults
Many common approaches to input handling do a bad job of allowing simultaneous gestures to overlap. This could be because most UI code patterns date back to an era where all input came from a mouse and keyboard. Keyboard input is binary, and different keys tend to be semantically non-overlapping. An exception that proves the rule comes up when handling, say, WASD directional input:
if key.a? then goLeft()
else if key.d? then goRight()
This code will behave poorly if both A and D are pressed at the same time. But this sort of overlap tends to be the exception rather than the norm. Further, the keyboard and mouse tend to be non-overlapping, with the mouse performing freeform spatial input or directly manipulating on-screen GUI elements, and the keyboard invoking shortcuts or performing structured text input. Even the mouse buttons tend to be non-overlapping, with each button performing wildly different actions when clicked.
Gesture input in our system needs to be wildly overlapping. For one, we want the option to explore all sorts of different ideas for gestures, so we can’t draw any easy divisions, like declaring that pencil and touch never overlap. But even if we could, finger inputs need to be able to overlap with each other. If there are 4 draggable handles on screen, you should be able to drag any one of them at any time, or all 4 of them at the same time. You should be able to drag them regardless of whatever else is happening. Sadly, most iOS apps fail miserably at this — you mostly use them with one finger that acts like a mouse pointer (except it doesn’t have a hover state).
The new gesture system allows gestures to overlap by default. Understanding how this works requires understanding the machinery that powers the gesture system, which is out of scope for this doc, but here’s the short version. We refer to the pencil, and each finger, as a separate touch. Each touch is associated exactly one-to-one with a gesture instance. The gesture instance encapsulates all the state needed to handle that one touch. Every step involved in creating gestures happens in isolation. This means we can have as many simultaneous touches as we want. But it carries a downside: gestures are only allowed to interact in extremely limited ways. (For example, to make pseudo-modes work with the pencil, finger touches on the empty canvas aren’t treated as a gesture.) We accepted this downside because it doesn’t restrict the sort of gestural input we’re interested in, and because it’s not too difficult to find careful workarounds when two gestures do need to interact.
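As a sketch of that short version, again with invented names mirroring the earlier sketches: each touch id maps to at most one live gesture instance, instances never see each other, and the state captured by an instance's closures simply disappears with it when the touch ends.

```typescript
// Invented types, mirroring the earlier sketches.
type TouchData = {
  id: number;
  device: "pencil" | "finger";
  phase: "began" | "moved" | "ended";
  x: number;
  y: number;
};
interface World {} // stand-in for whatever world state recognizers inspect
interface GestureInstance {
  name: string;
  handlers: { moved?: (e: TouchData) => void; ended?: (e: TouchData) => void };
}
type Recognizer = (world: World, event: TouchData) => GestureInstance | null;
declare const gestures: Recognizer[]; // the ordered array of gesture recognizers

// At most one live gesture instance per touch id.
const activeGestures = new Map<number, GestureInstance>();

function processEvent(world: World, event: TouchData) {
  if (event.phase === "began") {
    // Ask each recognizer, in order, whether it wants this touch.
    for (const recognize of gestures) {
      const gesture = recognize(world, event);
      if (gesture) {
        activeGestures.set(event.id, gesture);
        break;
      }
    }
  } else {
    const gesture = activeGestures.get(event.id);
    if (!gesture) return;
    if (event.phase === "moved") gesture.handlers.moved?.(event);
    if (event.phase === "ended") {
      gesture.handlers.ended?.(event);
      // The instance, and all the state its closures captured, simply goes away.
      activeGestures.delete(event.id);
    }
  }
}
```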
The Dataflow / FRP / Elm approach similarly supports this sort of overlapping gesture. The other approaches described above can support it too, but the coding patterns they encourage are actively hostile toward this sort of overlap, so you have to write your code very carefully to make it work. Whereas with our gesture system, the best-feeling behaviour is the default.