I was wrong. CRDTs are the future

I saw Martin Kleppmann’s talk a few weeks ago about his approach to realtime editing with CRDTs, and I felt a deep sense of despair. Maybe all the work I’ve been doing for the past decade won’t be part of the future after all, because Martin’s work will supersede it. It’s really good.

Let’s back up a little.

Around 2010 I worked on Google Wave. Wave was an attempt to make collaboratively editable spaces to replace email, Google Docs, web forums, instant messaging and a hundred other small single-purpose applications. Wave had a property I love in my tools that I haven’t seen articulated anywhere: it was a general purpose medium (like paper). Unlike a lot of other tools, it didn’t force you into its own workflow. You could use it to do anything: plan holidays, make a wiki, play D&D with your friends, schedule a meeting, etc.

Internally, Wave’s collaborative editing was built on top of Operational Transform (OT). OT has been around for a while - the algorithm we used was based on the original Jupiter paper from 1995. It works by storing, for each document, a chronological list of every change. “Type an H at position 0”. “Type an i at position 1”. Etc. Most of the time, users are editing the latest version of the document and the operation log is just a list of all the changes. But if users are collaboratively editing, we get concurrent edits. When this happens, the first edit to arrive at the server gets recorded as usual. If the second edit is out of date, we use the log of operations as a reference to figure out what the user really intended. (Usually this just means updating character positions.) Then we pretend that’s what the user meant all along and append the new (edited) operation. It’s like a realtime git rebase.
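To make that concrete, here’s a minimal sketch of the transform step in Rust, for insert-only operations. The names and types here are illustrative - they’re not Wave’s or ShareJS’s actual API:

```rust
// A toy insert operation. Real OT systems also have deletes (and richer
// formatting ops), which make the case analysis much hairier.
#[derive(Clone)]
struct Insert {
    pos: usize,
    text: String,
}

// Rewrite `op` so it still means the same thing after `concurrent` has
// already been applied. If the concurrent insert landed at or before our
// position, shift right. (Real systems also need a consistent tie-breaking
// rule when both inserts hit the same position.)
fn transform(op: &Insert, concurrent: &Insert) -> Insert {
    let mut t = op.clone();
    if concurrent.pos <= op.pos {
        t.pos += concurrent.text.len();
    }
    t
}

fn main() {
    // The server has already applied: insert "Hi " at position 0.
    let applied = Insert { pos: 0, text: "Hi ".into() };
    // A stale client op, written against the old document.
    let stale = Insert { pos: 0, text: "world".into() };
    let rebased = transform(&stale, &applied);
    assert_eq!(rebased.pos, 3); // "world" now goes after "Hi "
}
```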

Once Wave died, I reimplemented the OT model in ShareJS. This was back when node was new and weird. I think I had ShareJS working before npm launched. It only took about 1000 lines of code to get a simple collaborative editor working, and when I first demoed it I collaboratively edited a document from a browser and from a native application.

At its heart, OT is a glorified for() loop with some helper functions to update character offsets. In practice, this works great. OT is simple and understandable. Implementations are fast. (10k-100k operations per second in unoptimized JavaScript; 1-20M ops/sec in optimized C.) The only storage overhead is the operation log, and you can trim that down if you want to. (Though you can’t merge super old edits if you do.) You need a centralized server to globally order operations, but most systems have a centralized server / database anyway, right?
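And here’s that loop, continuing the insert-only sketch from above: the server transforms an incoming op against every op it hasn’t seen, then appends it to the log:

```rust
// The "glorified for() loop": rebase an op (written when the document was at
// `base_version`) over everything that landed since, then record it.
fn rebase_and_append(log: &mut Vec<Insert>, mut op: Insert, base_version: usize) {
    for applied in &log[base_version..] {
        op = transform(&op, applied);
    }
    log.push(op);
}
```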

Centralized servers

The big problem with OT is that dependency on a centralized server. Have you ever wondered why Google Docs shows you that weird “This document is overloaded so editing is disabled” message when a document is shared on social media? The reason (I think) is that when you open a Google Doc, one server is picked as the computer all the edits run through. When the mob descends, Google needs to pull out a bunch of tricks so that computer doesn’t become overwhelmed.

There are some workarounds they could use to fix this. Aside from sharding by document (like Google Docs), you could edit via a retry loop around a database transaction. This pushes the serialization problem to your database. (Firepad and ShareDB work this way.)
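Here’s roughly what that looks like, reusing the toy Insert and transform from the sketches above. An in-memory Mutex stands in for the database transaction here - the real systems do this compare-and-append inside the database itself:

```rust
use std::sync::Mutex;

// A stand-in for a database table of operations.
struct OpStore {
    log: Mutex<Vec<Insert>>,
}

impl OpStore {
    // The "transaction": append only if `version` is still the head of the
    // log. On conflict, hand back the ops the client missed.
    fn append_if_version(&self, version: usize, op: Insert) -> Result<(), Vec<Insert>> {
        let mut log = self.log.lock().unwrap();
        if log.len() == version {
            log.push(op);
            Ok(())
        } else {
            Err(log[version..].to_vec())
        }
    }
}

// The retry loop. Each failed attempt tells us which ops we missed, so we
// transform over them and try again at the newer version.
fn submit(store: &OpStore, mut op: Insert, mut version: usize) {
    loop {
        match store.append_if_version(version, op.clone()) {
            Ok(()) => return,
            Err(missed) => {
                for applied in &missed {
                    op = transform(&op, applied);
                }
                version += missed.len();
            }
        }
    }
}
```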

It’s not perfect though. We wanted Wave to replace email. Email is federated. An email thread can span multiple companies and it all just works. And unlike Facebook Messenger, emails are only sent to the companies that are CC’ed. If I email my coworker, my email doesn’t leave the building. For Wave to replace email, we needed the same functionality. But how can that work on top of OT? We got it working, kinda, but it was complex and buggy. We ended up with a scheme where every wave would arrange a tree of wave servers and operations were passed up and down the tree. But it never really worked. I gave a talk at the Wave Protocol Summit just shy of 10 years ago explaining how to get on the network. I practiced that talk and did a full runthrough beforehand. On the day I followed my own steps literally, and the version I made live didn’t work. I still have no idea why. Whatever the bugs were, I don’t think they were ever fixed in the open source version. It’s all just too complicated.

The rise of CRDTs

Remember, the algorithm Wave used was invented in 1995. That’s a pretty long time ago. I don’t think I even had the internet at home back in 1995. Since then, researchers have been busy trying to make OT work better. The most promising work uses CRDTs (Conflict-free Replicated Data Types). CRDTs approach the problem slightly differently to allow realtime editing without needing a central source of truth. Martin lays out how they work in his talk better than I can, so I’ll skip the details.

People have been asking me what I think of them for many years, and my answer was always something like this:

They’re neat and I’m glad people are working on them, but:

  • They’re slow. Like, really slow. E.g., Delta-CRDTs take nearly 6 hours to process a real-world editing session: a single user typing a 100KB academic paper. (Benchmarks - look for B4.)
  • Because of how CRDTs work, documents grow without bound. The current automerge master takes 83MB to represent that 100KB document on disk. Can you ever delete that data? Probably not. And that data can’t just sit on disk. It needs to be loaded into memory to handle edits. (Automerge currently grows to 1.1GB in memory for that.)
  • CRDTs are missing features that OT has had for years. For example, nobody has yet made a CRDT that supports object move (moving something from one part of a JSON tree to another). You need this for applications like Workflowy. OT handles this fine.
  • CRDTs are complicated and hard to reason about.
  • You probably have a centralized server / database anyway.

I made all those criticisms and dismissed CRDTs. But in doing so I stopped keeping track of the literature. And - surprise! CRDTs went and quietly got better. Martin’s talk (which is well worth a watch) addressed the main points:

  • Speed: Using modern CRDTs (Automerge / RGA or y.js / YATA), applying operations should be possible with just a log(n) lookup. (More on this below.)
  • Size: Martin’s columnar encoding can store a text document with only about a 1.5x-2x size overhead compared to the contents themselves. Martin talks about this 54 minutes into his talk. The code to make this work in automerge hasn’t merged yet, but Yjs implemented Martin’s ideas. And in doing so, Yjs can store that same 100KB document in 160KB on disk, or 3MB in memory. Much better. (There’s a toy sketch of why this encoding compresses so well just after this list.)
  • Features: There’s at least a theoretical way to add all the features using rewinding and replaying, though nobody’s implemented this stuff yet.
  • Complexity: I think a decent CRDT will be bigger than the equivalent OT implementation, but not by much. Martin managed to make a tiny, slow implementation of automerge in only about 100 lines of code.
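Here’s that toy sketch of the compression idea. This isn’t Martin’s actual format - it just illustrates why storing per-character metadata column-by-column and run-length encoding each column works so well:

```rust
// A toy run-length encoder. A user typing sequentially produces one long run
// in the "agent" column (and, after delta-encoding, in the sequence-number
// column too), so columns collapse to almost nothing.
fn rle<T: PartialEq + Copy>(column: &[T]) -> Vec<(T, usize)> {
    let mut runs: Vec<(T, usize)> = Vec::new();
    for &v in column {
        match runs.last_mut() {
            Some((last, count)) if *last == v => *count += 1,
            _ => runs.push((v, 1)),
        }
    }
    runs
}

fn main() {
    // 100,000 characters typed in order by agent 0: the whole agent column
    // collapses into a single (value, length) pair.
    let agents = vec![0u64; 100_000];
    assert_eq!(rle(&agents), vec![(0u64, 100_000)]);
}
```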

I still wasn’t completely convinced by the speed argument, so I made a simple proof of concept CRDT implementation in Rust on top of a B-tree, using ideas from automerge, and benchmarked it. It’s missing features (deleting characters, conflicts). But it can handle 6 million edits per second. (Each iteration does 2000 edits to an empty document by an alternating pair of users, and that takes 330µs. So, 6.06 million inserts / second.) That means we’ve made CRDTs good enough that the difference in speed between CRDTs and OT is smaller than the speed difference between Rust and JavaScript.
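For the curious, here’s the shape of the core insert logic, heavily simplified from my proof of concept: a flat Vec instead of a B-tree (so this version is O(n), not O(log n)), insert-only, and with a cruder ordering rule than real RGA / YATA:

```rust
// Every character gets a unique (agent, per-agent sequence number) ID, and
// names the ID of the character it was typed after.
type Id = (u64, u64);

struct Item {
    id: Id,
    parent: Option<Id>, // inserted after this item; None = document start
    ch: char,
}

fn integrate(doc: &mut Vec<Item>, new: Item) {
    // Start scanning just after the parent.
    let mut idx = match new.parent {
        None => 0,
        Some(p) => doc.iter().position(|i| i.id == p).expect("unknown parent") + 1,
    };
    // Skip concurrent siblings with a larger ID (and their descendants), so
    // every replica orders concurrent inserts identically. Real RGA / YATA
    // ordering rules are subtler than this.
    let mut skipped: Vec<Id> = Vec::new();
    while idx < doc.len() {
        let item = &doc[idx];
        let greater_sibling = item.parent == new.parent && item.id > new.id;
        let child_of_skipped = matches!(item.parent, Some(p) if skipped.contains(&p));
        if greater_sibling || child_of_skipped {
            skipped.push(item.id);
            idx += 1;
        } else {
            break;
        }
    }
    doc.insert(idx, new);
}

fn main() {
    let mut doc = Vec::new();
    // Two users concurrently type at the start of an empty document.
    integrate(&mut doc, Item { id: (1, 0), parent: None, ch: 'a' });
    integrate(&mut doc, Item { id: (2, 0), parent: None, ch: 'b' });
    // Both replicas converge on the same order, whichever op arrives first.
    let text: String = doc.iter().map(|i| i.ch).collect();
    assert_eq!(text, "ba"); // the larger agent ID wins the tie here
}
```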

All these improvements have been “coming soon” in automerge’s performance branch for a really long time now. But automerge isn’t the only decent CRDT out there. Y.js works well and beats the pants off automerge’s current implementation in the Y.js benchmarks. It’s missing some features I want, but it’s generally easier to fix an implementation than to invent a new algorithm.

Inventing the future

I care a lot about inventing the future. What would it be ridiculous not to have in 100 years? Obviously we’ll have realtime editing. But I’m no longer convinced OT - and all the work I’ve done on it - will still be around. I feel really sad about that.

JSON and REST are used everywhere these days. Let’s say in 15 years realtime collaborative editing is everywhere. What’s the JSON equivalent for realtime editing that anyone can just drop into their project? In the glorious future we’ll need high quality CRDT implementations, because OT just won’t work for some applications. You couldn’t make a realtime version of Git, or a simple remake of Google Wave, with OT. But if we have good CRDTs, do we need good OT implementations too? I’m not convinced we do. Every feature OT has can be put into a CRDT. (Including trimming operations, by the way.) But the reverse is not true. Smart people disagree with me, but if we had a good, fast CRDT available from every language, with integration on the web, I don’t think we’d need OT at all.

OT’s one advantage is that it fits well in centralized software - which is most software today. But distributed algorithms work great in centralized software too. (E.g., look at GitHub.) And I think a really high quality CRDT running in wasm would be faster than an OT implementation in JS. And even if you only care about centralized systems, remember - Google runs into scaling problems with Google Docs because of OT’s limitations.

So I think it’s about time we made a lean and fast CRDT. The academic work has been mostly done. We need more kick-ass implementations.

What’s next

I increasingly don’t care for the world of centralized software. Software interacts with my data, on my computers. It’s about time my software reflected that relationship. I want my laptop and my phone to share my files over my wifi. Not by uploading all my data to servers in another country. Especially if those servers are financed by advertisers bidding for my eyeballs.

Philosophically, if I modify a Google Doc, my computer is asking Google for permission to edit the file. (You can tell because if Google’s servers say no, I lose my changes.) In comparison, if I git push to GitHub, I’m only notifying GitHub about the change to my code. My repository is mine. I own all the bits, and all the hardware that houses them. This is how I want all my software to work. Thanks to people like Martin, we now know how to make good CRDTs. But there’s still a lot of code to write before local-first software can become the default.

So Operational Transform, I think this is goodbye from me. We had some great times. Some of the most challenging, fun code I’ve ever written was operational transform code. OT - you’re clever and fascinating, but CRDTs can do things you were never capable of. And CRDTs need me. With some good implementations, I think we can make something really special.

I mourn all the work I’ve done on OT over the years. But OT no longer fits into the vision I have for the future. CRDTs would let us remake Wave, but simpler and better. And they would let us write software that treats users as digital citizens, not as digital serfs. And that matters.

The time to build is now.


Discussion on HN