Email culture

I just got a forwarded email from my mum that used gaudy purple size-18 text to say this, linking to some YouTube video:

Human ingenuity knows no limits. Make sure speakers are on. Sit back and take it all in. Enjoy!

I want to ask Mum what's up with the awful colourful text - honestly, why does someone start thumbing through text colours looking for something shouty and obnoxious? Why don't they just use plaintext?


But why should anyone use plaintext? Why do I prefer it? I'm sure I could invent some reasons, but it's not really better. In truth, I prefer plaintext because that's just how we talk. It's cultural.

Ten years ago, rich text in email was considered rude. You excluded mutt and pine users. So technical mailing lists stayed in plaintext, and all important discussions happened in 80-character terminals. I don't know if anyone I care about still uses text-based email clients. It doesn't really matter at this point. Now unnecessary styling in email feels like a 'foreign tribe' signal. When I see it, I think that this person probably doesn't have experience with mailing lists. And so I intuitively think my mum shouldn't use it because people will think less of her.

And that's totally wacky and wrong. My mum and her friends aren't in our internet tribes. They don't have to be, and even though our people invented email, we don't get to decide how it gets used. The fact that my mum's social circle can invent their own social conventions is a great sign for email as a product. And anyway, it's not like any of them uses mutt.

As always, culture moves on.

Secure Email

Sending email securely is such a mess. Even PGP isn't good enough because it leaks metadata about who I'm contacting, when, and how much I'm saying. I'm really bothered by that and I've been thinking about it a lot lately. I think my ideal setup would be something like this:

Suppose alice@a.example.com wants to email bob@b.example.com. Alice needs Bob's PGP key, a.example.com's public key and b.example.com's public key.

Alice PGP-encrypts her email to Bob, then encrypts that so it can only be read by b.example.com, then encrypts that so it can only be read by a.example.com.

When she sends her email to her SMTP server at a.example.com, her server can only decrypt enough to know that the message came from Alice and is intended for b.example.com. Her server learns nothing else about the email, including its final recipient. Her SMTP server forwards the encrypted bundle to b.example.com.

b.example.com decrypts the message with its key, and only learns that the email came from a.example.com and is intended for Bob. The b.example.com server does not know that Alice sent the email.

Finally Bob receives the message and decrypts it using his PGP key. Bob can of course read everything, including who sent the message.

This system has the big advantage that snooping hardware at either a.example.com or b.example.com alone tells the NSA almost nothing - just that Alice sent someone an email, or that Bob received an email.

They would need hardware at both endpoints to discover that Alice and Bob are even messaging each other. Further, if Alice and Bob are feeling particularly paranoid, once this infrastructure was in place it would be easy to Tor-style bounce the message through a few more intermediate mail servers to make snooping almost impossible. Once it bounced through more locations, even if the NSA snooped on both endpoints, they wouldn't be able to match the messages together - they would just know that Alice sent an email and Bob received one.

It's a shame that email would need to change so much to implement this system. But long term, I think it's something we should work towards.

Identity crisis on the web

I went to IndieWebCamp drinks the other night and chatted to those guys about my ideas for KeyKitten.

We ended up chatting about what your main identity should be on the web. The two candidates are your email address (user@domain.com) or a URL, which for us techie people is probably simply a domain. On a website, you can put an hCard which can list all of your secondary identity information anyway, like email address and twitter handle.

The big advantages of email are:

  • Everyone already has one, even my mother
  • Everyone remembers their email addresses already
  • My mum will probably never register her own domain

Email is the primary identity for Persona and Gravatar. If someone logs in to your service using Persona, you get their email address, not their URL.

URLs on the other hand are more powerful. I can put actual content on my website, including whatever contact information I want and an avatar image. We can do that with email addresses too, but it's sort of hacked in. Gravatar (well, libravatar) works by rewriting foo@example.com to example.com/avatar/<hash of foo@example.com>, which is a big awful nasty-looking hack. Using hCard, you can just link to your avatar image from your homepage. Of course, if I own josephg.com, it's pretty easy to put an image at josephg.com/avatar/<hash> anyway. It's just kind of haphazard.

The other benefit of URLs that @tantek kept talking about is that people shouldn't be siloed, and an identity like josephg@gmail.com is stuck in Gmail's stack. As far as identities go, it's not really a first-class citizen. Maybe we shouldn't be building new infrastructure with the assumption of siloed anything. Maybe we should make people without a website feel the pinch.

I'm kinda convinced by the silos argument, but I still want to be able to send encrypted email to my mum. Not because we have anything to hide, but more because fuck the surveillance state, that's why. It's also much easier to programmatically find someone's gravatar than to parse an hCard entry.

URLs are a fun idea, but I think email-based systems will win in the end amongst regular folk. And that's the hardest question of all: ultimately, who are we making our software for? If we're making it for tinkerers and hackers, a URL will be fine. If we're making it for my mum, an email address is really the only way to go. There's something really appealing about designing and writing software for a smaller internet with just the people who create on it. But it's also insular and snobbish, and many of my friends won't make the effort to join me there.

ShareJS 0.7

This is a repost from my old tumblr blog from March 19

A month ago I got hired by Lever and moved to San Francisco. We're building an applicant tracking system for hiring. It's a realtime web app built on top of Derby and Racer, a web framework written by the company's cofounders.

Racer doesn't do proper OT and it doesn't scale. Over the next few months, I'm going to refactor and rewrite big chunks of ShareJS so we can use it underneath Racer to keep data in sync between the browser and our servers. I'm going to refactor ShareJS into a few modules (long overdue), add live queries and make the database layer support scaling.

I want feedback on this before I start. I will break things, but I think it's worth it in the long term.

So, without further ado, here's the master plan:


Standardized OT Library

First, ShareJS's OT types are written to a simple API and don't depend on any external services. I'm going to pull them out into their own project, akin to libOT.

The types here should be super stable and fast, and preferably written in multiple languages.

I considered adding some simple, reusable OT management code in there too, but by the time I pared OT down to something reusable, it was just a for loop.

I'm not sure where the text & JSON API wrappers should go. The wrappers are generally useful, but not coded in a particularly reusable way.

Scalable database backend

Next, we need a scalable version of ShareJS's database code. I want to pull it out into its own module and make it support scaling the server across multiple machines.

I also want to add:

  • Collections: Documents will be scoped by collection. I expect collections to map to SQL tables, mongodb collections or couchdb databases. Collections seem to be a standard, useful thing.
  • Live queries: I want to be able to issue a query saying "get me all docs in the profiles collection with age > 50". The result set should update in realtime as documents are added & removed from that set. This should also work with paginated requests. I don't want to invent my own query language - I'll just use whatever native format the database uses. (SQL select statements, couchdb views, mongo find() queries, etc).
  • Snapshot update hooks: For example, I want to be able to issue a query to a full-text search database (like SOLR) and reuse the same live query mechanism. I imagine this working via a post-update hook that the application can use to update SOLR. As a first pass, I'll poll all outstanding queries against the database when documents are updated, but I can optimise for certain common use cases down the track.
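
To make the live query idea concrete, here's a trivial in-memory sketch. Every name here is hypothetical - none of this is existing ShareJS API - and it takes the naive first-pass approach described above: re-check every outstanding query whenever a document changes.

```javascript
// A query is a predicate over documents; its result set updates live.
function Collection() {
  this.docs = {};
  this.queries = [];
}

// Register a live query. onChange fires with ('add', id) or ('remove', id)
// as documents enter and leave the result set.
Collection.prototype.liveQuery = function(predicate, onChange) {
  const q = { predicate: predicate, onChange: onChange, results: new Set() };
  for (const id in this.docs) {
    if (predicate(this.docs[id])) q.results.add(id);
  }
  this.queries.push(q);
  return q;
};

// First pass: poll every outstanding query whenever a document is updated.
Collection.prototype.set = function(id, doc) {
  this.docs[id] = doc;
  for (const q of this.queries) {
    const inSet = q.results.has(id);
    const matches = q.predicate(doc);
    if (matches && !inSet) { q.results.add(id); q.onChange('add', id); }
    else if (!matches && inSet) { q.results.delete(id); q.onChange('remove', id); }
  }
};

// "Get me all docs in the profiles collection with age > 50", kept live:
const profiles = new Collection();
const events = [];
profiles.liveQuery(function(d) { return d.age > 50; },
                   function(kind, id) { events.push(kind + ':' + id); });
profiles.set('alice', { age: 60 });  // → events: ['add:alice']
profiles.set('alice', { age: 40 });  // → events: ['add:alice', 'remove:alice']
```

A real implementation would push the predicate down to the database's native query language rather than polling in process, but the observable behaviour is the same.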

I want to get the API stable first and let the implementation grow in complexity as we need more scalability and reliability. At first, this code will route all messages through a single redis server. Later I want to set it up with a redis slave for automatic failover and make the server shard between multiple DB instances using consistent hashing of document IDs or something.

I'm nervous about how the DB code and the operational transform algorithm will fit together. If the DB backend doesn't understand OT, the API will have to be strongly tied to ShareJS's model code and harder to reuse. But if the DB code understands OT and subsumes ShareJS's model code, it becomes much harder to adapt to other databases (you'd need to rewrite all that code!). I really love the state of model.coffee in ShareJS at the moment, though it took me two near-complete rewrites to get there.

I would also like to make a trivial in-memory implementation for examples and for testing. Once I have two implementations and a test suite, it should be possible to rewrite this layer on top of Hadoop or AWS or whatever.

ShareJS code

What's left for ShareJS?

ShareJS's primary responsibility is to let you access the OT database from a web browser or nodejs client in a way that's secure & safe.

It will (still) have these components:

  • Auth function for limiting reading & writing. I want to extend this for JSON documents to make it easy to restrict / trim access to certain parts of some documents.
  • Session code to manage client sessions. All the protocol stuff that's in session.coffee. I want to rewrite / refactor this to use NodeJS's new streams.
  • Presence, although this will require some rethinking to work with the new database backend stuff.
  • A simple API that lets you tell the server when it has a new client, and pass messages for it. I'm sick of all the nonsense around socket.io, browserchannel, sockjs, etc so I want to just make it the user's problem. Again, this will use the new streams API. This also makes it really easy for applications to send messages to their server that don't have anything to do with OT.
  • Equivalent connection code on the client, currently in client/connection.coffee.
  • Client-side OT code, currently in client/doc.coffee.
  • Build script to bundle up & minify the client and required OT types for the browser. I want to rewrite this in Make. (Sorry windows developers).
  • Tests. It looks like nodeunit is no longer actively maintained, so it might be time to port the tests to a different framework. (Suggestions? What does everybody use these days?)

ShareJS has slowly become a grab bag of other stuff that I like. I'm not sure whether all this stuff should stay in ShareJS or what.

There is:

  • The examples. These wire ShareJS up with browserchannel and express, and add a few dependencies that ShareJS won't otherwise have.
  • The different database backends. Unless someone makes an adapter for my new database code, these are all going to break. Sorry.
  • Browser binding for textareas, ace and codemirror
  • All the ongoing etherpad work. I met a bunch of etherpad & etherpad lite developers at an event last week, and they were awesome. Super happy this is happening.


That's the gist of the redesign. Some thoughts:

I hate making ShareJS more complicated, but at the same time I think it's important to make it actually useful. People need to scale their servers and they need to be able to build complex applications on top of all this stuff. I love how ShareJS's entire server is basically encapsulated in one file, and it'll be a pity to lose that.

This change will break existing code. Sorry. The current DB adapters will break, and putting documents in collections will change APIs all the way through ShareJS.

I'm still not entirely sure how this redesign will interact with my C port of ShareJS. Before I realised how integral ShareJS would be to my current work, I was intending to finish working on my C implementation next. For now, I guess that'll take a back seat. (In exchange, I'll be working on this stuff while at work, and not just on weekends.)

This design allows some nice application features. For example, the auth code can much more easily enforce schemas for documents. You could enforce that everything in the 'code' collection has type 'text', everything in the 'projects' collection is JSON (with a particular structure) and items in the 'profiles' collection are only editable by the user who owns the profile. You could probably do that before, but it was a bit more subtle.
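
A sketch of what such an auth function might look like. The shapes of `agent` and `action` here are illustrative, not the real ShareJS API:

```javascript
// Hypothetical schema: each collection constrains what its docs may be.
const schema = {
  code: { type: 'text' },       // everything in 'code' must be plain text
  projects: { type: 'json' }    // everything in 'projects' must be JSON
};

function auth(agent, action) {
  if (action.name === 'create') {
    const rule = schema[action.collection];
    if (rule && action.docType !== rule.type) return action.reject();
  }
  // Profiles may only be edited by their owner.
  if (action.collection === 'profiles' && action.name === 'submit op'
      && action.docName !== agent.userId) {
    return action.reject();
  }
  return action.accept();
}

// A tiny stub to exercise the function:
function makeAction(props) {
  const a = Object.assign({ status: null }, props);
  a.accept = function() { a.status = 'accepted'; };
  a.reject = function() { a.status = 'rejected'; };
  return a;
}

const bad = makeAction({ name: 'create', collection: 'code', docType: 'json' });
auth({ userId: 'joseph' }, bad);
console.log(bad.status); // → rejected
```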

As I said above, I'm not sure where the line should be drawn between the DB project and the model. If they're two separate projects, they should have a very clear separation of concerns. I'm really trying to build a DB wrapper that provides the API that I want databases to provide directly, in a scalable way. However, that idea is entangled with the OT and presence functionality. What a mess.

I want feedback on all this stuff. I know a lot of people are building cool stuff with ShareJS, or want to. Do these plans make your lives better or worse? Should we keep the current simple incarnation of ShareJS around? If I'm taking the time to rip the guts out of ShareJS, what else would you like to see changed? How do these ideas interact with the etherpad integration work?


Chipmunk in ASM.JS

I'm not sold on emscripten. It's a cool idea, and it's impressive that it works at all, but its output seems really stupid. For example, take this C function:

cpFloat cpMomentForCircle(cpFloat m, cpFloat r1, cpFloat r2, cpVect offset) {
    return m*(0.5f*(r1*r1 + r2*r2) + cpvlengthsq(offset));
}

This is the asm.js code that emscripten generates using -O2:

function cpMomentForCircle(m, r1, r2, offset) {
  // Type annotations

  // Variable declarations
  var f=0, g=0, h=0.0;

  // Body
  HEAP32[offset>>2] = HEAP32[g>>2];
  HEAP32[offset+4>>2] = HEAP32[g+4>>2];
  HEAP32[offset+8>>2] = HEAP32[g+8>>2];
  HEAP32[offset+12>>2] = HEAP32[g+12>>2];
  h=((r1*r1 + r2*r2)*0.5+ +cpvlengthsq(offset))*m;
  return +h;
}

How did such a simple function become so complicated? It doesn't need to copy the offset vector onto the stack (in fact, it doesn't need to use the stack at all).

With code like that you can easily see how ChipmunkJS triples in size due to emscripten. You can also see how executables get so big...

I was curious how hand-written asm.js compares, so I ported cpVect to asm.js manually.

The original C:

static inline cpVect cpvclamp(const cpVect v, const cpFloat len) {
    return (cpvdot(v,v) > len*len) ? cpvmult(cpvnormalize(v), len) : v;
}


The existing ChipmunkJS port:

var vclamp = cp.v.clamp = function(v, len) {
    return (vdot(v,v) > len*len) ? vmult(vnormalize(v), len) : v;
};

And the hand-written asm.js:

function vclamp(ret, v, len) {
  ret = ret|0;
  v = v|0;
  len = +len;

  if (+vdot(v, v) > len*len) {
    vnormalize(ret, v);
    vmult(ret, ret, len);
  } else {
    cpv(ret, +f64[v>>3], +f64[v+8>>3]);
  }
}

(Most methods don't balloon out in complexity so much)

I don't know how my version compares to LLVM's in terms of speed. I'd like to think that it's faster, but I have no idea if that's actually true. I've been told that LLVM will almost always generate better assembly than me.

It's really annoying to write this code - especially dealing with the heap and doing type annotations everywhere. If I were going to convert the whole thing, I'd be better off using LLJS's asm.js branch. That said, LLJS hasn't been touched in 3 months and doesn't seem to currently work. I think I'd still be ahead time-wise after fixing LLJS though. I don't want to debug those bitshift operations.

The big win is in compiled output size. And here's the crazy part: The hand-written asmjs is smaller than ChipmunkJS!

$ uglifyjs -cm <cpVect.js  | wc -c


$ uglifyjs -m <asm.js  | wc -c

(-c doesn't work on asmjs modules - it breaks some of the type hints)

That asmjs module is at a disadvantage too - it includes a bunch of asmjs boilerplate that is only needed once!
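
For reference, here's roughly what that per-module boilerplate looks like: a complete (if tiny) asm.js module with one exported function. Since asm.js is a subset of plain JavaScript, it runs unmodified in any engine:

```javascript
function VectModule(stdlib, foreign, heap) {
  "use asm";
  var f64 = new stdlib.Float64Array(heap);

  // Squared length of the 2-vector stored at byte offset v in the heap.
  function vlengthsq(v) {
    v = v|0;
    var x = 0.0, y = 0.0;
    x = +f64[v>>3];
    y = +f64[v+8>>3];
    return +(x*x + y*y);
  }

  return { vlengthsq: vlengthsq };
}

// Link the module against a 64KB heap and call in:
const heap = new ArrayBuffer(0x10000);
const mod = VectModule(globalThis, {}, heap);
new Float64Array(heap).set([3, 4], 0);  // vector (3, 4) at byte offset 0
console.log(mod.vlengthsq(0)); // → 25
```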

The most interesting part is looking at the minified code. It looks totally different, and it's super obvious which is which:

,vperp=cp.v.perp=function(t){return new Vect(-t.y,t.x)},
vpvrperp=cp.v.pvrperp=function(t){return new Vect(t.y,-t
.x)},vproject=cp.v.project=function(t,n){return vmult(n,
(t){return this.mult(vdot(this,t)/vlengthsq(t)),this};va
r vrotate=cp.v.rotate=function(t,n){return new Vect(t.x*
tion(t){return this.x=this.x*t.x-this.y*t.y,this.y=this.

function I(n,r,t){n=n|0;r=r|0;t=t|0;m(n,+y[r>>3]-+y[t>>3
],+y[r+8>>3]-+y[t+8>>3])}function U(n,r){n=n|0;r=r|0;y[n
>>3]=-+y[r>>3];y[n+8>>3]=-+y[r+8>>3]}function z(n,r,t){n
function F(n,r){n=n|0;r=r|0;return+(+y[n>>3]*+y[r>>3]+ +
y[n+8>>3]*+y[r+8>>3])}function _(n,r){n=n|0;r=r|0;return
+(+y[n>>3]*+y[r+8>>3]-+y[n+8>>3]*+y[r>>3])}function b(n,
r){n=n|0;r=r|0;m(n,-+y[r+8>>3],+y[r>>3])}function j(n,r)

I've never seen javascript look so mathsy. GWT, CoffeeScript and the closure compiler all look positively plain compared to that.

The entire cpVect asmjs module is here if you want to see a larger code sample.

In comparison, cpVect.js is here and the original cpVect.h is here.

The only downside is that it's super awkward to call any of these methods from normal javascript. Because asm.js modules can't view or edit normal javascript objects, all the stateful data has to live inside the module's own memory heap & stack. If I go this route, I'll need to wrap the entire chipmunk API in a plain JS API to make it usable by normal javascript programs. It's a serious downer. (Emscripten has exactly the same problem.)

KeyKitten: Gravatar for keys!

One of the reasons crypto isn't used more is usability. To use PGP there are like six scary steps you have to go through. First you have to install gpg, then make your keys (which requires typing some scary stuff and choosing your cipher..!?). Then you have to add your key to your keyring (??), and you should upload it to some random websites that nobody has ever heard of. And you'll still feel guilty because you didn't go to a key signing party. Even after you've done all of that, you need to store your private key somewhere safe so you don't lose it. And who do you trust with your private key?

And good gracious, I hope you don't use windows to do all of that!

Let's face it: my mum is never going to make a PGP key today, and GMail's target audience is my mum, not crypto neckbeards. That makes message encryption impossible.

We have the same problem on the other side of the fence. If I want to send an encrypted message to jim@example.com, how can I get Jim's public key?

Well, in comes KeyKitten. The point of keykitten.org is Gravatar for keys. Hash jim@example.com to c20266793..., then fetch https://keykitten.org/keys/c20266793d32b1b99e42438807fc7038f89bb326/pgp to get his PGP key. Or you can fetch /ssh to get Jim's public ssh key.

The other half of the project is a simple web UI to sign in & upload your keys to the site. I want it to be usable by both my mum and security neckbeards. If you don't have a key, we'll generate you one using browser javascript. If you're worried you'll lose your private key, I'll store a copy of it (but only if you want me to). I'll use Persona to sign users in, and pin the SSL certificates in Chrome and Firefox (and make the SSL cert widely published).

Neckbeards can go in and upload the PGP key they generated & got signed at key parties. My mum can click the 'figure it out for me' button. And finally, of course, the site should be federated: if you want jim@example.com's key, you should first check example.com/keys/... before looking on keykitten.org.

There are a few fun things you can do with a system like this. Once github knows my email address, they can just look up my ssh public keys to give me access. If I want to let a friend ssh in to my computer, I can add him from my contact book (I have his email address, after all). My computer will fetch his ssh key via keykitten, make an account and add his key to authorized_keys. And finally, it should be much easier to make things like encrypting browser extensions. All the extension needs to know is the recipient, and it can figure out how to encrypt data for them.

So that's the plan. Little, tiny, exciting steps.

Dreams, cleverness and the gallows

Last night I dreamed that I was on death row, being slowly led out toward the gallows. I hadn't really done anything wrong; suddenly my execution was just sort of about to happen. Everyone thought I must have some clever plan to escape, so most of my dream was taken up scheming to cheat death. If I was sneaky and clever enough, surely I could think of something.

And now I'm awake. At some point I'm going to get old and die. I won't die because I'll deserve it. I'll die because this dreamy life might end before I think of a clever escape from death. We're probably in one of the last generations to ever die - how sad is that!

I should shoulder some of that burden. We all should; it's sensible. I don't want to leave my potential immortality up to the cleverness of strangers and their willingness to share. Seriously - what do we need to do? Because I'm having way too much fun, and it would just be so sad if I die stupidly because we're too busy partying to be clever.

And I know all that. I've been talking about this stuff, and my crazy AI ideas for basically ever. When will I actually write that code? The most recent HPMOR chapters have kicked me with this stuff again. If I catch some terminal illness in a few years, I will look back on my time now with disgust - as wasted years that I could have been saving my life. Or worse, if someone I love dies because right now I'm spending my life making hiring software instead of AI, then what? There's a gun pointed at my head and yours. Every day there's a small chance it will go off. Today, I'm ignoring it and building cool concurrency systems instead. But we should do something about that gun. If we don't, we will all die, 100%.

We're all on the way to the gallows. It's time to come up with something clever.

ChipmunkJS and Emscripten

I've finally gotten around to compiling Chipmunk to JS using Emscripten to see what happens. It works great.

As a baseline, here's the first benchmark running in C:

Time(a) =  1451.45 ms (benchmark - SimpleTerrainCircles_1000)

The same benchmark running with chipmunkjs in node 0.10.12:

$ node bench.js 
SimpleTerrainCircles 1000
Run 1: 22426
Run 2: 21808

(ie, 21 seconds, 15x baseline)

Using v8 head:

$ ../v8/out/native/d8 bench.js
SimpleTerrainCircles 1000
Run 1: 11248
Run 2: 12930

(8x baseline)

Emscripten (-O2 -DNDEBUG) in Chrome Canary:

Time(a) =  3967.12 ms (benchmark - SimpleTerrainCircles_1000)

(2.7x baseline)

In Firefox 22 (which has asmjs support) (Firefox nightly (25a) has about the same performance):

Time(a) =  2044.10 ms (benchmark - SimpleTerrainCircles_1000)

(1.4x - only 40% slower than C!!!)

The V8 team is actively working on making asmjs code run faster in v8. They don't want to have a special 'asm.js mode' like firefox does - instead they're adding optimizations which can kick in for asmjs-style code (source: insiders on the Chrome team). I expect Chrome performance to catch up to firefox performance in the next ~6 months or so.


  • I didn't make any changes to chipmunk (although I did bump the chipmunkjs test runs back up to 1000 to match chipmunk). My test code is here

  • I compiled the benchmark code from C using emscripten. If your game is written in javascript, performance will be worse than this.

  • These numbers are approximate. I didn't run the benchmarks multiple times and I have a million things open on my machine. I doubt they'll be off by more than ~10% though.

  • Downloaded filesize increases by nearly 3x. Chipmunk-js is 170k, or 17k minified & gzipped. With emscripten the output is 300k, minified & gzipped to 49k. This is way bigger.

  • We can expose most of chipmunk directly to javascript. Unfortunately, we can't share vectors between the emscripten environment and the outside world - emscripten (obviously) inlines vectors inside its own heap & stack. In javascript, the best we can do is use objects in the JS heap. Our options are: remove vectors from the API as much as possible (cpBodySetPos(v) -> cpBodySetPos(x, y)); write javascript wrappers around everything to bridge between a javascript vector type and a C vector type; or put vectors in the emscripten heap (which would be faster than a JS bridge, but would require matching every cpv() call with a cpvFree() or something). All the options are kind of nasty.

  • Emscripten doesn't use the GC, so you can now leak memory if you don't cpSpaceFree(), etc.

  • As well as running faster, code like this is easier to port. Keeping chipmunkjs updated with the latest version of chipmunk should mostly just require a rebuild.
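
To illustrate the wrapper option from the notes above, here's a sketch of a JS bridge that copies a plain {x, y} object into the module heap before calling in. The Module object here is a stub standing in for emscripten's output, and the function names are hypothetical:

```javascript
// Stub standing in for an emscripten-compiled module: a flat Float64
// heap, a bump allocator, and one fake compiled function.
const Module = {
  HEAPF64: new Float64Array(1024),
  _malloc: (function() {
    let next = 0;
    return function(n) { const p = next; next += n; return p; };
  })(),
  _cpBodySetPos: function(body, vptr) {
    // Stand-in for the real compiled function: record what it was given.
    Module.lastPos = [Module.HEAPF64[vptr >> 3], Module.HEAPF64[(vptr + 8) >> 3]];
  }
};

// The JS bridge: accept a plain {x, y} object, write it into the module
// heap at a scratch slot, and pass the pointer through.
const vptr = Module._malloc(16);  // scratch space for one 2-vector
function bodySetPos(body, v) {
  Module.HEAPF64[vptr >> 3] = v.x;
  Module.HEAPF64[(vptr + 8) >> 3] = v.y;
  Module._cpBodySetPos(body, vptr);
}

bodySetPos(0, { x: 1, y: 2 });
console.log(Module.lastPos); // → [ 1, 2 ]
```

Every call pays for the copy, which is why removing vectors from the API entirely is tempting despite being uglier.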

Privacy and trust: Pick one.

How do you prove your identity on the internet?

Well, the standard PGP model is a web of trust. I present you with my certificate, which is signed by all these other people. How do you know I didn't just make all these people up? Well, their certificates are signed by yet more people. We have this whole web of people-who-know-people going on.

The problem is that to be able to use this network, the whole thing must be public. (Otherwise, an attacker might have just invented a couple of people to sign their certificate).

You could ask random people on the street to verify your identity, but the only way they can tell who you are is by looking at government-provided ID or something. The people you really want verifying your identity are the people who have known you in person for years. Who knows you that well? Friends and family. In a way, the best web of trust is your Facebook friend network. Except that to be effective, you have to make your network visible.

This is a strange conflict that any web of trust has. If the network isn't public, you can't trust the network. If the network is public, you have to give up privacy.

A very surprising conflict.

XMPP in Wave in a box

From the wave in a box mailing list:

On Tue, Jun 11, 2013 at 4:08 PM, Dave wave@glark.co.uk wrote:

Protobuffs in XMPP might not be the most elegant wire protocol, but they're both proven, solid messaging technologies. I can see appeal in replacing them, but for my money the path of least resistance would be to improve these implementations.

They are???

Because you know, wave federation has never worked reliably. I gave a talk back in 2010 explaining how to set up federation in a virtual machine. I practiced my talk a few times, and sometimes (with exactly the same certificates, configuration files, software & OS) it simply wouldn't work. Why not? Two and a half YEARS have passed and still nobody knows.

XMPP server extensions are supposed to be a standard, but last I checked, only a couple of XMPP servers even half-work with wave in a box. Why? Because everyone has implemented the XMPP standard in slightly different, incompatible ways. Are the bugs in WIAB itself or in the XMPP servers? I don't know! The broken behaviour is so complicated that nobody understands how it works, let alone what the problem is.

It's pretty clear that we can't maintain federation based on an XMPP extension. We can't even fix obvious, repeatable bugs.

And we don't even use XMPP for anything! I thought we'd at least use the XMPP server to log in users, but we don't even do that - we maintain our own user list. It's like we're using XMPP as a buggy, hard-to-configure TCP stream that nobody understands. I guess at least it checks our SSL certificates. Whoopie.

I don't understand all the hero worship of XMPP. Has anyone else actually tried to read the spec? There's a reason Google, Apple, Facebook, etc don't use XMPP. It's not because they hate freedom. It's because XMPP is awful.

If you want to convince anyone that it's a 'proven, solid messaging technology', you should start by fixing our XMPP extension code.

I think we should kill it with fire.