I’m living in America, and I don’t trust the government here. They read my text messages and my email and they can get access to anything my cloud providers store on me. I also don’t trust Saudi Arabia with my data. Why should I?
I didn’t mind this sort of thing while I was in Australia. It seemed pretty benign, although I’m pretty sure my emails were being recorded by the US government anyway. But allowing governments to record us seems profoundly stupid. What the hell are we doing? Why do we use systems that leak information about ourselves? Why aren’t we using secure systems that don’t have these kinds of flaws?
I guess it’s hard. What would we need to solve this problem? Let’s first consider the attack vector we’re trying to protect against.
So imagine you work for the NSA. Your mandate is: “We want as much information on terrorists as possible. We want to find them and we want to learn everything we can about them so we can foil their dastardly plans and arrest them!”
Your operation has two parts:
- Gather as much information on as many people as possible. Do data mining on your data set to find suspicious people.
- When someone is considered suspicious, learn everything you can about the person. If you still think they’re suspicious, arrest them.
For part 1 you want as much information as possible on as many people as possible. But if you’re trying to snoop on 300 million people, you’re going to be spread a bit thin. Everything you do needs to be automatic and cheap. So you enlist the co-operation of certain companies (like AT&T) to get unencrypted data feeds of phone lines and the internet.
Simply sniffing traffic will give you access to most unencrypted email and let you record HTTP requests to see what people are looking at. You won’t be able to see email passing within a network without the permission of the network operator (for example, traffic from one gmail user to another), but you can sniff email moving between email providers (eg, gmail to hotmail). The US government probably stores all this data forever, and they’ve built a shiny new data center to house it all in Utah.
They probably scan the data in realtime to look for threats. But just in case their scanning doesn’t find something, they’ll save everything to disk anyway. This means that if in 20 years someone wants to make you an enemy of the state, they can pull up that angry email you wrote when your girlfriend dumped you and you needed to vent.
Once someone has been identified as suspicious, you want more information about them. You’re the government, so there’s an awful lot you can do at this point, especially with National Security Letters. National security letters let you request personal information from web hosts and simultaneously stop the host from telling their customer about the intrusion. Here’s what I would do at this point:
- Go to their email host and get every email they ever sent to anyone.
- Intercept their mobile phone and computer. Use an SSL intercept tool to see everything they do on the web. To do this you’ll need a root signing certificate, but you probably already have one.
- You were already collecting their SMSes and recording their phone calls. Go listen to them.
- Go to facebook and get a list of their friends and their conversations there, too. Do the same data gathering task on any of their friends who look suspicious, too.
All information that 3rd parties store about you is fair game. They aren’t even allowed to tell you your data is being accessed, and they don’t need a court order.
Protecting yourself from this nonsense
So the next question to ask is: how do we protect ourselves from this intrusion? I don’t want privacy so I can go sneaking around. I want it because the thought that other people might read my personal emails is creepy. Who knows what they’ll find in there? Or when they’ll look - remember, they’re storing this stuff forever. I don’t know about you, but if you extracted the right quotes from the complete works of Joseph Gentle, you could find enough in there to hang me hundreds of times over. Even just talking to the police is a bad idea.
Luckily, the universe believes in encryption. Even without legislation, we have the means to nip this intrusion in the bud. The solution is quite simple:
To stop passive wiretaps, we need to encrypt everything going over the wire. At the very least, site operators should adopt https everywhere. It’s a travesty that most web traffic and email is sent unencrypted over the open internet.
It would help, but simply encrypting everything over the wire isn’t good enough if the government can request or demand access to site operators’ computers and networks. We also need end-to-end encryption. My data shouldn’t be readable by anyone, even the people storing my data on my behalf.
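Just to make “end-to-end” concrete: the provider should only ever hold ciphertext. Here’s a minimal sketch using node’s crypto module. It’s a toy to show the shape of the idea, not a secure design:

    // A minimal sketch of the end-to-end idea: encrypt before anything
    // leaves your machine, so the host only ever stores ciphertext.
    var crypto = require('crypto');

    function encrypt(key, plaintext) {   // key: 32 random bytes only you hold
      var iv = crypto.randomBytes(16);
      var cipher = crypto.createCipheriv('aes-256-cbc', key, iv);
      return Buffer.concat([iv, cipher.update(plaintext, 'utf8'), cipher.final()]);
    }

    function decrypt(key, blob) {
      var decipher = crypto.createDecipheriv('aes-256-cbc', key, blob.slice(0, 16));
      return Buffer.concat([decipher.update(blob.slice(16)), decipher.final()]).toString('utf8');
    }

    // The provider stores encrypt(key, message) and never sees the key.
    // Real systems also need authenticated encryption and a sane way to
    // exchange keys, which is a big part of why this is hard.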
End-to-end encryption faces two big problems. First, it’s really hard (expensive) to do correctly. Second, site operators are actively disincentivised from supporting it. If I encrypt all the email I send, google can’t show me ads and I can’t search my mail. We could probably figure out ways to make this stuff work, but today it’s a hard problem.
Right now, the spooks are winning the eavesdropping war not because they are clever, but because we are lazy. Our data is like a bike that keeps getting stolen. We know how to build bike locks - they’re just impractical. We can do better, and we should.
I recently read A plea for better opensource etiquette. It’s basically a whinge about github pull requests and issues getting ignored. This is a really hard topic, because opensource work is mostly volunteer work, and everyone gets precious about their contributions. But most people have no idea what it’s actually like to run an opensource project. Dealing with mountains of patches and bug reports is an awful, time-sinking deluge. Every time someone submits a new pull request, I have three bad options:
- I can stop what I’m doing (several times) to triage the patch. This is the best case, but about 2/3rds of the time people just forget about their patches when I do this, and they hover around and bother me.
- I can accept the patch with its (often unknown) faults.
- I can ignore it and deal with it later, or hope someone else with commit access will find time to deal with it.
As a maintainer of a project with almost 2000 stars on github, I sway back and forth between liberally allowing patches (“Just submit code to get what you want done. We can fix it once it’s in the repository”) and being a stickler for details (“I don’t want your change until you give me tests.”). I really wish I could just give everyone who submits patches commit access and let them at the repository. It would save me time. It would save you time. It would make my project better. I certainly don’t write opensource software because I enjoy being a policeman; I hate it. I hate saying no to real work that solves real problems.
But the sad, unfortunate truth is the majority of pull requests aren’t very good.
- Many submitters don’t understand the project’s conventions (eg, if my project is written in coffeescript, your code needs to be in coffeescript too). Example: https://github.com/josephg/ShareJS/pull/36/files . After I insisted on the code being ported to coffeescript, several users whinged on the project’s mailing list about my language choice.
- Most pull requests don’t have tests. Submitters usually don’t even run the unit tests before submitting, and their pull requests often break the tests we do have. I understand if you don’t write tests in your own applications, but the rules are different in an infrastructure project that many people rely on. I don’t trust myself to write bug-free code without tests, and I understand my project way better than you do.
- People often include extra modifications in their PR that have nothing to do with the patch. For example, this pull request has good parts near the top of the diff, but right down the bottom it makes some unrelated changes that make my code uglier in the name of ‘optimization’. Quotes because benchmarks weren’t actually run: the changes didn’t improve performance, they just made my project ugly and broke my whitespace conventions. https://github.com/josephg/Chipmunk-js/pull/15/files
I usually err on the side of allowing changes and fixing stuff later, but my projects have suffered for this several times. I’ve had maintainers accept pull requests that break the unit tests (and leave them broken for weeks). I’ve had otherwise normal code suddenly sprout extra levels of indentation. I’ve had mountains of bugs appear, filed against a feature I didn’t write, don’t understand and never use, whose author has disappeared. I’ve been burned enough that I can totally understand maintainers who ignore patches and bug reports.
I still love everyone who cares about my projects enough to submit a bug report or take the time to make a pull request. I need contributors to make good software, but it breaks my heart when nice people submit slightly bad code, and I need to either whinge at you or stop working on my own pet feature to clean up your mess. I’m sorry to everyone whose patches get ignored, but sometimes I get tired too.
In short, it’s fun to complain about The Man because your precious donation of code is being ignored. But it’s just as thankless running a project; only we shoulder way more responsibility and burn way more time doing it. If you want me to look at more of your bug reports, help triage my other pending bugs and pull requests. If you want me to stay excited about the project, email me to say how much you like it, and tell me about the cool things you’re doing with my code. If you want commit access, ask for it. Finally, if you think you can do a better job running a project, use the fork button and do something about it.
I was inspired by Machinations at GDC, so Jeremy and I thought it’d be fun to try making our own version. It’s still super primitive (you can’t even create nodes using the editor), but it’s fun to make. I’ve never made a graph editor before, though I’ve wanted a custom graph editor a bunch of times. Useful code++
You can play with it here. Run it with spacebar. The source is all on github. I’m not sure how much longer we’ll keep working on it, although it’d be a great example project to port to ShareJS and google drive realtime to compare the APIs.
Sorry about the mangled formatting. I need another blog layout or something…
A month ago I got hired by Lever and moved over to San Francisco. We’re building an applicant tracking system for hiring. It’s a realtime web app built on top of Derby and Racer, a web framework (and its data layer) written by the company’s cofounders.
Racer doesn’t do proper OT and it doesn’t scale. Over the next few months, I’m going to refactor and rewrite big chunks of ShareJS so we can use it underneath racer to keep data in sync between the browser and our servers. I’m going to refactor ShareJS into a few modules (long overdue), add live queries to ShareJS and make the database layer support scaling.
I want feedback on this before I start. I will break things, but I think it’s worth it in the long term.
So, without further ado, here’s the master plan:
Standardized OT Library
First, ShareJS’s OT types are written to a simple API and don’t depend on any external services. I’m going to pull them out into their own project, akin to libOT.
The types here should be super stable and fast, and preferably written in multiple languages.
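For reference, the API each type implements is tiny. Here’s a rough javascript sketch using the conventions ShareJS’s text type already follows (the package name is made up, and the details are illustrative rather than a spec):

    // Sketch of an OT type. Text ops are lists of components like
    // {p: position, i: 'text to insert'} or {p: position, d: 'text to delete'}.
    var text = require('ot-types').text;   // hypothetical package name

    var doc = text.create();                       // '' - an empty snapshot
    doc = text.apply(doc, [{p: 0, i: 'hello'}]);   // 'hello'

    // Two concurrent ops made against the same snapshot...
    var a = [{p: 5, i: ' world'}];
    var b = [{p: 0, d: 'h'}, {p: 0, i: 'H'}];

    // ...get transformed so they can apply in either order.
    var a2 = text.transform(a, b, 'left');
    doc = text.apply(text.apply(doc, b), a2);      // 'Hello world'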
I considered adding some simple, reusable OT management code in there too, but by the time I pared OT down until I had something reusable, it was just a for loop.
I’m not sure where the text & JSON API wrappers should go. The wrappers are generally useful, but not coded in a particularly reusable way.
Scalable database backend
Next, we need a scalable version of ShareJS’s database code. I want to pull out ShareJS’s database code and make it support scaling the server across multiple machines.
I also want to add:
- Collections: Documents will be scoped by collection. I expect collections to map to SQL tables, mongodb collections or couchdb databases. Collections seem to be a standard, useful thing.
- Live queries: I want to be able to issue a query saying “get me all docs in the profiles collection with age > 50”. The result set should update in realtime as documents are added & removed from that set. This should also work with paginated requests. I don’t want to invent my own query language - I’ll just use whatever native format the database uses (SQL select statements, couchdb views, mongo find() queries, etc). There’s a sketch of how this might feel just after this list.
- Snapshot update hooks: For example, I want to be able to issue a query to a full-text search database (like SOLR) and reuse the same live query mechanism. I imagine this working via a post-update hook that the application can use to update SOLR. As a first pass, I’ll poll all outstanding queries against the database when documents are updated, but I can optimise for certain common use cases down the track.
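To make the live query idea concrete, here’s how I imagine the API feeling, using mongo-style queries. Every name in this sketch is hypothetical:

    // Hypothetical live query API - none of these names exist yet.
    var query = db.query('profiles', {age: {$gt: 50}});

    query.on('ready', function(results) { /* the initial result set */ });
    query.on('insert', function(doc) { /* a doc entered the result set */ });
    query.on('remove', function(doc) { /* a doc left the result set */ });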
I want to get the API here stable first and let the implementation grow in complexity as we need it to be more scalable and reliable. At first, this code will route all messages through a single redis server. Later I want to set it up with a redis slave for automatic failover and make the server shard between multiple DB instances using consistent hashing of document IDs or something.
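If you haven’t seen consistent hashing before: servers get hashed onto positions on a ring, and each document belongs to the next server clockwise from the document’s hash, so adding or removing a server only moves a small fraction of the documents. A toy javascript sketch (node’s crypto module is real; the rest is made up for illustration):

    var crypto = require('crypto');

    // Use the first 4 bytes of an md5 digest as a position on a 2^32 ring.
    function hash(key) {
      return crypto.createHash('md5').update(key).digest().readUInt32BE(0);
    }

    function Ring(servers) {
      // Real implementations add many virtual points per server for balance.
      this.points = servers.map(function(s) {
        return {pos: hash(s), server: s};
      });
      this.points.sort(function(a, b) { return a.pos - b.pos; });
    }

    // Walk clockwise to the first server at or past the document's hash.
    Ring.prototype.lookup = function(docId) {
      var h = hash(docId);
      for (var i = 0; i < this.points.length; i++) {
        if (this.points[i].pos >= h) return this.points[i].server;
      }
      return this.points[0].server;   // wrapped all the way around
    };

    // new Ring(['db1', 'db2', 'db3']).lookup(docId) always maps a given doc
    // to the same server, and adding 'db4' only remaps ~1/4 of the docs.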
I’m nervous about how the DB code and the operational transform algorithm will fit together. If the DB backend doesn’t understand OT, the API will have to be strongly tied to ShareJS’s model code and will be harder to reuse. But if I make it understand OT and subsume ShareJS’s model code, the DB code becomes much harder to adapt to other databases (you’d need to rewrite all that code!). I really love the state of model.coffee in ShareJS at the moment, though it took me 2 near-complete rewrites to get to that point.
I would also like to make a trivial in-memory implementation for examples and for testing. Once I have two implementations and a test suite, it should be possible to rewrite this layer on top of Hadoop or AWS or whatever.
What’s left for ShareJS?
ShareJS’s primary responsibility is to let you access the OT database in a web browser or nodejs client in a way that’s secure & safe.
It will (still) have these components:
- Auth function for limiting reading & writing. I want to extend this for JSON documents to make it easy to restrict / trim access to certain parts of some documents.
- Session code to manage client sessions. All the protocol stuff that’s in session.coffee. I want to rewrite / refactor this to use NodeJS’s new streams.
- Presence, although this will require some rethinking to work with the new database backend stuff.
- A simple API that lets you tell the server when it has a new client, and pass messages for it. I’m sick of all the nonsense around socket.io, browserchannel, sockjs, etc, so I want to just make the transport the user’s problem. Again, this will use the new streams API. This also makes it really easy for applications to send messages to their server that don’t have anything to do with OT. (There’s a sketch of how this might look just after this list.)
- Equivalent connection code on the client, currently in client/connection.coffee.
- Client-side OT code, currently in client/doc.coffee.
- Build script to bundle up & minify the client and required OT types for the browser. I want to rewrite this in Make. (Sorry, Windows developers.)
- Tests. It looks like nodeunit is no longer actively maintained, so it might be time to port the tests to a different framework. (Suggestions? What does everybody use these days?)
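For what it’s worth, here’s roughly how I picture wiring up a client under this design. browserchannel and connect are real; everything on the ShareJS side of this sketch is speculative:

    // Speculative sketch of the stream-based server API described above.
    var connect = require('connect');
    var browserChannel = require('browserchannel').server;
    var share = require('share').createServer();   // hypothetical API

    var app = connect();
    app.use(browserChannel(function(session) {
      // The app owns the transport. ShareJS just gets handed a duplex
      // stream of protocol messages for each connected client.
      share.listen(session);   // hypothetical
    }));
    app.listen(8000);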
ShareJS has slowly become a grab bag of other stuff that I like. I’m not sure whether all this stuff should stay in ShareJS or what.
- The examples. These will wire ShareJS up with browserchannel and express. The examples will add a few dependencies that ShareJS won’t otherwise have.
- The different database backends. Unless someone makes an adapter for my new database code, these are all going to break. Sorry.
- Browser bindings for textareas, ace and codemirror
- All the ongoing etherpad work. I met a bunch of etherpad & etherpad lite developers at an event last week, and they were awesome. Super happy this is happening.
That’s the gist of the redesign. Some thoughts:
I hate making ShareJS more complicated, but at the same time I think it’s important to make it actually useful. People need to scale their servers and they need to be able to build complex applications on top of all this stuff. I love how ShareJS’s entire server is basically encapsulated in one file, and it’ll be a pity to lose that.
This change will break existing code. Sorry. The current DB adapters will break, and putting documents in collections will change APIs all the way through ShareJS.
I’m still not entirely sure how this redesign will interact with my C port of ShareJS. Before I realised how integral ShareJS would be to my current work, I was intending to finish working on my C implementation next. For now, I guess that’ll take a back seat. (In exchange, I’ll be working on this stuff while at work, and not just on weekends.)
This design allows some nice application features. For example, the auth code can much more easily enforce schemas for documents. You could enforce that everything in the ‘code’ collection has type ‘text’, everything in the ‘projects’ collection is JSON (with a particular structure), and items in the ‘profiles’ collection are only editable by the user who owns the profile. You could probably do that before, but it was a bit more subtle.
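Something like this, say. This is a hypothetical sketch of a schema-enforcing auth function; the shape of the action object is entirely speculative:

    // Hypothetical auth hook enforcing per-collection rules.
    share.auth = function(agent, action) {
      if (action.type === 'create' && action.collection === 'code'
          && action.docType !== 'text') {
        return action.reject();   // everything in 'code' must be plain text
      }
      if (action.type === 'update' && action.collection === 'profiles'
          && action.docName !== agent.userId) {
        return action.reject();   // only the owner may edit their profile
      }
      action.accept();
    };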
As I said above, I’m not sure where the line should be drawn between the DB project and the model. If they’re two separate projects, they should have a very clear separation of concerns. I’m really trying to build a DB wrapper that provides the API that I want databases to provide directly, in a scalable way. However, that idea is entangled with the OT and presence functionality. What a mess.
I want feedback on all this stuff. I know a lot of people are building cool stuff with ShareJS, or want to. Do these plans make your lives better or worse? Should we keep the current simple incarnation of ShareJS around? If I’m taking the time to rip the guts out of ShareJS, what else would you like to see changed? How do these ideas interact with the etherpad integration work?
Imagine you were suddenly, irrevocably teleported back in time a few hundred years. You’re stuck in a time before cars and electricity. The slave trade is alive and well. Germs haven’t been invented yet. The average life expectancy is around 25 (I’m not even joking) and nobody is particularly well educated.
The question is, what do you do with your life? Do you live a quiet life like the barbarians around you and raise a family? Seems like a bit of a wasted opportunity. Even if you just run around telling people to wash their hands, you’ll save countless lives.
How far should you push it? Do you tell them about immunisations, radio and Hitler? Do you fight the slave trade? Convincing people that you aren’t crazy sounds really time consuming, and sort of lonely. But the number of lives you would touch would be huge. How many amazing people have gone missing from the last 200 years because they didn’t go to school, their parents died of Polio or they had the wrong skin colour?
It’s an interesting question, because you aren’t going to simply convince people to free their slaves. The “telling people they’re wrong” business doesn’t put food on the table. If you had to become a pariah to make the world better, would you still do it? Would you give up having a family if you had to, to focus on your work, whatever that was? If you had to, would you be willing to be hated and spat on, imprisoned or publicly mocked to make that happen? Would you do it even if you might fail anyway? Would you struggle to buy other people lifetimes? Is it even ethical to say no?
I ask because we have the same choice every day. The future is probably going to be really super awesome. Thanks to genetic engineering, some time in the next fifty years we’re probably going to figure out how to stop ageing. And when we get proper AI going, gee… where do I start? At a minimum it’s going to make human labour unnecessary for survival. In comparison, here in 2013 our lives are short and crappy. We are barbarians; predecessors to humanity’s true form. And we’re also the time travellers with hazy memories of the great future before us.
The question is, how much do you want to make that future happen now? Relatively speaking, there’s a tiny, insignificant number of people actually working on this stuff. You could help, though you might have to give up your cushy job and lifestyle to do it. You might have to burn those savings to go back to university and study biology, with the hope of working in an underfunded research lab. And for all that, you’ll probably fail. Certainly everyone who’s tried to make strong AI so far has failed.
This is my problem. I think of myself as someone who could make a contribution. I want the future I imagine so badly, but every time I’ve actually worked on AI stuff it’s been hard, lonely and demotivating. The head of the AI group I worked for genuinely doesn’t think the sort of thing I want to make is possible. And he may well be right! Another option is running off to join a genetic engineering company, but what would I be doing for them? Making websites and stuff for bad pay. I mean, I don’t know any biology.
Playing video games on the sidelines while I wait for other people to make the future for me seems kind of … unethical? Ignoble? Is that who I really want to be?
And finally, maybe this is all a false dichotomy anyway. Maybe there are ways to work on crazy hard AI and enjoy my life at the same time. Surely, somehow.
It was kind of funny seeing everyone again last night at the 10 year reunion. I think I basically didn’t meet any of you in high school. Maybe we were even friends back then, but the people we were 10 years ago seem like fundamentally different people to who we are now.
Back in high school, we formed our identities from our quirks and flaws. We formed our identities from the expectations of our friends, and we looked at everyone else and became whatever was left over. And we wore those identities like cloaks through school, to show off and hide in, in equal measure. In part, I wanted to be smart and good at programming so if you asked me “who are you?” I had an answer that differentiated me from all of you.
In short, I was who I was because of how I reacted to random shit happening around me. And that’s a terrible way to choose who you want to be. I rolled a 4; I guess I’m a quiet nerdy guy. I suppose I’ll head to the library and get good at maths then.
I didn’t even see my identity as a choice until years later. And that choice is the fundamental change I saw last night. We aren’t the people I remember from school. Each of us chooses our own identity now. We choose consciously, and we choose deliberately. We don’t just play dice. We decide who we are.
That’s who I met last night. People whose personalities haven’t budged but whose character has spent 10 years taking root. 10 years of little decisions about who we want as friends and as lovers. 10 years of decisions about how we want to make a difference in the world - through our work or through the families we’re creating. Or not at all (:D). 10 years to travel, to learn boxing, to start companies, to counsel criminals, make babies and recite poetry to school children.
So I guess, it was very nice meeting you all, and I wish you the very best.
For a while we thought all the unicode characters were going to fit in 2 bytes. Aah, it was a heady time. Forward-looking companies like Microsoft, Apple, Sun and Netscape started using arrays of 2-byte integers to store characters, using an encoding called UCS-2. But the unicode consortium just kept adding more crap to unicode, and soon enough we overflowed the 65,536 available characters. Now we have the Supplementary Planes, with such characters as the G clef: 𝄞 (0x1D11E) and the mathematical symbol for wedding cards: 𝒲 (0x1D4B2).
What to do? 4 bytes for a single character is ridiculous. So we invented a new encoding called UTF-16, which encodes all the old, useful unicode characters in the same 2 bytes. The new characters are stored across two 2-byte code units (4 bytes), called a surrogate pair.
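The encoding itself is simple. Here’s a javascript sketch (the magic constants come straight from the UTF-16 spec):

    // Encode a code point above 0xFFFF as a UTF-16 surrogate pair.
    function toSurrogatePair(codePoint) {
      var v = codePoint - 0x10000;       // 20 bits remain
      var high = 0xD800 + (v >> 10);     // top 10 bits
      var low = 0xDC00 + (v & 0x3FF);    // bottom 10 bits
      return [high, low];
    }

    // toSurrogatePair(0x1D11E) -> [0xD834, 0xDD1E]: the two 'characters'
    // javascript sees inside "𝄞".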
Consider the implementation of a string. Using UCS-2 (two bytes for every character), strings are implemented as arrays of shorts. String.length returns the size of the array. The 100th character in the string is the 100th element in the array. But in UTF-16 or UTF-8, different characters can take up different numbers of bytes. Finding the nth character in a string requires scanning all the previous characters, which is an O(n) operation.
Well sure, they could just update the string.charAtIndex() and string.length functions to behave correctly in the presence of UTF-16 surrogate pairs, but that’s dangerous for old code. Old (working) code might suddenly start performing really badly, or exhibit new security vulnerabilities. (How many bytes should we allocate for this 100 character string? 200 bytes might not be enough!)
So they left strings as arrays of shorts. Now we have this awesome HORRIBLE behaviour:
$ node -e 'console.log("𝄞".length)'
2
NSLog(@"%ld", [[NSString stringWithUTF8String:"𝄞"] length]); // 2
System.out.println("\uD834\uDD1E".length()); // Prints 2
>>> len(u"\U0001D12B")
2
… And the same thing in pretty much every language from the era. Bonus points: What does substring do, if you try to split the 𝄞 into its apparent two characters? How many times do you need to press the right arrow key in your text editor to move past that character?
(Props to C, Dart, Ruby and Go which all handle this correctly.)
So my question is: let’s say you’re writing a text-based OT system. Your document contains “𝄞”. You want to insert at the end of the string. Is that at position 1 or position 2? If you want your system to work in both broken and non-broken programming languages, you need to convert. If you don’t, a single 𝄫 will make your whole system fall flat (ha!). How do you convert between the broken and non-broken offsets without scanning every character from the start of the document, which would obviously be slow for big documents?
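For reference, the naive conversion looks like this. This javascript sketch illustrates the problem rather than solving it:

    // Convert a real (code point) offset into a UTF-16 code unit offset the
    // slow way: scan from the start, counting surrogate pairs. O(n) per call.
    function codePointToUtf16Offset(str, offset) {
      var utf16 = 0;
      while (offset-- > 0) {
        var code = str.charCodeAt(utf16);
        // A high surrogate means this code point spans two code units.
        utf16 += (code >= 0xD800 && code <= 0xDBFF) ? 2 : 1;
      }
      return utf16;
    }

    // codePointToUtf16Offset("𝄞x", 1) -> 2: the position after the clef is
    // 1 in sane languages and 2 in javascript.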
Apparently my benchmark stats are accurate! SGI’s rope implementation really is that slow.
I really hope nobody uses it. It makes me worry about the performance of the rest of the STL if that slow behemoth of code made it into the standard library.
At some point I want to port ShareJS to C, because tiny, fast projects like redis are sexy.
The simplest solution is to break documents into an array of lines. But that complicates the OT code, and doesn’t fix the asymptotic time anyway: O(number of characters) becomes O(number of lines + max line length). It’s still slow when you have lots of lines or binary data. But most editors seem to do that anyway, because worse is better and all that.
The right thing is to use a tree or skip list of smaller character strings, aka a rope. (It’s called a rope because it’s like a complex string. Very funny, yes?)
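The idea in a nutshell: store the document in chunks, and annotate each internal node with the size of its left subtree, so finding any character offset is O(log n) in a balanced tree. A toy sketch, in javascript for brevity (a real rope also rebalances and implements insert & delete):

    // Toy rope: a binary tree of string chunks. Each branch caches the
    // length of its left subtree, so lookups can skip whole subtrees.
    function Leaf(str) { this.str = str; }
    function Branch(left, right) {
      this.left = left;
      this.right = right;
      this.leftLen = len(left);
    }

    function len(node) {
      if (node.str !== undefined) return node.str.length;
      return node.leftLen + len(node.right);
    }

    // O(log n) in a balanced tree, instead of O(n) in a flat buffer.
    function charAt(node, i) {
      while (node.str === undefined) {
        if (i < node.leftLen) {
          node = node.left;
        } else {
          i -= node.leftLen;
          node = node.right;
        }
      }
      return node.str[i];
    }

    var doc = new Branch(new Leaf('hello '), new Leaf('world'));
    // charAt(doc, 7) -> 'o'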
Ropes in C
So I thought maybe this is a good time to revisit the whole problem. I want my library to be straight C (I hate C++), so SGI’s rope implementation is out. SGI’s implementation also doesn’t support unicode, and I want to be able to insert & delete at UTF-8 character offsets.
Benchmarking these libraries is a bitch. Performance is dependent on the document size, which changes with every edit. I need to get some actual edit data from someone creating a document to see how it fares for real.
There’s another unexpected boon in a library like this. It turns out that skip lists can be used to convert line & character offset pairs from an editor into a single global character offset in O(log n) time. I can also use the same trick to store & use tombstones for sharejs’s TP2 text type.
The benchmark I really want to see is my skip list rope compared to a tree rope implementation. It wouldn’t surprise me if trees get better performance than skip lists - I really have no idea. I tried comparing my implementation to the SGI rope implementation in C++ and I got this:
benchmarking librope
did 500000 iterations in 157.574000 ms: 3.173112 Miter/sec
final string length: 5799
benchmarking c string
did 500000 iterations in 1556.024000 ms: 0.321332 Miter/sec
final string length: 5799
benchmarking sgirope
did 500000 iterations in 8442.429000 ms: 0.059225 Miter/sec
final string length: 5799
That’s a slaughter, not a competition.