String.length LIES!

For a while we thought all the unicode characters were going to fit in 2 bytes. Aah, it was a heady time. Forward-looking companies like Microsoft, Apple, Sun and Netscape started using arrays of 2-byte integers to store characters, using an encoding called UCS-2. But the unicode consortium just kept adding more crap to unicode, and soon enough we overflowed the 65,536 available code points. Now we have the Supplementary Planes, with such characters as the G-Clef: 𝄞 (U+1D11E) and the mathematical symbol for wedding cards: 𝒲 (U+1D4B2).

What to do? 4 bytes for every character is ridiculous. So we invented a new encoding called UTF-16, which encodes all the old, useful unicode characters in the same 2 bytes. The new characters are stored using a special encoding called a surrogate pair, which spreads the character data across two 'characters' (4 bytes).
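If you're curious how the trick works, here's a rough sketch of the arithmetic in Javascript (my illustration, not anything from a spec or library): subtract 0x10000 from the code point, then split the remaining 20 bits across two reserved 16-bit ranges.

// How UTF-16 encodes a supplementary character like U+1D11E (𝄞)
// as a surrogate pair of 16-bit code units.
function toSurrogatePair(codePoint) {
  var v = codePoint - 0x10000;     // 20 bits of payload
  var high = 0xD800 + (v >> 10);   // top 10 bits
  var low = 0xDC00 + (v & 0x3FF);  // bottom 10 bits
  return [high, low];
}

toSurrogatePair(0x1D11E).map(function (n) { return n.toString(16); }); // [ 'd834', 'dd1e' ]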

So of course Java, Javascript, C#, Cocoa (Objective-C), and Microsoft all changed their character type back to one byte and started using UTF-8 because it's better in every way? Ha, no. Or maybe they ported their string libraries to UTF-16? ... Well...

Consider the implementation of a string. Using UCS-2 (two bytes for every character), strings are implemented as arrays of shorts. String.length returns the size of the array. The 100th character in the string is the 100th element in the array. But in UTF-16 or UTF-8, different characters can take up different numbers of bytes. Finding the nth character in a string requires scanning all the previous characters, which is an O(n) operation.
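To make that concrete, here's a sketch (mine, not from any library) of what counting the real length of a UTF-16 string involves: you can't just read a field, you have to walk the whole array of shorts and skip over surrogate pairs.

// Count real characters (code points) by walking the UTF-16 code units.
// O(n), unlike the O(1) .length property.
function realLength(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) i++; // high surrogate: skip its partner
    count++;
  }
  return count;
}

realLength("𝄞"); // 1
"𝄞".length;      // 2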

Well sure, they could just update their string.charAtIndex() and string.length functions to behave correctly in the presence of surrogate pairs, but that's dangerous for old code. Old (working) code might suddenly start performing really badly, or exhibit new security vulnerabilities. (How many bytes should we allocate for this 100 character string? 200 bytes might not be enough!)
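To put a number on that allocation worry (my node one-liner, not from the post): if length started reporting 1 real character for "𝄞", old code that allocates 2 * length bytes would reserve 2 bytes for a string that actually needs 4.

$ node -e 'console.log(Buffer.byteLength("𝄞", "utf16le"))'
4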

So they left strings as arrays of shorts. Now we have this awesome HORRIBLE behaviour:

Javascript

$ node -e 'console.log("𝄞".length)'
2  

Objective-C

NSLog(@"%ld", [[NSString stringWithUTF8String:"𝄞"] length]); // 2  

Java

System.out.println("\uD834\uDD1E".length()); // Prints 2  

Python 2.x

>>> len(u"\U0001D12B")
2  

... And the same thing in pretty much every language from the era. Bonus points: What does substring do, if you try to split the 𝄞 into its apparent two characters? How many times do you need to press the right arrow key in your text editor to move past that character?
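Hint for the bonus round (my own one-liner; exactly what you see depends on your terminal and node version): slicing the G-clef in half hands you back a lone surrogate, which typically gets printed as a replacement character.

$ node -e 'console.log("𝄞".substring(0, 1))'
�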

(Props to C, Dart, Ruby, Python 3 and Go which all handle this correctly.)

So my question is: let's say you're writing a text-based OT system. Your document contains "𝄞". You want to insert at the end of the string. Is that at position 1 or position 2? If you want your system to work in both broken and non-broken programming languages, you need to convert. If you don't, a single 𝄫 will make your whole system fall flat (ha!). How do you convert between the broken and non-broken offsets without scanning every character from the start of the document, which would obviously be slow for big documents?
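For the record, the obvious conversion looks something like this (a sketch of mine, not anyone's shipping OT code), and it's exactly the scan we'd like to avoid: O(offset) from the start of the document, in either direction.

// Convert an offset counted in code points into an offset counted in
// UTF-16 code units. (The reverse direction is the same scan.)
function codePointsToCodeUnits(doc, codePoints) {
  var units = 0;
  for (var i = 0; i < codePoints; i++) {
    var c = doc.charCodeAt(units);
    units += (c >= 0xD800 && c <= 0xDBFF) ? 2 : 1; // surrogate pair: 2 units
  }
  return units;
}

codePointsToCodeUnits("𝄞", 1); // 2: "the end" is code unit 2 in Javascript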

Ugh.