Apple dictionaries, part 2

After I wrote my post about Apple's dictionary files, a mysterious email showed up in my inbox. It was from someone who had spent some time writing code to do the same thing, but who doesn't want to post it under his own name in case he falls foul of his country's DMCA equivalent. Crazy. He said I could post his code on the condition that I took his name off it.

The code he sent me scans the data file in a dictionary looking for gzipped data chunks. The tool simply inflates the chunks and prints out the result to stdout.
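For illustration, here is a rough Python sketch of that scanning approach. It is not the tool from the email. The function name `iter_chunks` is made up for this example, and the sketch assumes the chunks are ordinary zlib or gzip streams that can be recognised by their header bytes.

```python
import zlib


def iter_chunks(blob: bytes):
    """Yield (offset, inflated_bytes) for each complete zlib/gzip stream in blob."""
    view = memoryview(blob)
    pos = 0
    while pos < len(blob):
        # Cheap filter (an assumption): zlib streams normally start with 0x78,
        # gzip streams with the magic bytes 0x1f 0x8b.
        if blob[pos] != 0x78 and blob[pos:pos + 2] != b"\x1f\x8b":
            pos += 1
            continue
        # wbits = 32 + 15 lets zlib auto-detect a zlib or gzip header.
        inflater = zlib.decompressobj(47)
        try:
            data = inflater.decompress(view[pos:])
        except zlib.error:
            pos += 1  # false positive: not a real stream
            continue
        if not inflater.eof:
            pos += 1  # stream was truncated, keep scanning
            continue
        yield pos, data
        # unused_data holds whatever followed the end of the stream.
        pos = len(blob) - len(inflater.unused_data)
```

To mimic the tool's behaviour, you would read the dictionary's data file into memory and write each yielded chunk to `sys.stdout.buffer`. The sketch favours clarity over speed and is not tuned for very large files.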

Each uncompressed chunk starts with a (binary) 4-byte length field (I'm not sure why it needs this), followed by the XML for a dictionary entry. In all, the English-language dictionary comes out to about 180 megs of text.
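A minimal sketch of peeling off that length prefix might look like the following. It rests on guesses that are not confirmed anywhere above: that the field is a little-endian unsigned integer, that it counts the bytes of XML that follow it, and that a chunk may hold more than one such record. The helper name `split_entries` is hypothetical.

```python
import struct


def split_entries(chunk: bytes):
    """Split an inflated chunk into XML strings, assuming length-prefixed records.

    Returns (entries, leftover), where leftover is whatever could not be parsed.
    """
    entries = []
    pos = 0
    while pos + 4 <= len(chunk):
        # Assumption: 4-byte little-endian unsigned length of the XML that follows.
        (length,) = struct.unpack_from("<I", chunk, pos)
        start = pos + 4
        end = start + length
        if length == 0 or end > len(chunk):
            break  # the length does not fit, so stop rather than guess
        entries.append(chunk[start:end].decode("utf-8", errors="replace"))
        pos = end
    return entries, chunk[pos:]
```

Returning the unparsed leftover, instead of raising, makes it easy to count how often the length field fails to line up with the data.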

Entries look a bit like this:

<d:entry id="dictionary_application" d:title="Dictionary application">  
    <d:index d:value="Dictionary application"/>
    <div d:priority="2"><h1>Dictionary application </h1></div>
        An application to look up dictionary on Mac OS X.<br/>
        It's application icon looks like below.
    <img src="Images/dictionary.png" alt=" Icon"/>

I thought about pumping all this XML into an XML parser, but some of the 4-byte length fields coming out of the extraction tool are misaligned. With that in mind, I wouldn't be surprised if some of the XML entries I got out weren't valid XML at all. I don't know why this is. There's probably a bug somewhere, but it doesn't really matter to me. All I wanted was the word list, and I can get that easily enough from the XML by looking for d:title= attributes with grep.

With code:

$ clang dedict.c -Wall -lz -o dedict
$ clang strip.c -Wall -o strip
$ ./dedict "Oxford Dictionary of English" | ./strip > dict.xml
$ grep -oE 'd:title="[^"]+"' dict.xml | awk -F\" '{print $2}' > words

(Edit: If you're trying this now, use the updated Python script listed at the bottom of that gist.)
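The grep and awk step can also be sketched in Python. This is only an illustration of the same idea: it pulls out every `d:title="..."` attribute with a regular expression and does not try to parse the XML, since some entries may not be well formed. The name `extract_titles` is made up, and decoding entities such as `&amp;` is an extra that the shell pipeline does not do.

```python
import html
import re

# Match the attribute value up to the next double quote.
TITLE_RE = re.compile(r'd:title="([^"]*)"')


def extract_titles(xml_text: str):
    """Return every d:title attribute value found in xml_text, in order."""
    return [html.unescape(title) for title in TITLE_RE.findall(xml_text)]
```

Calling `extract_titles` on the contents of dict.xml and writing one title per line would give roughly the same list as the `words` file above.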

As a word list, the dictionary is frankly pretty disappointing. Amongst other things, it has lots of compound phrases in there I didn't really want:

platinum black  
platinum blonde  
platinum disc  
platinum metals  

I guess these phrases each mean something of their own. You wouldn't understand what "platinum blonde" means if you just looked up "platinum" and "blonde" individually. But I can't help feeling that the SCOWL project's word list is better for most programmatic tasks.
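If the compound phrases are simply unwanted, one option is to drop every entry that contains whitespace. The sketch below shows that filter; `single_words` is a hypothetical helper, and whether to keep hyphenated entries (it does) is a judgment call.

```python
def single_words(titles):
    """Keep entries with no internal whitespace, de-duplicated, order preserved."""
    seen = set()
    kept = []
    for title in titles:
        word = title.strip()
        if not word or any(ch.isspace() for ch in word):
            continue  # empty, or a compound phrase such as "platinum black"
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return kept
```

This throws away genuinely useful phrases along with the noise, so it only suits tasks that want single tokens.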

Maybe I can make use of the definitions at some point. And Apple ships 21 dictionaries, including a thesaurus. If nothing else, there are probably some sweet visualisations waiting to be made.

I'm sure someone out there has some more creative uses for this data set than I do. It was a lot of fun pulling the dictionary apart nevertheless.

Edit 2018: Someone ported the C code to a much nicer and smaller Python implementation, which is in the GitHub gist. I got it working again on macOS 10.14, where Apple moved all the dictionaries over to some ungodly asset path.