Monday, August 27, 2007

High performance Objective-C I: a Postscript interpreter

A key component of the metaobject product suite is EGOS, which includes as a central ingredient a custom Postscript Level 3 compatible interpreter. The project was started in part as a hedge against the chance of Apple dropping DisplayPostscript, in part because our Postscript virtualization technique was hitting limits, and in part because it would make getting Objective-C objects out of the interpreter much easier.

At its core, Postscript is a stack-oriented, dynamically typed and highly polymorphic interpreted programming language. So implementing Postscript with Objective-C objects is actually not just convenient when you want to get Objective-C objects out, it is also a good match for the semantics of the language.

So all is good, right? Well, we also need to make sure that performance is competitive, otherwise there really isn't much of a point. How do we find out if performance is competitive? Fortunately, we have the gold standard handily available: Adobe's interpreter was not just used in NeXT's DisplayPostscript, but is also available as the PS Normalizer on Mac OS X . So let's test performance with a little Postscript program:

  %!
  usertime
  0 1 1000000 { 4 mul pop } bind for
  usertime exch sub dup ==
  20 20 moveto /Times-Roman 24 selectfont
  100 string cvs show ( ms) show
  showpage
The program times a loop that multiplies some numbers one million times. It exercises a good deal of the basic execution machinery in the Postscript language: stack manipulation, procedure invocation, array access (a procedure is just an array with the executable bit set), looping and arithmetic. The loop is timed with the usertime command, which returns CPU time used in milliseconds.

This test clocks in at 513 ms (513 ns per iteration) in Preview, which isn't too shabby.

1. The problem

As proof of concept, let's code up some Objective-C equivalent of what the Postscript interpreter has to do in this loop. That should give us a good lower bound for the time taken (lower bound because there will be additional interpretation overhead, and Postscript semantics are slightly more complicated). We need a stack, some number objects and a bit of arithmetic. Easy:
 id startcounter=[NSNumber numberWithInt:0];
 id endcounter=[NSNumber numberWithInt:1000000];
 id counter=startcounter;
 id four=[NSNumber numberWithInt:4];
 while ( [counter intValue] < [endcounter intValue] ) {
  int intResult;
  id result;
  [stack addObject:counter];
  [stack addObject:four];
  intResult = [[stack lastObject] intValue] * [[stack objectAtIndex:[stack count]-2] intValue];
  result=[NSNumber numberWithInt:intResult];
  [stack removeLastObject];
  [stack removeLastObject];
  [stack addObject:result];
  [stack removeLastObject];
  counter=[NSNumber numberWithInt:[counter intValue]+1];
 }
 
Sadly, this takes 4.8 µs per iteration, so our 'lower' bound is almost 10 times slower than our target, and that's without accounting for interpretation. Clearly not good enough. What if we get rid of all that silly stack manipulation code and use a plain C loop?
  id b=[NSNumber numberWithInt:4];
  for (i=0;i < 10000000;i++) {
  id a=[NSNumber numberWithInt:i];
  id c=[NSNumber numberWithInt:[a intValue] * [b intValue]];
 }

2. Mutable State

Objective-C is an imperative object oriented language, meaning objects can change state. However, we have treated numbers as immutable value objects, requiring them to be recreated from scratch. Allocating objects tends to be around 25x more costly than an Objective-C message send, so what if we don't allocate new integer objects, but instead reuse an existing one and just change its value? It turns out we can't use NSNumber for this as it doesn't allow its value to be set, so we need a (trivial) wrapper class for a single integer.
 
   id b=[MPWInteger numberWithInt:4];
   id a=[MPWInteger numberWithInt:0];
   id c=[MPWInteger numberWithInt:0];
   for (i=0;i <10000000;i++) {
  [a setIntValue:i];
  [c setIntValue:[a intValue] * [b intValue]];
  }

That's more like it: 50ns per iteration is 100x better than our first attempt and also 10x better than the target we're aiming for. So taking advantage of mutable state makes our basic plan possible, at least in principle. Of course, we now have to reintroduce the stack and add interpretation.

3. Save the planet

Alas, it turns out that the interpreter really does need fresh instances. While it will discard them quickly in most cases, it sometimes stores them away meaning we can't statically reuse objects the way we did above.

Instead, we need to figure out a way to recycle temporary objects so we can reuse them without spending a lot of time. The common way to do this is to keep a pool of objects from which requests for new MPWInteger instances are satisfied. However, due to the unpredictable nature of the interpreted code, we cannot use the explicit checkin/checkout policy such pools usually require.

Instead we make the pool a circular buffer and use the retain count to verify that an object can be reused. When we get to a position in the pool that has an object, we can reuse that object if the retain count is one, meaning that only the pool has a valid reference. If the retain count of the object is greater than one, someone other than the pool is holding on to the object and it cannot be reused (yet), so we need to get another instance.

This logic is encapsulated in the class MPWObjectCache, which can be used very similarly to a class (factory object) in creating new instances.

 MPWObjectCache* intCache=[[MPWObjectCache alloc] initWithCapacity:20 
        class:[MPWInteger class]];
 id b=[MPWInteger integer:5];
 for (i=0;i < 1000000;i++) {
  id a=GETOBJECT(intCache);
  id d=GETOBJECT(intCache);
  [a setIntValue:i];
  [d setIntValue:[a intValue] * [b intValue]];

This code runs in 100ns per iteration,  so we now have a solution that gives us new or safely recycled objects quickly enough to build on with the confidence the end result will perform acceptably.

4.  Results

Running the Postscript test program from the start of this post in PostView yields a result of 260ns per iteration, meaning that our Objective-C Postscript interpreter is almost twice as fast as Adobe's, at least on this particular workload.  While I wouldn't generalize this isolated result to say that EGOS is a faster interpreter, it clearly shows that it is at least competitive, which was the goal of the exercise.
The fact that it took a measly 20 KLOC illustrates the leverage Objective-C provides:  Ghostscript weighs in at around 250+ KLOC (without drivers).
5.  Conclusion
One of the things I've always liked about Objective-C is that it lets you have your cake and eat it, too:  great expressiveness to solve your problem effectively is always coupled with the ability to get down and dirty and get really great performance, without losing the structure of the original solution.
The most important factor to watch out for in terms of performance tends to be object allocation.  Controlling this factor with a transparent object-cache allowed us to get an overall performance improvement of around 10-20x in the case of a Postscript interpreter, taking performance from unacceptably slow up to and beyond the industry standard.
Of course, this isn't the only factor and Postscript interpretation not the only application.  Stay tuned!