The M1 Macs are out now, and not only does Apple claim they're absolutely smokin', early benchmarks
seem to confirm those claims. I don't find this surprising: Apple has been highly focused on
performance ever since Tiger, and as far as I can tell hasn't let up since.
One maybe somewhat surprising aspect of the M1s is the limitation to "only"
16 Gigabytes of memory. As someone who bought a 16 Kilobyte language card to run the Merlin
6502 assembler on his Apple ][+ and expanded his NeXT cube, which isn't that different from
a modern Mac, to a
whopping 16 Megabytes, this doesn't actually seem that much of a limitation, but it did
cause a bit of consternation.
I have a bit of a theory as to how this "limitation" might tie in to how Apple's outside-the-box
approach to memory and performance has contributed to the remarkable achievement that is the M1.
The M1 is apparently a multi-die package that contains both the actual processor die and the
DRAM. As such, it has a very high-speed interface between the DRAM and the processors.
This high-speed interface, in addition to the absolutely humongous caches, is key to keeping the various functional
units fed. Memory bandwidth and latency are probably the determining factors for many
of today's workloads, with a single access to main memory taking easily hundreds of clock cycles
and the CPU capable of doing a good number of operations in each of these clock cycles.
As Andrew Black wrote: "[..] computation is essentially free, because it happens 'in the cracks' between data fetch and data store; ..".
The tradeoff is that you can only fit so much DRAM in that package for now, but if it fits,
it's going to be super fast.
So how do we make sure it all fits? Well, where Apple might have been "focused" on performance
for the last 15 years or so, they have been completely anal about memory consumption.
When I was there, we were fixing 32 byte memory leaks. Leaks that happened once.
So not an ongoing consumption of 32 bytes again and again, but a one-time leak of 32 bytes.
That dedication verging on the obsessive is one of the reasons iPhones have been besting
top-of-the-line Android phones that have twice the memory. And not by a little, either.
Another reason is the iOS team's steadfast refusal to adopt tracing garbage collection as
most of the rest of the industry did,
and macOS's later abandonment of that technology in favor of the reference counting (RC) they've
been using since NeXTStep 4.0. With increased automation of those reference counting operations
and the addition of weak references, the convenience level for developers is essentially
indistinguishable from a tracing GC now.
The benefit of sticking to RC is much-reduced memory consumption. It turns out that for
a tracing GC to achieve performance comparable with manual allocation, it needs several
times the memory (different studies find different overheads, but at least 4x is a conservative
lower bound). While I haven't seen a study comparing RC, my personal experience is that the
overhead is much lower, much more predictable, and can usually be driven down with little
additional effort if needed.
So Apple can afford to live with more "limited" total memory because they need much less
memory for the system to be fast. And so they can do a system design that imposes this
limitation, but allows them to make that memory wicked fast. Nice.
Another "well-known" limitation of RC that has made it the second choice compared to tracing
GC is the fact that updating those reference counts all the time is expensive, particularly
in a multi-threaded environment where those updates need to be atomic. Well...
fun fact: retaining and releasing an NSObject takes ~30 nanoseconds on current gen Intel, and ~6.5 nanoseconds on an M1
We got that working on x86-64 too :) this further improvement is because uncontended acquire-release atomics are about the same speed as regular load/store on A14
Problem solved. I guess it helps if you can make your own Silicon ;-)
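Those numbers are easy enough to sanity-check with a micro-benchmark along the following lines. This is just my quick sketch, not the original measurement, and since ARC won't let you send retain/release directly, it goes through CFRetain()/CFRelease():

#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSObject *obj=[NSObject new];
        int const iterations=10 * 1000 * 1000;
        NSTimeInterval start=[NSDate timeIntervalSinceReferenceDate];
        for (int i=0; i<iterations; i++) {
            CFRetain((__bridge CFTypeRef)obj);    // atomic refcount increment
            CFRelease((__bridge CFTypeRef)obj);   // matching decrement
        }
        NSTimeInterval elapsed=[NSDate timeIntervalSinceReferenceDate]-start;
        NSLog(@"%.1f ns per retain/release pair",(elapsed/iterations)*1e9);
    }
    return 0;
}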
So Apple's focus on keeping memory consumption under control, which includes but is not limited
to going all-in on reference counting where pretty much the rest of the industry has adopted
tracing garbage collection, is now paying off in a major way ("bigly"? Too soon?). They can get away
with putting less memory in the system, which makes it possible to make that memory really fast.
And that locks in an advantage that'll be hard to duplicate.
It also means that native development will have a bigger advantage compared to web technologies,
because native apps benefit from the speed and don't have a problem with the memory limitations,
whereas web-/electron apps will fill up that memory much more quickly.
One thing that you may have noticed last time around was that we were getting the instance variable names
from the class, but then also still setting the common keys manually. That's a bit of
duplicated and needlessly manual effort, because the common keys are exactly those ivar names.
However, the two pieces of information are in different places, the ivar names in the builder and
the common strings in the parser itself. One way of consolidating this information is by
creating a convenience initializer for decoding to objects as follows:
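Sketched from the description, with illustrative names for the helpers (the ivar-name query and the frequent-strings setter), it looks roughly like this:

-(instancetype)initWithClass:(Class)theClass
{
    self=[self init];
    if (self) {
        // ask the class for its ivar names once...
        NSArray *ivarNames=[theClass ivarNames];            // illustrative helper
        // ...hand the class to the builder...
        [self setBuilder:[[MPWObjectBuilder alloc] initWithClass:theClass]];
        // ...and seed the parser's common-string table with the same names
        [self setFrequentStrings:ivarNames];                // illustrative setter
    }
    return self;
}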
We still compute the ivar names twice, but that's not really such a big deal, and something we
can fix later, just like the issue that we should probably be using property names instead
of instance variable names, which in the case of properties we have to post-process to get
rid of the underscores added by ivar synthesis.
With that, the code to parse to objects simplifies to the following, very
similar to what you would see in Swift with JSONDecoder.
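In sketch form, with the convenience initializer from above (driver names illustrative):

MPWMASONParser *parser=[[MPWMASONParser alloc] initWithClass:[TestClass class]];
NSArray *objects=[parser parsedData:json];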
So, quickly verifying that performance is still the same (always do this!) and...oops! Performance
dropped significantly, from 441ms to over 700ms. How could such an innocuous change lead to a
50% performance regression?
The profile shows that we are now spending significantly more time in MPWSmallStringTable's
objectForKey: method, where it gets the bytes out of the NSString/CFString,
but why that should be the case is a bit mysterious, since we changed virtually nothing.
A little further sleuthing revealed that the strings in question are now instances of NSTaggedPointerString,
where previously they were instances of __NSCFConstantString. The latter has a pointer to its
byte-oriented character representation, which it can simply return, while the former cleverly encodes the
characters in the pointer itself, so it first has to reconstruct that byte representation. The machinery
for constructing that representation, and for computing its size, also appears to be
fairly generic and slow, going through a stream.
This isn't really easy to solve, since the creation of NSTaggedPointerString instances
is hardwired pretty deep in CoreFoundation with no way to disable this "optimization". Although it would
be possible to create a new NSString subclass with a byte buffer and
make sure to convert to that class before putting instances in the lookup table, that seems like
a lot of work. Or we could just revert this convenience.
Damn the torpedoes and full speed ahead!
Alternatively, we really wanted to get rid of this whole process of packing character data
into NSString instances just to immediately unpack them again, so let's
leave the regression as is and do that instead.
Where previously the builder had an NSString *key instance variable, it now has
a char *keyStr and an int keyLen. The string-handling case
in the JSON parser is now split between the key and the non-key case, with the non-key
case still doing the conversion, but the key case directly sending the char*
and length to the builder.
This means that at least temporarily, JSON escape handling is disabled for keys. It's straightforward
to add back: makeRetainedJSONStringStart:length: does all its processing in a character
buffer, only converting to a string object at the very end.
If there is a key, we are in a dictionary, otherwise an array (or top-level). In the dictionary
case, we can now fetch the ValueAccessor via the OBJECTFORSTRINGLENGTH()
macro.
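In sketch form, with the message name and container handling guessed (the macro is the real one):

-(void)writeObject:anObject forKeyString:(const char*)keyStr length:(int)keyLen
{
    if ( keyLen > 0 ) {
        // dictionary case: look up the ValueAccessor straight from the raw bytes
        MPWValueAccessor *accessor=OBJECTFORSTRINGLENGTH(accessorTable,keyStr,keyLen);
        [accessor setValue:anObject forTarget:_tos];
    } else {
        // array (or top-level) case
        [_tos addObject:anObject];
    }
}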
The results are encouraging: 299ms, or 147 MB/s.
The MPWPlistBuilder also needs to be adjusted: as it builds an NSDictionary
and not an object, it actually needs the NSString key, but the parser no longer
delivers those. So it just creates them on the fly:
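A sketch, with the method name mirroring the parser side:

-(void)writeObject:anObject forKeyString:(const char*)keyStr length:(int)keyLen
{
    NSString *key=[[NSString alloc] initWithBytes:keyStr
                                           length:keyLen
                                         encoding:NSUTF8StringEncoding];
    [self writeObject:anObject forKey:key];
}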
Surprisingly, this makes the dictionary parsing code slightly faster, bringing it up to par with
NSJSONSerialization at 421ms.
Eliminating NSNumber
Our use of NSNumber/CFNumber values is very similar to our use of
NSString for keys: the parser wraps the parsed number in the object, the
builder then unwraps it again.
Changing that, initially just for integers, is straightforward: add an integer-valued
message to the builder protocol and implement it.
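The protocol addition itself is a one-liner, and the plist builder's implementation just boxes (sketch; writeObject: as the funnel method is my assumption):

// the new MPWPlistStreaming message:
-(void)writeInteger:(long)number;

// plist builder: still needs an object, so it boxes
-(void)writeInteger:(long)number
{
    [self writeObject:@(number)];
}

The object builder, in contrast, can hand the primitive to the value accessor, which knows the target ivar's type.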
The actual integer parsing code is not in MPWMASONParser but its superclass, and as we don't
want to touch that for now, let's just copy-paste that code, modifying it to return a C primitive type
instead of an object.
-(long)longElementAtPtr:(const char*)start length:(long)len
{
    long val=0;
    int sign=1;
    const char *end=start+len;
    if ( start[0] =='-' ) {
        sign=-1;
        start++;
    } else if ( start[0]=='+' ) {
        start++;
    }
    while ( start < end && isdigit(*start)) {
        val=val*10+ (*start)-'0';
        start++;
    }
    val*=sign;
    return val;
}
I am sure there are better ways to turn a string into an int, but it will do for now.
Similarly to the key/string distinction, we now special case integers.
if ( isReal) {
    number = [self realElement:numstart length:curptr-numstart];
    [_builder writeString:number];
} else {
    long n=[self longElementAtPtr:numstart length:curptr-numstart];
    [_builder writeInteger:n];
}
Again, not pretty, but we can clean it up later.
Together with using direct instance variable access instead of properties to get to
the accessorTable, this yields a very noticeable speed boost:
229 ms, or 195 MB/s.
Nice.
Discussion
What happened here? Just random hacking on the profile and replacing nice object-oriented programming
with ugly but fast C?
Although there is obviously some truth in that (profiles were used, and more C primitive types appeared),
I would contend that what happened was a move away from objects, and particularly away from generic
and expensive Foundation objects ("Foundation oriented programming"?) towards
message oriented programming.
I'm sorry that I long ago coined the
term "objects" for this topic because it gets many people to focus on the
lesser idea.
The big idea is "messaging" -- that is what the kernal of Smalltalk/Squeak
is all about (and it's something that was never quite completed in our
Xerox PARC phase). The Japanese have a small word -- ma -- for "that which
is in between" -- perhaps the nearest English equivalent is "interstitial".
The key in making great and growable systems is much more to design how its
modules communicate rather than what their internal properties and
behaviors should be.
It turns out that message oriented programming (or should we call it
Protocol Oriented Programming?)
is where Objective-C shines: coarse-grained objects, implemented
in C, that exchange messages, with the messages also as primitive
as you can get away with. That was the idea, and when you
follow that idea, Objective-C just hums, you get not just
fast, but also flexible and architecturally nicely decoupled objects: elegance.
The combination of objects + primitive messages is very similar
to another architecturally elegant and productive style: Unix
pipes and filters. The components are in C and can have as rich
an internal structure as you want, but they have to talk to each
other via byte-streams. This can also be made very fast, and
also prevents or at least reduces coupling between the components.
Another aspect is the tension between an API for use and an
API for reuse, particularly within the constraints of call/return.
When you get tasked with "Create a component + API for parsing JSON", something
like NSJSONSerialization is something you almost have to
come up with: feed it JSON, out comes parsed JSON. Nothing could be
more convenient to use for "parsing JSON".
MPWMASONParser on the other hand is not convenient at all
when viewed in isolation, but it's much more capable of being smoothly
integrated into a larger processing chain. And most of the work that
NSJSONSerialization did in the name of convenience is
now just wasted: it doesn't make further processing any easier but
sucks up enormous amounts of time.
Anyway, let's look at the current profile:
First, times are now small enough that high-resolution (100µs) sampling is necessary to get meaningful results.
Second, the NSNumber/CFNumber and NSString packing and unpacking is gone,
with an even bigger chunk of the remaining time now going to object creation. objc_msgSend() is now starting to
actually become noticeable, as is the (inefficient) character level parsing. The accessors of our test objects
start to appear, if barely.
With the work we've done so far, we've improved speed around 5x from where we started, and at 195 MB/s are almost
20x faster than Swift's JSONDecoder.
I can help not just Apple, but also you and your company/team with performance and agile coaching, workshops and consulting.
Contact me at info at metaobject.com.
Last time, we actually made some significant headway by taking advantage of the dematerialisation of the
plist intermediate representation. So instead of first producing an array of dictionaries, we went directly from
JSON to the final object representation.
This got us down from around 1.1 seconds to a little over 600 milliseconds.
It was accomplished by using the Key Value Coding method setValue:forKey: to directly set the attributes of the
objects from the parsed JSON. Oh, and instantiating those objects in the first place, instead of dictionaries.
That this should be so much faster than most other methods, for example beating Swift's JSONDecoder() by a cool
7x, is a little surprising, given that KVC is, as I mentioned in the first article of the series, the slowest mechanism
for getting data in and out of objects short of deliberate Rube Goldberg mechanisms.
Key-value coding is a data access mechanism in which the properties of an object are accessed indirectly by key or name, rather than directly as fields or by invocation of accessor methods. It is used throughout Enterprise Objects but is perhaps most useful to you when accessing data in relationships between enterprise objects.
Key-value coding enables the use of keypaths to traverse relationships. For example, if a Person entity has a relationship called toPhoto whose destination entity (called PersonPhoto) contains an attribute called photo, you could access that data by traversing the keypath toPhoto.photo from within a Person object.
Keypaths are just one way key-value coding is an invaluable feature of Enterprise Objects. In general, though, it is most useful in providing a consistent way to access an object's data. Rather than needing to know if an object's data members have accessor methods, what the names of those accessor methods are, or if the data is accessible through fields, all you need to know are the keys that represent an object’s data. Key-value coding automatically finds the data, regardless of how the object provides its data. In this context, key-value coding satisfies the classic design pattern principle to “encapsulate the things that varies.”
It still is an extremely powerful programming technique that lets us write algorithms that work generically with any
object properties, and is currently the basis for CoreData, AppleScript support, Key Value Observing and Bindings.
(Though I am somewhat skeptical of some of these, not least for performance reasons, see The Siren Call of KVO and (Cocoa) Bindings). It was
also part of the inspiration for Polymorphic Identifiers.
The core of KVC are the valueForKey: and setValue:forKey: messages, which have default
implementations in NSObject. These default implementations take the NSString key,
derive an accessor message from that key and then send the message, either setting or returning a value.
If the value that the underlying message takes/returns is a non-object type, then KVC wraps/unwraps as necessary.
If this sounds expensive, then that's because it is. To derive the set accessor from the key, the first character
of the key has to be capitalized, the string "set" prepended and the string converted to an Objective-C selector
(SEL). In theory, this has to be done on every call to one of the KVC methods, and it has to be done
with NSString objects, which do a fantastic job of representing human-visible text, but are a bit
heavy-weight for low-level work.
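Concretely, for a key like "hi", the derivation has to do something like the following (a sketch of the mechanism, not Apple's actual code):

NSString *key=@"hi";
NSString *setterName=[NSString stringWithFormat:@"set%@%@:",
                        [[key substringToIndex:1] uppercaseString],
                        [key substringFromIndex:1]];
SEL setter=NSSelectorFromString(setterName);   // -> @selector(setHi:)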
Doing the full computation on every invocation would be way too expensive, so Apple caches some of the intermediate
results. As there is no obvious place to put those intermediate results, they are placed in global hash tables,
keyed by class and property/key name. However, even those lookups are still significantly more expensive than the
final set or get property access, and we have to do multiple lookups. Since these tables have to be global,
locking is also required.
ValueAccessor
All this expense could be avoided if we had a custom object to mediate the access, rather than a naked
NSString. That object could store those computed values, and then provide fast and
generic access to arbitrary properties. Enter MPWValueAccessor (.h .m).
A word of warning: unlike MPWSmallStringTable, MPWValueAccessor is mostly experimental code.
It does have tests and largely works, but it is incomplete in many ways and also contains a bunch of extra and
probably extraneous ideas. It is sufficient for our current purpose.
The core of this class is the AccessPathComponent struct.
typedef struct {
    Class   targetClass;
    int     targetOffset;
    SEL     getSelector,putSelector;
    IMP0    getIMP;
    IMP1    putIMP;
    id      additionalArg;
    char    objcType;
} AccessPathComponent;
This struct contains a number of different ways of getting/setting the data:
the integer offset into the object where the ivar is located
a pair of Objective-C selectors/message names, one for getting, one for setting.
a pair of function pointers to the Objective-C methods that the respective selectors resolve to
the additional arg is the key, to be used for keyed access
The getIMP and putIMP are initialized to objc_msgSend(), so they
can always be used. If we bind the ValueAccessor to a class, those function pointers get
resolved to the actual getter/setter methods. In addition the objcType gets set to the
type of the instance variable, so we can do automatic conversions like KVC. (This was some code
I actually had to add between the last instalment and the current one.)
The key takeaway is that all the string processing and lookup that KVC needs to do on every call is
done once during initialization, after that it's just a few messages and/or pre-resolved
function calls.
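In use, that looks roughly like this (message names per the description above; the exact API may differ):

MPWValueAccessor *hiAccessor=[MPWValueAccessor valueForName:@"hi"];
[hiAccessor bindToClass:[TestClass class]];          // one-time: resolve IMPs, offset, objcType
[hiAccessor setValue:@(42) forTarget:aTestObject];   // fast path from here on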
Hooking up the ValueAccessor
Adapting the MPWObjectBuilder (.h .m) to use MPWValueAccessor was much
easier than I had expected. The following shows the changes made:
The bulk of the changes come as part of the new -setupAccessors: method. It first asks the
class what its instance variables are, creates a value accessor for each instance variable name,
binds the accessor to the class and finally puts the accessors in a lookup table keyed by name.
The -writeObject:forKey: method is modified to look up and use a value accessor instead
of using KVC.
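Sketched out, with the ivar-name query and the table API as assumed helpers:

-(void)setupAccessors:(Class)theClass
{
    NSArray *ivarNames=[theClass ivarNames];              // assumed helper
    MPWSmallStringTable *table=[MPWSmallStringTable new];
    for (NSString *name in ivarNames) {
        MPWValueAccessor *accessor=[MPWValueAccessor valueForName:name];
        [accessor bindToClass:theClass];
        [table setObject:accessor forKey:name];           // assumed table API
    }
    accessorTable=table;
}

-(void)writeObject:anObject forKey:(NSString*)aKey
{
    MPWValueAccessor *accessor=[accessorTable objectForKey:aKey];
    [accessor setValue:anObject forTarget:_tos];
}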
Results
The parsing driver code didn't have to be changed, re-running it on our non-representative 44 MB JSON file
yields the following time:
441 ms.
Now we're really starting to get somewhere! This is just shy of 100 MB/s and 10x faster than Swift's
JSONDecoder, and within 5% of raw NSJSONSerialization.
Analysis and next steps
Can we do better? Why yes, glad you asked. Let's have a look at the profile.
First thing to note is that object-creation (beginDictionary) is now the #1 entry under the parse,
as it should be. This is another indicator that we are not just moving in the right direction, but also
closing in on the endgame.
However, there is still room for improvement. For example, although actually searching the SmallStringTable for
the ValueAccessor (offsetOfCStringWithLengthInTableOfLength()) takes only 2.7% of the time, about
the same as getting the internal char* out of a CFString via the fast-path
(CFStringGetCStringPtr()), the total time for the -objectForKey: is a multiple of
that, at 13%. This means that unwrapping the NSString takes more time than doing the actual
work. Wrapping the char* and length into an NSString also takes significant time,
and all of this work is redundant...we would be better off just passing along the char* and length.
A similar wrap/unwrap situation occurs with integers, which we first turn into NSNumbers, only to
immediately get the integer out again so we can set it.
objc_msgSend() also starts getting noticeable, so looking at a bit of IMP-caching and just eliminating
unnecessary indirection also seems like a good idea.
That's another aspect of optimization work: while the occasional big win is welcome, getting to truly
outstanding performance means not being satisfied with that, but slogging through all the small-ish
seeming details.
Note
I can help not just Apple, but also you and your company with performance and agile coaching, workshops and consulting.
Contact me at info at metaobject.com.
After initially disappointing results trying to get to faster JSON processing (parsing, for now), we
finally got parity with NSJSONSerialization, more or less, in the last instalment, with the
help of MPWSmallStringTable to unique our strings before turning them into
objects, string creation being surprisingly expensive even for tagged pointer strings.
Cutting out the Middleman: ObjectBuilder
In the first instalment of this series, we saw that we could fairly trivially
create objects from the plist created by NSJSONSerialization.
MPWObjectBuilder (.h.m) is a subclass of MPWPlistBuilder that changes just
a few things: instead of creating dictionaries, it creates objects, and instead of using
-setObject:forKey: to set values in that dictionary, it uses the KVC message
-setValue:forKey: (vive la petite différence!) to set values in that object.
That's it! Well, all that need concern us for now, the actual class has some additional
features that don't matter here. The _tos instance variable is the top
of a stack that MPWPlistBuilder maintains while constructing the result.
The MPWObjectCache is just a factory for creating objects.
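The relevant overrides, in sketch form (the object-cache macro and the stack handling are simplified, and the names are illustrative):

-(void)beginDictionary
{
    // create a domain object instead of an NSMutableDictionary
    [self pushContainer:GETOBJECT(cache)];   // illustrative
}

-(void)writeObject:anObject forKey:(NSString*)aKey
{
    // KVC instead of -setObject:forKey:
    [_tos setValue:anObject forKey:aKey];
}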
Not the most elegant code in the universe, and not a complete parser by any stretch of the
imagination, but workable.
Result: 621 ms.
Not too shabby, only 50% slower than base NSJSONSerialization on our non-representative 44MB JSON file,
but creating the final objects, instead of just the intermediate representation, and around 7x faster than Apple's JSONDecoder.
Although still below 100 MB/s and nowhere near 2.5 GB/s we're also starting to close in on the performance level
that should be achievable given the context, with 140ms for basic object creation and 124ms for a mostly empty parse.
Analysis and next steps
Ignoring such trivialities as actually being useful for more than the most constrained situations
(array of single kind of object), how can we improve this? Well, make it
faster, of course, so let's have a look at the profile:
As expected, the KVC code is now the top contributor, with around 40% of total runtime.
(The locking functions that show up as siblings of -setValue:forKey: are almost
certainly part of that implementation; this slight misattribution of times is something you
should generally expect and be aware of with Instruments. I am guessing it has to do with missing frame-pointers
(-fomit-frame-pointer), but I don't really feel any deep urge to investigate, as it doesn't
materially impact the outcome of the analysis.)
I guess that's another point: gather enough data to inform your next step, certainly no less, but also no more.
I see both mistakes, the more common one definitely being making things "fast" without enough data. Or any,
for that matter. If I had a €uro for every project that claims high performance without any (comparative)
benchmarking, simply because they did something the authors think should be fast, well, you know, ....
The other extreme is both less common and typically less bad, as at least you don't get the complete
nonsense of performance claims not backed by any performance testing, but running a huge battery of
benchmarks on every step of an optimization process is probably going to get in the way of achieving
results, and yes, I've seen this in practice.
In the previous instalments, we looked at and analysed the status quo for JSON parsing on Apple platforms in general and Swift in particular, and it wasn't all that promising: we know
that parsing to an intermediate representation of Foundation plist types (dictionaries, arrays,
strings, numbers) is one of the worst possible ideas, yet it is the fastest we have. We know
that creating objects from JSON is, or at least should be, the slowest part of this, yet it
is by far the fastest, and last, not least, we also know that KVC is the slowest possible way
to transfer values to those objects, yet Swift Coding somehow manages to be several times slower.
So either we're wrong about all of these things we know, always a distinct possibility, or there is
something fishy going on. My vote is on the latter, and while figuring out exactly what
fishy thing is going on would probably be a fascinating investigation for an Apple performance
engineer, I prefer proof by creation:
Just make something that doesn't have these problems. In that case you not only know
where the problem is, you also have a better alternative to use.
MASON
Without much further ado, here is the definition of the MPWMASONParser class:
What it does is send messages of the MPWPlistStreaming protocol to
its builder property. So a Message-oriented parser for JaSON,
just like MAX is the Message oriented API for XML.
The implementation-history is also reflected in the fact that it is a subclass of
MPWXmlAppleProplistReader, which itself is a subclass of
MPWMAXParser.
The core of the implementation is a loop that handles JSON syntax and sends one-way messages for the
different elements to the builder. It looks very similar to loops in other simple parsers (and probably not at all like the crazy SIMD contortions of simdjson). When done, it returns whatever the builder constructed.
-parsedData:(NSData*)jsonData
{
    [self setData:jsonData];
    const char *curptr=[jsonData bytes];
    const char *endptr=curptr+[jsonData length];
    const char *stringstart=NULL;
    NSString *curstr=nil;
    while (curptr < endptr ) {
        switch (*curptr) {
            case '{':
                [_builder beginDictionary];
                inDict=YES;
                inArray=NO;
                curptr++;
                break;
            case '}':
                [_builder endDictionary];
                curptr++;
                break;
            case '[':
                [_builder beginArray];
                inDict=NO;
                inArray=YES;
                curptr++;
                break;
            case ']':
                [_builder endArray];
                curptr++;
                break;
            case '"':
                parsestring( curptr , endptr, &stringstart, &curptr );
                curstr = [self makeRetainedJSONStringStart:stringstart length:curptr-stringstart];
                curptr++;
                if ( *curptr == ':' ) {
                    [_builder writeKey:curstr];
                    curptr++;
                } else {
                    [_builder writeString:curstr];
                }
                break;
            case ',':
                curptr++;
                break;
            case '-':
            case '0':
            case '1':
            case '2':
            case '3':
            case '4':
            case '5':
            case '6':
            case '7':
            case '8':
            case '9':
            {
                BOOL isReal=NO;
                const char *numstart=curptr;
                id number=nil;
                if ( *curptr == '-' ) {
                    curptr++;
                }
                while ( curptr < endptr && isdigit(*curptr) ) {
                    curptr++;
                }
                if ( *curptr == '.' ) {
                    curptr++;
                    while ( curptr < endptr && isdigit(*curptr) ) {
                        curptr++;
                    }
                    isReal=YES;
                }
                if ( curptr < endptr && (*curptr=='e' || *curptr=='E') ) {
                    curptr++;
                    while ( curptr < endptr && isdigit(*curptr) ) {
                        curptr++;
                    }
                    isReal=YES;
                }
                number = isReal ?
                    [self realElement:numstart length:curptr-numstart] :
                    [self integerElementAtPtr:numstart length:curptr-numstart];
                [_builder writeString:number];
                break;
            }
            case 't':
                if ( (endptr-curptr) >=4 && !strncmp(curptr, "true", 4)) {
                    curptr+=4;
                    [_builder pushObject:true_value];
                }
                break;
            case 'f':
                if ( (endptr-curptr) >=5 && !strncmp(curptr, "false", 5)) {
                    curptr+=5;
                    [_builder pushObject:false_value];
                }
                break;
            case 'n':
                if ( (endptr-curptr) >=4 && !strncmp(curptr, "null", 4)) {
                    [_builder pushObject:[NSNull null]];
                    curptr+=4;
                }
                break;
            case ' ':
            case '\n':
                while (curptr < endptr && isspace(*curptr)) {
                    curptr++;
                }
                break;
            default:
                [NSException raise:@"invalidcharacter" format:@"JSON invalid character %x/'%c' at %td",*curptr,*curptr,curptr-(char*)[data bytes]];
                break;
        }
    }
    return [_builder result];
}
It almost certainly doesn't correctly handle all edge-cases, but doing so is unlikely to impact
overall performance.
Dematerializing Property Lists with MPWPlistStreaming
Above, I mentioned that MASON is message-oriented, and that its main
purpose is sending messages of the MPWPlistStreaming protocol to its
builder. Here is that protocol:
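Reconstructed from the messages the parser above actually sends (the real protocol may contain a few more):

@protocol MPWPlistStreaming

-(void)beginArray;
-(void)endArray;
-(void)beginDictionary;
-(void)endDictionary;
-(void)writeKey:aKey;
-(void)writeString:aString;
-(void)pushObject:anObject;
-result;

@end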
What this enables is using property lists as an intermediate format without actually
instantiating them, instead sending the messages we would have sent if we had a
property list. Protocol Oriented Programming, anyone? Oh, I forgot, you can only
do that in Swift...
The same protocol can also be used on the output side, then you get something like
Standard Object Out.
Trying it out
By default, MPWMASONParser sets its builder to an instance of
MPWPlistBuilder, which, as the name hints, builds property lists.
Just like NSJSONSerialization.
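So trying it out is just a matter of feeding it the same 44 MB of JSON:

MPWMASONParser *parser=[MPWMASONParser new];
id plist=[parser parsedData:jsonData];   // same array-of-dictionaries shape as NSJSONSerialization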
Hmm...that's disappointing. We didn't do anything wrong, yet almost 50% slower
than NSJSONSerialization. Well, those dang Apple engineers do
know what they're doing after all, and we should probably just give up.
Well, not so fast. Let's at least check out what we did wrong. Unleash
the Cracken...er...Instruments!
So that's interesting: the vast majority of time is actually spent in Apple code building the plist.
And we have to build the plist. So how does NSJSONSerialization get the same
job done faster? Last I checked (that was with NSPropertyListSerialization, but close enough),
they actually use specialised CoreFoundation-based dictionaries that
are optimized for the case of having a lot of string keys and having them all in one place
during initialization. These are not exposed, CoreFoundation being C-based means non-exposure
is very effective and apparently Apple stopped open-sourcing CFLite a while ago.
In Part 1: The Status Quo, we
saw that something isn't quite right with JSON processing in Apple land: while something like simdjson can accomplish
the basic parsing task at a rate of 2.5 GB/s and creating objects happens at an equivalent rate of 310 MB/s, Swift's
JSON Codable support manages a measly 10 MB/s, underperforming the MacBook Pro's built-in SSD by at least 200x and
a Gigabit network connection still by a factor of 10.
Some of the feedback I got indicated that the implications of the data presented in "Status Quo" were not as clear
as they should have been, so a little analysis before we dive into code.
The MessagePack decoder is the only "pure" Swift Codable decoder. As it is so slow as to make the rest of the graph almost
unreadable and was only included for comparison, not actually being a JSON decoder, let's leave it out
for now. In addition, let's show how much time of each result is the underlying parser and how much time is spent in
object creation.
This chart immediately lays to rest two common hypotheses for the performance issues of Swift Codable:
It's the object creation.
No.
That is, yes, object creation is slow compared to many other things, but here
it represents only around 3% of the total runtime. Yes, finding a way to reduce that final 3% would also be
cool (watch this space!), but how about tackling the 97% first?
It's the fact that it is using NSJSONSerialization and therefore Objective-C under the hood that makes it slow.
No.
Again, yes, parsing something to a dictionary-based representation that is more expensive than the
final representation is not ideal and should be avoided. This is one of the things we will be doing. However:
The NSJSONSerialization part of decoding makes up only 13% of the running time, the
remaining 87% are in the Swift decoder part.
Turning the dictionaries into objects using Key-Value-Coding, which to me is just about the slowest
imaginable mechanism for getting data into an object that's not deliberately adding Rube-Goldberg
elements, "only" adds 740ms to the basic NSJSONSerialization's parse from JSON to
dictionaries. While this is ~50% more time than the parse to dictionaries and 5x the pure object
creation time, it is still 5x less than the Codable overhead.
All the pure Swift parsers are also this slow or slower.
It also shows that stjson is not a contender (not that it ever claimed to be), because it is slower than even
Swift's JSONDecoder without actually going to full objects. JASON is significantly faster, but also doesn't
go to objects, and for not going to objects is still significantly slower than NSJSONSerialization.
That really only leaves the NSJSONSerialization variants as useful comparison
points for what is to come, the rest is either too slow, doesn't do what we need it to do, or both.
Here we can see fairly clearly that creating objects instead of dictionaries would be better. Better than
creating dictionaries and certainly much better than first creating dictionaries and then objects,
as if that weren't obvious. It is also clear that the actual parsing of JSON text doesn't add all that
much extra overhead relative to just creating the dictionaries. In fact, just adding the -copy to
convert from mutable dictionaries to immutable dictionaries appears to take more time than the parse!
In truth, it's actually not quite that way, because as far as I know, NSJSONSerialization, like
its companion NSPropertyListSerialization uses special dictionaries that are cheaper to
create from a textual representation.
simdjson
With all that in mind, it should be clear that simdjson, although it would likely take the pure parse time
for that down to around 17 ms, is not that interesting, at least at this stage. What it optimizes is the part that
already takes the least time, and is already overwhelmed by even small changes in the way we create our
objects.
What this also means is that simdjson will only be useful if it doesn't make object creation slower. This is
also a lesson I learned when creating the MAX XML parser: you can't just make the XML parser part as fast
as possible, sometimes it makes sense to make the parser itself somewhat slower if that means other parts,
such as object creation, significantly faster. Or more generally: it's not enough to have fast components,
they have to play well together. Optimization is about systems and architecture. If you want to do it well.
MASON
In the next installment, we will start looking at the actual parser.
I just finished watching Daniel Lemire's talk on the current iteration of simdjson, a JSON parser that clocks in at 2.5GB/s! I've been following Daniel's work for some time now and can't really recommend it highly enough.
This reminded me of a recent twitter conversation where I had offered to contribute a fast, Swift-compatible JSON parser loosely based on MAX, my
fast and convenient XML parser. Due to various factors most of which are not under my control, I can't really offer anything that's
fast when compared to simdjson, but I can manage something quite a bit less lethargic than what's currently on offer
in the Apple and particularly the Swift world.
Environmental assumptions and constraints
My first assumption is that we are going to operate in the Apple ecosystem, and for simplicity's sake I am going to use macOS.
Next, I will assume that what we want from our parse(r) are domain objects for further processing within our application
(or structs, the difference is not important in this context).
We are going to use the following class with a mix of integer and string instance variables, in Swift:
@objc class TestClass: NSObject, Codable {
    let hi:Int
    let there:Int
    let comment:String
    ...
}
To make it all easy to measure, we are going to use one million objects, which we are going to initialise with increasing integers and the constant string "comment". This yields the same 44MB JSON file with different serialisation methods, which can be correctly parsed by all the parsers tested. This is obviously a very simple class and file structure, but I think it gives a reasonable approximation for real-world use.
The first thing to check is how quickly we can create these objects straight in code, without any parsing.
That should give us a good upper
bound for the performance we can achieve when parsing to domain objects.
#define COUNT 1000000

-(void)createObjects
{
    NSMutableArray *objResult=[NSMutableArray arrayWithCapacity:COUNT+20];
    for ( int i=0;i<COUNT;i++ ) {
        TestClass *cur=[TestClass new];
        cur.hi=i;
        cur.there=i;
        cur.comment=@"comment";
        [objResult addObject:cur];
    }
    NSLog(@"Created objects in code w/o parsing %@ with %ld objects",objResult[0],[objResult count]);
}
On my Quad Core, 2.7GHz MBP '18, this runs in 0.141 seconds. Although we aren't actually parsing, it would mean that
just creating all the objects that would result from parsing our 44MB JSON file would yield a rate of 312 MB/s.
Wait a second! 312MB/s is almost 10x slower than Daniel Lemire's parser, the one that actually parses JSON, and we are only
creating the objects that would result if we were parsing, without doing any actual parsing.
This is one of the many unintuitive aspects of parsing performance: the actual low-level, character-level parsing is generally the
least important part for overall performance. Unless you do something crazy like use NSScanner. Don't use NSScanner. Please.
One reason this is unintuitive is that we all learned that performance is dominated by the innermost loop, and character level processing
is the innermost loop. But when you have orders of magnitude in performance differences and the inner and outer loops
differ by less than that amount, the stuff happening in the outer loop can dominate.
NSJSONSerialization
Apple's JSON story very much revolves around NSJSONSerialization, much like most of the rest of
its serialization story revolves around the very similar NSPropertyListSerialization class. It has
a reasonably quick implementation, turning the 44 MB JSON file into an NSArray of NSDictionary
instances in 0.421 seconds when called from Objective-C, for a rate of 105 MB/s. From Swift, it takes 0.562 seconds, for 78 MB/s.
Of course, that gets us to a property list (array of dicts, in this case), not to the domain objects we actually want.
If you read my book (did I mention my book? Oh, I think I did), you will know that this type of dictionary
representation is fairly expensive: expensive to create, expensive in terms of memory consumption and
expensive to access. Just creating dictionaries equivalent to the objects we created before takes 0.321 seconds,
so around 2.5x the time for creating the equivalent objects and a "rate" of 137 MB/s relative to our 44 MB JSON file.
-(void)createDicts
{
    NSMutableArray *objResult=[NSMutableArray arrayWithCapacity:COUNT+20];
    for ( int i=0;i<COUNT;i++ ) {
        NSMutableDictionary *cur=[NSMutableDictionary dictionary];
        cur[@"hi"]=@(i);
        cur[@"there"]=@(i);
        cur[@"comment"]=@"comment";
        [objResult addObject:cur];
    }
    NSLog(@"Created dicts in code w/o parsing %@ with %ld objects",objResult[0],[objResult count]);
}
Creating the dict in a single step using a dictionary literal is not significantly faster, but creating
an immutable copy of the mutable dict after we're done filling brings the time to half a second.
Getting from dicts to objects is typically straightforward, if tedious: just fetch the entry of the dictionary
and call the corresponding setter with the value thus retrieved from the dictionary. As this isn't production
code and we're just trying to get some bounds of what is possible, there is an easier way: just use Key Value
Coding with the keys found in the dictionary.
The combined code, parsing and then creating the objects is shown below:
-(void)decodeNSJSONAndKVC:(NSData*)json
{
    NSArray *keys=@[ @"hi", @"there", @"comment"];
    NSArray *plistResult=[NSJSONSerialization JSONObjectWithData:json options:0 error:nil];
    NSMutableArray *objResult=[NSMutableArray arrayWithCapacity:plistResult.count+20];
    for ( NSDictionary *d in plistResult) {
        TestClass *cur=[TestClass new];
        for (NSString *key in keys) {
            [cur setValue:d[key] forKey:key];
        }
        [objResult addObject:cur];
    }
    NSLog(@"NSJSON+KVC %@ with %ld objects",objResult[0],[objResult count]);
}
Note that KVC is slow. Really slow. Order-of-magnitude slower than just sending messages kind of slow, and so it has significant impact on the total time, which comes to a total of 1.142 seconds including parsing and object creation,
or just shy of 38 MB/s.
Swift JSON Coding
For the first couple of releases of Swift, JSON support by Apple was limited to a wrapped NSJSONSerialization, with the slight
performance penalty already noted. As I write in my book (see sidebar), many JSON "parsers" were published, but none of these
with the notable exception of the Big Nerd Ranch's Freddy were actual parsers, they all just transformed the
arrays and dictionaries returned by NSJSONSerialization into Swift objects. Performance was
abysmal, with around 25x overhead in addition to the basic NSJSONSerialization parse.
Apple's Swift Codable promised to solve all that, and on the convenience front it certainly does
a great job.
func readJSONCoder(data:Data) -> [TestClass] {
    NSLog("Swift Decoding")
    let coder=JSONDecoder( )
    let array=try! coder.decode([TestClass].self, from: data)
    return array
}
(All the forcing is because this is just test code; please don't do this in production!). Alas, performance is
still not great: 4.39 seconds, or 10 MB/s. That's 10x slower than the basic NSJSONSerialization
parse and 4x slower than our slow but simple complete parse via NSJSONSerialization and KVC.
However, it is significantly faster than the previous third-party JSON to Swift objects "parsers", to
the tune of 3-4x. This is the old "first mark up 400% then discount 50%" sales trick applied to performance,
except that the relative numbers are larger.
Third Party JSON Parsers
I looked a little at third party JSON parsers, particularly JASON, STJSON and ZippyJSON.
STJSON does not make any claims to speed and manages to clock in at 5 seconds, or just under 10 MB/s. JASON bills
itself as a "faster" JSON parser (they compare to SwiftyJSON), and does reasonably well at 0.75 seconds or 59 MB/s.
However both of these parse to their own internal representation, not to domain objects (or structs), and so should
be compared to NSJSONSerialization, at which point they both disappoint.
Probably the most interesting of these is ZippyJSON, as it uses Daniel Lemire's simdjson and is Codable
compatible. Alas, I couldn't get ZippyJSON to compile, so I don't
have numbers, but I will keep trying. They claim around 3x faster than Apple's JSONDecoder, which
would make it the only parser to be at least in the same ballpark as the trivial NSJSONSerialization + KVC method I showed above.
Another interesting tidbit comes from ZippyJSON's README, under the heading "Why is it so much faster".
Apple's version first converts the JSON into an NSDictionary using NSJSONSerialization and then afterwards makes things Swifty. The creation of that intermediate dictionary is expensive.
This is true by itself: first converting to an intermediate representation is slow, particularly one
that's as heavy-weight as property lists. However, it cannot be the primary reason, because creating that
expensive representation only takes 1/8th of the total running time. The other 7/8ths is Codable apparently
talking to itself. And speaking very s-l-o-w-l-y while doing that.
To corroborate, I also tried the Flight-School implementation of Codable for MessagePack, which obviously does not use NSJSONSerialization.
It makes no performance claims and takes 18 seconds to decode the same
objects we used in the JSON files, of course with a different file that's 34 MB in size. Normalized to our 44 MB
file that would be 2.4 MB/s.
MAX and MASON
So where does that leave us? Considering what simdjson shows is theoretically possible with JSON parsing, we are
not in a good place, to put it mildly. 2.5 GB/s vs. 10 MB/s with Apple's JSONDecoder, several times slower than
NSJSONSerialization, which isn't exactly a speed daemon and around 30x slower than pure object creation. Comically bad might be another way of putting it. At least we're being entertained.
What can I contribute? Well, I've been through most of this once before with XML and the result was/is
MAX (Messaging API for XML), a parser that is not just super-fast itself (though no SIMD), but also
presents APIs that make it both super-convenient and also super-fast to go directly from the XML to
an object-representation, either as a tree or a stream of domain objects while using mostly constant
memory. Have I mentioned my book? Yeah, it's in the book, in gory detail.
Anyway, XML has sorta faded, so the question was whether the same techniques would work for a JSON parser.
The answer is yes, roughly, though with some added complexity and less convenience because JSON is a
less informative file format than XML. Open- and close-tags really give you a good heads-up as to what's
coming that "{" just does not.
The goal will be to produce domain objects at as close to the theoretical maximum of slightly more than 300 MB/s
as possible, while at the same time making the parser convenient to use, close to Swift Codable in convenience.
It won't support Codable per default, as the overheads seem to be too high, but ZippyJSON suggests that an
adapter wouldn't be too hard.
That parser is MPWMASONParser,
and no, it isn't done yet. In its initial state, it parses JSON to dictionaries in 0.58 seconds, or 76 MB/s and
slightly slower than NSJSONSerialization.
So we have a bit of way to go, come join me on this little parsing performance journey!
One of the goals I am aiming for in Objective-Smalltalk is instant builds and
effective live programming.
A month ago, I got a package from an old school friend: my old Apple ][+, which I thought I had given as a gift, but he insisted had been a long-term loan. That machine featured 48KB of DRAM and a 1 MHz, 8 bit 6502 processor that took multiple
cycles for even the simplest instructions, had no multiply instructions and almost no registers. Yet, when I turn it on it becomes interactive faster than the CRT warms up, and the programming experience remains fully interactive after that. I type something in, it executes. I change the program, type "RUN" and off it goes.
Of course, you can also get that experience with more complex systems, Smalltalk comes to mind, but the point is that
it doesn't take the most advanced technology or heroic effort to make systems interactive, what it takes is making it a priority.
Didn't the build time continuous increase over the year? Build time at my work jump 2.5x to almost an hour in 3 years (Granted, it's a 2014 Mac mini, but still) Even a iMac Pro takes 8 minutes now 🤦♂️
Now Swift is only one example of this, it's a current trend, and of course these systems do claim that they
provide benefits that are worth the wait. From optimizations to static type-checking with type-inference,
so that "once it compiles, it works". This is deemed to be (a) 100% worthwhile despite the fact that there
is no scientific evidence backing up these claims (a paper which claimed that it had the evidence was just
shredded at this year's OOPSLA) and (b) essentially cost-free. But of course it isn't cost free:
Minimum Viable Program:
"A running program, even if not correct, feels closer to working than a program that doesn't run at all"
So when everyone zigs, I zag, it's my contrarian nature. Where Swift's message was, essentially "there is
too much Smalltalk in Objective-C", my contention is that there is too little Smalltalk
in Objective-C (and also that there is too little "Objective" in Smalltalk, but that's a different
topic).
Smalltalk was perfectly interactive in its own environment on high end late 70s and early 80s
hardware. With today's monsters of computation, there is no good reason, or excuse
for that matter, to not be interactive
even when taken into the slightly more demanding Unix/macOS/iOS development
world. That doesn't mean there aren't loads of reasons, they're just not any good.
So Objective-Smalltalk will be fast, it will be live or near-live at all times,
and it will have instant builds. This isn't going to be rocket science, mostly, the ingredients are as follows:
An interpreter
Late binding
Separate compilation
A fast and simple native compiler
Let's look at these in detail.
An interpreter
The basic implementation of Objective-Smalltalk is an AST-walking interpreter. No JIT, not even a
simple bytecode interpreter. That is about as
pessimal as possible, but our machines are so incredibly fast, and a lot of our tasks either simple enough or dominated by computational steering, that it actually does a decent enough job
for many of those tasks. (For more on this dynamic, see The Death of Optimizing Compilers by
Daniel J. Bernstein)
And because it is just an interpreter, it has no problems doing its thing on iOS:
(Yes, this is in the simulator, but it works the same on an actual device)
Late Binding
Late binding nicely decouples the parts of our software. This means that the compiler has very little
information about what happens and can't help a lot in terms of optimization or checking, something
that always drove the compiler folks a little nuts ("but we want to help and there's so much we could
do"). It enables strong modularity and separate compilation.
Objective-Smalltalk is as late-bound in its messaging as Objective-C or Smalltalk are, but goes beyond
them by also late-binding identifiers, storage and dataflow with Polymorphic Identifiers (ACM, pdf), Storage
Combinators (ACM, pdf) and Polymorphic Write Streams (ACM, pdf).
Allowing this level of flexibility while still not requiring a Graal-level Helden-JIT to burn
away all the abstractions at runtime will require careful design of the meta-level boundaries,
but I think the technically desirable boundaries align very well with the conceptually desirable
boundaries: use meta-level facilities to define the language you want to program in, then write
your program.
It's not making these boundaries clear and freely mixing meta-level and base-level programming
that gets us in not just conceptual trouble, but also into the kinds of technical trouble
that the Heldencompilers and Helden-JITs have to bail us out of.
Separate Compilation
When you have good module boundaries, you can get separate compilation, meaning a change in file
(or other code-containing entity if you don't like files) does not require changes to other files.
Smalltalk had this. Unix-style C programming had this, and the concept of binary libraries (with
the generalization to frameworks on macOS etc.). For some reason, this has taken more and more
of a back-seat in macOS and iOS development, with full source inclusion and full builds becoming
the norm in the community (see CocoaPods) and for a long time being enforced by Apple by not
allowing user-defined dynamic libraries on iOS.
While Swift allows separate compilation, this can have such severe negative effects on both performance
and compile times that compiling everything on any change has become a "best practice". In fact, we
now have a build option "whole module optimization with optimizations turned off" for debugging. I
kid you not.
Objective-Smalltalk is designed to enable "Framework-oriented-programming", so separate compilation
is and will remain a top priority.
A fast and simple native compiler
However, even with an interpreter for interactive adjustments, separate compilation due to
good modularity and late binding, you sometimes want to do a full build, or need to rebuild
a large part of the codebase.
Even that shouldn't take forever, and in fact it doesn't need to. I am totally with Jonathan
Blow on this subject when he says that compiling a medium size project shouldn't really take more
than a second or so.
My current approach for getting there is using TinyCC's backend as the starting point for Objective-Smalltalk's backend. After all, the semantics are (mostly) Objective-C and Objective-C's semantics are just C. What I really like about tcc is that it goes so brutally directly to outputting
CPU opcodes as binary bytes.
No layers of malloc()ed intermediate representations here! This aligns very nicely with
the streaming/messaging approach to high-performance I've taken elsewhere with
Polymorphic Write Streams (see above), so I am pretty confident I can make this (a) work
and (b) simple/elegant while keeping it (c) fast.
How fast? I obviously don't know yet, but tcc is a fantastic starting point. The following is the current (=wrong) ObjectiveTcc code to drive tcc to build a function that sends a single message:
How often can I do this in one second? On my 2018 high spec but 13" MBP: 300,000 times.
That includes in-memory linking (though not much of that is happening in this example), but not Mach-O generation, as that's not implemented yet, or writing the whole shebang to disk. I don't
anticipate either of these taking appreciable additional time.
If we consider this 2 "lines" of code, one for the function/method header and one for the message, then we can generate binary for 600KLOC/s.
So having a medium size program compile and link in about a second or so seems eminently doable,
even if I manage to slow the raw Tcc performance down by about an order of magnitude.
(For comparison: the Swift code base that motivated the Rome caching system for Carthage was
clocking in at around 60 lines per second with the then Swift compiler. So even with an
anticipated order of magnitude slowdown we'd still be 1000x faster. 1000x is good enough,
it's the difference between 3 seconds and an hour.)
What's the downside? Tcc doesn't do a lot of optimization. But that's OK as (a) the
sorts of optimizations C compilers and backends like LLVM do aren't much use for
highly polymorphic and late-bound code and (b) the basics get you around 80% of the
way (c) most code doesn't need that much optimization (see above) and (d) machines
have become really fast.
And it helps that we aren't doing crazy things like initially allocating function-local
variables on the heap or doing function argument copying via vtables that
require leaning on the optimizer to get adequate performance (as in: not 100x slower..).
Defense in Depth
While any of these techniques might be adequate some of the time, it's the combination
that I think will make the Objective-Smalltalk tooling a refreshing, pleasant and
highly productive alternative to existing toolchains, because it will be
reliably fast under all circumstances.
And it doesn't really take (much) rocket science, just a willingness to make this
aspect a priority.