Tuesday, April 14, 2020

Somewhat Less Lethargic JSON Support for iOS/macOS, Part 3: Dematerialization

In the previous instalments, we looked at and analysed the status quo for JSON parsing on Apple platforms in general and Swift in particular, and it wasn't all that promising: we know that parsing to an intermediate representation of Foundation plist types (dictionaries, arrays, strings, numbers) is one of the worst possible ideas, yet it is the fastest we have. We know that creating objects from JSON is, or at least should be, the slowest part of this, yet it is by far the fastest. And last, not least, we also know that key-value coding (KVC) is the slowest possible way to transfer values to those objects, yet Swift Coding somehow manages to be several times slower.

So either we're wrong about all of these things we know, always a distinct possibility, or there is something fishy going on. My vote is on the latter, and while figuring out exactly what fishy thing is going on would probably be a fascinating investigation for an Apple performance engineer, I prefer proof by creation:

Just make something that doesn't have these problems. In that case you not only know where the problem is, you also have a better alternative to use.

MASON

Without much further ado, here is the definition of the MPWMASONParser class:
@class MPWSmallStringTable;
@protocol MPWPlistStreaming;

@interface MPWMASONParser : MPWXmlAppleProplistReader {
	BOOL inDict;
	BOOL inArray;
	MPWSmallStringTable *commonStrings;
}

@property (nonatomic, strong) id <MPWPlistStreaming> builder;

-(void)setFrequentStrings:(NSArray*)strings;

@end

What it does is send messages of the MPWPlistStreaming protocol to its builder property. So it is a Message-oriented parser for JaSON, just like MAX is the Message-oriented API for XML.

The implementation history is also reflected in the fact that it is a subclass of MPWXmlAppleProplistReader, which itself is a subclass of MPWMAXParser. The core of the implementation is a loop that handles JSON syntax and sends one-way messages for the different elements to the builder. It looks very similar to loops in other simple parsers (and probably not at all like the crazy SIMD contortions of simdjson). When done, it returns whatever the builder constructed.


-parsedData:(NSData*)jsonData
{
	[self setData:jsonData];
	const char *curptr=[jsonData bytes];
	const char *endptr=curptr+[jsonData length];
	const char *stringstart=NULL;
	NSString *curstr=nil;
	while (curptr < endptr ) {
		switch (*curptr) {
			case '{':
				[_builder beginDictionary];
				inDict=YES;
				inArray=NO;
				curptr++;
				break;
			case '}':
				[_builder endDictionary];
				curptr++;
				break;
			case '[':
				[_builder beginArray];
				inDict=NO;
				inArray=YES;
				curptr++;
				break;
			case ']':
				[_builder endArray];
				curptr++;
				break;
			case '"':
                parsestring( curptr , endptr, &stringstart, &curptr  );
                curstr = [self makeRetainedJSONStringStart:stringstart length:curptr-stringstart];
				curptr++;
				if ( curptr < endptr && *curptr == ':' ) {
					[_builder writeKey:curstr];
					curptr++;
					
				} else {
					[_builder writeString:curstr];
				}
				break;
			case ',':
				curptr++;
				break;
			case '-':
			case '0':
			case '1':
			case '2':
			case '3':
			case '4':
			case '5':
			case '6':
			case '7':
			case '8':
			case '9':
			{
				BOOL isReal=NO;
				const char *numstart=curptr;
				id number=nil;
				if ( *curptr == '-' ) {
					curptr++;
				}
				while ( curptr < endptr && isdigit(*curptr) ) {
					curptr++;
				}
				if ( curptr < endptr && *curptr == '.' ) {
					curptr++;
					while ( curptr < endptr && isdigit(*curptr) ) {
						curptr++;
					}
					isReal=YES;
				}
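				if ( curptr < endptr && (*curptr=='e' || *curptr=='E') ) {
					curptr++;
					// An exponent may carry an explicit sign, e.g. 1e-5.
					if ( curptr < endptr && (*curptr=='+' || *curptr=='-') ) {
						curptr++;
					}
					while ( curptr < endptr && isdigit(*curptr) ) {
						curptr++;
					}
					isReal=YES;
				}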
                number = isReal ?
                            [self realElement:numstart length:curptr-numstart] :
                            [self integerElementAtPtr:numstart length:curptr-numstart];

				[_builder writeNumber:number];
				break;
			}
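			case 't':
				if ( (endptr-curptr) >=4  && !strncmp(curptr, "true", 4)) {
					curptr+=4;
					[_builder pushObject:true_value];
				} else {
					// Not "true": raise instead of looping forever on the same byte.
					[NSException raise:@"invalidcharacter" format:@"JSON invalid literal at %td",curptr-(char*)[data bytes]];
				}
				break;
			case 'f':
				if ( (endptr-curptr) >=5  && !strncmp(curptr, "false", 5)) {
					curptr+=5;
					[_builder pushObject:false_value];
				} else {
					// Not "false": raise instead of looping forever on the same byte.
					[NSException raise:@"invalidcharacter" format:@"JSON invalid literal at %td",curptr-(char*)[data bytes]];
				}
				break;
			case 'n':
				if ( (endptr-curptr) >=4  && !strncmp(curptr, "null", 4)) {
					[_builder pushObject:[NSNull null]];
					curptr+=4;
				} else {
					// Not "null": raise instead of looping forever on the same byte.
					[NSException raise:@"invalidcharacter" format:@"JSON invalid literal at %td",curptr-(char*)[data bytes]];
				}
				break;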
			case ' ':
			case '\t':
			case '\r':
			case '\n':
				while (curptr < endptr && isspace(*curptr)) {
					curptr++;
				}
				break;

			default:
				[NSException raise:@"invalidcharacter" format:@"JSON invalid character %x/'%c' at %td",*curptr,*curptr,curptr-(char*)[data bytes]];
				break;
		}
	}
    return [_builder result];

}

It almost certainly doesn't correctly handle all edge-cases, but doing so is unlikely to impact overall performance.

Dematerializing Property Lists with MPWPlistStreaming

Above, I mentioned that MASON is message-oriented, and that its main purpose is sending messages of the MPWPlistStreaming protocol to its builder. Here is that protocol:


@protocol MPWPlistStreaming

-(void)beginArray;
-(void)endArray;
-(void)beginDictionary;
-(void)endDictionary;
-(void)writeKey:aKey;
-(void)writeString:aString;
-(void)writeNumber:aNumber;
-(void)writeObject:anObject forKey:aKey;
-(void)pushContainer:anObject;
-(void)pushObject:anObject;

@end

What this enables is using property lists as an intermediate format without actually instantiating them, instead sending the messages we would have sent if we had a property list. Protocol Oriented Programming, anyone? Oh, I forgot, you can only do that in Swift...
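To make that concrete, here is a minimal sketch of a receiver that consumes the message stream without ever creating a dictionary or an array. DictCountingBuilder and StreamSample() are hypothetical names invented for this illustration and are not part of MPWFoundation. The messages are sent by hand, standing in for what a streaming parser would send for [{"a":1},{"b":2}].

```objc
#import <Foundation/Foundation.h>

// The protocol from the article, repeated so the sketch compiles on its own.
@protocol MPWPlistStreaming
-(void)beginArray;
-(void)endArray;
-(void)beginDictionary;
-(void)endDictionary;
-(void)writeKey:aKey;
-(void)writeString:aString;
-(void)writeNumber:aNumber;
-(void)writeObject:anObject forKey:aKey;
-(void)pushContainer:anObject;
-(void)pushObject:anObject;
@end

// Hypothetical builder: tallies what streams past, never builds a plist.
@interface DictCountingBuilder : NSObject <MPWPlistStreaming>
@property (nonatomic, assign) NSUInteger dictCount;
@property (nonatomic, strong) NSMutableSet *keys;
@end

@implementation DictCountingBuilder
-(instancetype)init
{
    if ( (self=[super init]) ) {
        self.keys=[NSMutableSet set];
    }
    return self;
}
-(void)beginArray {}
-(void)endArray {}
-(void)beginDictionary { self.dictCount++; }
-(void)endDictionary {}
-(void)writeKey:aKey { [self.keys addObject:aKey]; }
-(void)writeString:aString {}
-(void)writeNumber:aNumber {}
-(void)writeObject:anObject forKey:aKey { [self.keys addObject:aKey]; }
-(void)pushContainer:anObject {}
-(void)pushObject:anObject {}
@end

// Sends, by hand, the messages for [{"a":1},{"b":2}].
static void StreamSample(id <MPWPlistStreaming> builder)
{
    [builder beginArray];
    [builder beginDictionary];
    [builder writeKey:@"a"];
    [builder writeNumber:@1];
    [builder endDictionary];
    [builder beginDictionary];
    [builder writeKey:@"b"];
    [builder writeNumber:@2];
    [builder endDictionary];
    [builder endArray];
}
```

After StreamSample() has run against a fresh DictCountingBuilder, dictCount is 2 and keys holds "a" and "b", yet no dictionary or array was ever instantiated.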

The same protocol can also be used on the output side; then you get something like Standard Object Out.
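A rough sketch of that output direction, assuming nothing beyond the protocol shown above: a receiver that turns the same messages into JSON text. JSONTextWriter and WriteSample() are hypothetical names made up for this illustration. The class does no string escaping, leaves pushContainer: as a no-op, and its treatment of pushObject: is a guess at the intended semantics.

```objc
#import <Foundation/Foundation.h>

// The protocol from the article, repeated so the sketch compiles on its own.
@protocol MPWPlistStreaming
-(void)beginArray;
-(void)endArray;
-(void)beginDictionary;
-(void)endDictionary;
-(void)writeKey:aKey;
-(void)writeString:aString;
-(void)writeNumber:aNumber;
-(void)writeObject:anObject forKey:aKey;
-(void)pushContainer:anObject;
-(void)pushObject:anObject;
@end

// Hypothetical writer: appends JSON text for every message it receives.
@interface JSONTextWriter : NSObject <MPWPlistStreaming>
@property (nonatomic, strong) NSMutableString *output;
@end

@implementation JSONTextWriter {
    BOOL needsComma;   // YES once a value has been written at the current level
}
-(instancetype)init
{
    if ( (self=[super init]) ) {
        self.output=[NSMutableString string];
    }
    return self;
}
// Emits a comma before every element except the first one in a container.
-(void)separate
{
    if ( needsComma ) {
        [self.output appendString:@","];
    }
    needsComma=YES;
}
-(void)beginArray { [self separate]; [self.output appendString:@"["]; needsComma=NO; }
-(void)endArray { [self.output appendString:@"]"]; needsComma=YES; }
-(void)beginDictionary { [self separate]; [self.output appendString:@"{"]; needsComma=NO; }
-(void)endDictionary { [self.output appendString:@"}"]; needsComma=YES; }
-(void)writeKey:aKey { [self separate]; [self.output appendFormat:@"\"%@\":",aKey]; needsComma=NO; }
-(void)writeString:aString { [self separate]; [self.output appendFormat:@"\"%@\"",aString]; }
-(void)writeNumber:aNumber { [self separate]; [self.output appendFormat:@"%@",aNumber]; }
-(void)writeObject:anObject forKey:aKey { [self writeKey:aKey]; [self pushObject:anObject]; }
-(void)pushContainer:anObject {}
-(void)pushObject:anObject
{
    [self separate];
    // Assumption: only NSNull gets special treatment; booleans are not handled.
    [self.output appendString:(anObject == [NSNull null]) ? @"null" : [anObject description]];
}
@end

// Writes {"name":"MASON","tags":["json","fast"]} by sending messages by hand.
static NSString *WriteSample(void)
{
    JSONTextWriter *writer=[JSONTextWriter new];
    [writer beginDictionary];
    [writer writeKey:@"name"];
    [writer writeString:@"MASON"];
    [writer writeKey:@"tags"];
    [writer beginArray];
    [writer writeString:@"json"];
    [writer writeString:@"fast"];
    [writer endArray];
    [writer endDictionary];
    return writer.output;
}
```

Because the writer and a plist-building receiver answer the same messages, whatever produces the messages does not need to know which of the two it is talking to.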

Trying it out

By default, MPWMASONParser sets its builder to an instance of MPWPlistBuilder, which, as the name hints, builds property lists. Just like NSJSONSerialization.
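What a plist-building receiver has to do can be sketched with a stack of open containers. This is a hypothetical stand-in written for illustration, not the actual MPWPlistBuilder source, and the roles given to pushContainer: and pushObject: here are assumptions. SketchPlistBuilder and BuildSample() are made-up names.

```objc
#import <Foundation/Foundation.h>

// The protocol from the article, repeated so the sketch compiles on its own.
@protocol MPWPlistStreaming
-(void)beginArray;
-(void)endArray;
-(void)beginDictionary;
-(void)endDictionary;
-(void)writeKey:aKey;
-(void)writeString:aString;
-(void)writeNumber:aNumber;
-(void)writeObject:anObject forKey:aKey;
-(void)pushContainer:anObject;
-(void)pushObject:anObject;
@end

// Hypothetical stand-in for a plist builder, NOT the real MPWPlistBuilder.
@interface SketchPlistBuilder : NSObject <MPWPlistStreaming>
@property (nonatomic, strong) id result;              // top-level object
@property (nonatomic, strong) NSMutableArray *stack;  // open containers, innermost last
@property (nonatomic, strong) NSString *pendingKey;   // key waiting for its value
@end

@implementation SketchPlistBuilder
-(instancetype)init
{
    if ( (self=[super init]) ) {
        self.stack=[NSMutableArray array];
    }
    return self;
}
// Adds a finished value to the innermost open container, or makes it the result.
-(void)pushObject:anObject
{
    id top=[self.stack lastObject];
    if ( !top ) {
        self.result=anObject;
    } else if ( [top isKindOfClass:[NSDictionary class]] ) {
        NSMutableDictionary *dict=(NSMutableDictionary*)top;
        dict[self.pendingKey]=anObject;
        self.pendingKey=nil;
    } else {
        NSMutableArray *array=(NSMutableArray*)top;
        [array addObject:anObject];
    }
}
// A container is a value of its parent and becomes the new innermost container.
-(void)pushContainer:anObject
{
    [self pushObject:anObject];
    [self.stack addObject:anObject];
}
-(void)beginArray { [self pushContainer:[NSMutableArray array]]; }
-(void)endArray { [self.stack removeLastObject]; }
-(void)beginDictionary { [self pushContainer:[NSMutableDictionary dictionary]]; }
-(void)endDictionary { [self.stack removeLastObject]; }
-(void)writeKey:aKey { self.pendingKey=aKey; }
-(void)writeString:aString { [self pushObject:aString]; }
-(void)writeNumber:aNumber { [self pushObject:aNumber]; }
-(void)writeObject:anObject forKey:aKey { [self writeKey:aKey]; [self pushObject:anObject]; }
@end

// Builds the plist for [{"a":1},{"b":"x"}] from hand-sent messages.
static id BuildSample(void)
{
    SketchPlistBuilder *builder=[SketchPlistBuilder new];
    [builder beginArray];
    [builder beginDictionary];
    [builder writeKey:@"a"];
    [builder writeNumber:@1];
    [builder endDictionary];
    [builder beginDictionary];
    [builder writeKey:@"b"];
    [builder writeString:@"x"];
    [builder endDictionary];
    [builder endArray];
    return builder.result;
}
```

Every value costs at least one insertion into a mutable Foundation container here, which is the kind of work a builder that really materializes the plist cannot avoid.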

So let's give it a whirl:


-(void)decodeMPWDicts:(NSData*)json
{
    MPWMASONParser *parser=[MPWMASONParser parser];
    NSArray* plistResult = [parser parsedData:json];
    NSLog(@"MPWMASON %@ with %lu dicts",[plistResult firstObject],(unsigned long)[plistResult count]);
}

And the time is, drumroll, ... 0.621 seconds.

Hmm...that's disappointing. We didn't do anything wrong, yet we are almost 50% slower than NSJSONSerialization. Well, those dang Apple engineers do know what they're doing after all, and we should probably just give up.

Well, not so fast. Let's at least check out what we did wrong. Unleash the Cracken...er...Instruments!

So that's interesting: the vast majority of time is actually spent in Apple code building the plist. And we have to build the plist. So how does NSJSONSerialization get the same job done faster? Last I checked (that was with NSPropertyListSerialization, but close enough), they actually use specialised CoreFoundation-based dictionaries that are optimized for the case of having a lot of string keys and having them all in one place during initialization. These are not exposed, CoreFoundation being C-based means that non-exposure is very effective, and apparently Apple stopped open-sourcing CFLite a while ago.

So how can we do better? Tune in for the next exciting instalment :-)
