
Friday, April 24, 2020

Faster JSON Support for iOS/macOS, Part 7: Polishing the Parser

A convenient setback

One thing that you may have noticed last time around was that we were getting the instance variable names from the class, but then also still setting the common keys manually. That's a bit of duplicated and needlessly manual effort, because the common keys are exactly those ivar names.

However, the two pieces of information live in different places: the ivar names in the builder and the common strings in the parser itself. One way of consolidating this information is by creating a convenience initializer for decoding to objects as follows:



-initWithClass:(Class)classToDecode
{
    self = [self initWithBuilder:[[[MPWObjectBuilder alloc] initWithClass:classToDecode] autorelease]];
    [self setFrequentStrings:(NSArray*)[[[classToDecode ivarNames] collect] substringFromIndex:1]];
    return self;
}

We still compute the ivar names twice, but that's not really such a big deal, so it's something we can fix later. The same goes for the issue that we should probably be using property names instead of instance variable names, which in the case of properties we have to post-process to get rid of the underscores added by ivar synthesis.
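For reference, the Objective-C runtime can hand us the declared property names directly, which would sidestep the underscore post-processing entirely. A sketch (my illustration, not what the code currently does):


#import <Foundation/Foundation.h>
#import <objc/runtime.h>
#include <stdlib.h>

// Sketch: collect the declared property names of a class via the Objective-C runtime.
// Property names, unlike synthesized ivar names, don't carry the "_" prefix.
static NSArray *propertyNamesForClass(Class cls)
{
    unsigned int count = 0;
    objc_property_t *properties = class_copyPropertyList(cls, &count);
    NSMutableArray *names = [NSMutableArray arrayWithCapacity:count];
    for (unsigned int i = 0; i < count; i++) {
        [names addObject:[NSString stringWithUTF8String:property_getName(properties[i])]];
    }
    free(properties);
    return names;
}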

With that, the code to parse to objects simplifies to the following, which is very similar to what you would see in Swift with JSONDecoder.


-(void)decodeMPWDirect:(NSData*)json
{
    MPWMASONParser *parser=[[MPWMASONParser alloc] initWithClass:[TestClass class]];
    NSArray* objResult = [parser parsedData:json];
}

So, quickly verifying that performance is still the same (always do this!) and...oops! Performance dropped significantly, from 441ms to over 700ms. How could such an innocuous change lead to a 50% performance regression?

The profile shows that we are now spending significantly more time in MPWSmallStringTable's objectForKey: method, where it gets the bytes out of the NSString/CFString, but why that should be the case is a bit mysterious, since we changed virtually nothing.

A little further sleuthing revealed that the strings in question are now instances of NSTaggedPointerString, where previously they were instances of __NSCFConstantString. The latter has a pointer to its byte-oriented character representation, which it can simply return, while the former cleverly encodes the characters in the pointer itself, so it first has to reconstruct that byte representation. The machinery for constructing that representation, and for computing its size, also appears to be fairly generic and slow, going through a stream.
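You can see the difference with a trivial experiment (my example; the exact class names are private implementation details of Apple's frameworks and can vary between OS versions):


#import <Foundation/Foundation.h>

int main(void)
{
    @autoreleasepool {
        // A string literal is baked into the binary as a constant string object.
        NSString *constant = @"count";
        // The same short ASCII string created at runtime typically comes back
        // as a tagged pointer on 64-bit Apple platforms, with the characters
        // encoded in the pointer bits themselves.
        NSString *runtime = [NSString stringWithUTF8String:"count"];
        NSLog(@"%@ vs. %@", [constant class], [runtime class]);
        // Expected (but not guaranteed): __NSCFConstantString vs. NSTaggedPointerString
    }
    return 0;
}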

This isn't really easy to solve, since the creation of NSTaggedPointerString instances is hardwired pretty deep in CoreFoundation, with no way to disable this "optimization". It would be possible to create a new NSString subclass with a byte buffer and make sure to convert to that class before putting instances in the lookup table, but that seems like a lot of work. Or we could just revert this convenience.

Damn the torpedoes and full speed ahead!

Alternatively, we really wanted to get rid of this whole process of packing character data into NSString instances just to immediately unpack them again, so let's leave the regression as is and do that instead.

Where previously the builder had an NSString *key instance variable, it now has a char *keyStr and an int keyLen. The string-handling case in the JSON parser is now split between the key and the non-key case, with the non-key case still doing the conversion, but the key case directly sending the char* and length to the builder.


			case '"':
                parsestring( curptr , endptr, &stringstart, &curptr  );
				if ( curptr[1] == ':' ) {
                    [_builder writeKeyString:stringstart length:curptr-stringstart];
					curptr++;
					
				} else {
                    curstr = [self makeRetainedJSONStringStart:stringstart length:curptr-stringstart];
					[_builder writeString:curstr];
				}
                curptr++;
				break;

This means that, at least temporarily, JSON escape handling is disabled for keys. It's straightforward to add back, because makeRetainedJSONStringStart:length: does all its processing in a character buffer anyway, only converting to a string object at the very end.
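
One cheap way to get it back without giving up the char*/length fast path might look like the following sketch (my illustration, not the current code; both unescapeJSONKey() and keyBuffer are assumed helpers that don't exist in the project):


// Sketch: only take a slow path when a key actually contains a backslash.
// keyBuffer would have to be storage that outlives this case, e.g. an instance
// variable of the parser, because the builder holds on to the pointer until the
// corresponding value arrives. unescapeJSONKey() is a hypothetical helper whose
// guts could be factored out of makeRetainedJSONStringStart:length:.
long rawLen = curptr - stringstart;
if ( memchr( stringstart, '\\', rawLen ) ) {
    long cookedLen = unescapeJSONKey( stringstart, rawLen, keyBuffer, sizeof keyBuffer );
    [_builder writeKeyString:keyBuffer length:cookedLen];
} else {
    [_builder writeKeyString:stringstart length:rawLen];
}

The builder side is unaffected either way; its writeString: method just checks whether a key is pending: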


-(void)writeString:(NSString*)aString
{
    if ( keyStr ) {
        MPWValueAccessor *accesssor=OBJECTFORSTRINGLENGTH(self.accessorTable, keyStr, keyLen);
        [accesssor setValue:aString forTarget:*tos];
        keyStr=NULL;
    } else {
        [self pushObject:aString];
    }
}

If there is a key, we are in a dictionary, otherwise an array (or top-level). In the dictionary case, we can now fetch the ValueAccessor via the OBJECTFORSTRINGLENGTH() macro.

The results are encouraging: 299ms, or 147 MB/s.

The MPWPlistBuilder also needs to be adjusted: as it builds an NSDictionary and not an object, it actually needs the NSString key, but the parser no longer delivers those. So it just creates them on the fly:


-(NSString*)key
{
    NSString *key=nil;
    if ( keyStr) {
        if ( _commonStrings ) {
            key=OBJECTFORSTRINGLENGTH(_commonStrings, keyStr, keyLen);
        }
        if ( !key ) {
            key=[[[NSString alloc] initWithBytes:keyStr length:keyLen encoding:NSUTF8StringEncoding] autorelease];
        }
    }
    return key;
}

Surprisingly, this makes the dictionary parsing code slightly faster, bringing it up to par with NSJSONSerialization at 421ms.

Eliminating NSNumber

Our use of NSNumber/CFNumber values is very similar to our use of NSString for keys: the parser wraps the parsed number in an object, and the builder then unwraps it again.

Changing that, initially just for integers, is straightforward: add an integer-valued message to the builder protocol and implement it.


-(void)writeInteger:(long)number
{
    if ( keyStr ) {
        MPWValueAccessor *accesssor=OBJECTFORSTRINGLENGTH(_accessorTable, keyStr, keyLen);
        [accesssor setIntValue:number forTarget:*tos];
        keyStr=NULL;
    } else {
        [self pushObject:@(number)];
    }
}

The actual integer parsing code is not in MPWMASONParser but in its superclass, and as we don't want to touch that for now, let's just copy-paste that code, modifying it to return a C primitive type instead of an object.


-(long)longElementAtPtr:(const char*)start length:(long)len
{
    long val=0;
    int sign=1;
    const char *end=start+len;
    if ( start[0] =='-' ) {
        sign=-1;
        start++;
    } else if ( start[0]=='+' ) {
        start++;
    }
    while ( start < end && isdigit(*start)) {
        val=val*10+ (*start)-'0';
        start++;
    }
    val*=sign;
    return val;
}

I am sure there are better ways to turn a string into an int (a standard-library sketch follows below), but it will do for now. Similarly to the key/string distinction, we now special-case integers:

                if ( isReal) {
                    number = [self realElement:numstart length:curptr-numstart];

                    [_builder writeString:number];
                } else {
                    long n=[self longElementAtPtr:numstart length:curptr-numstart];
                    [_builder writeInteger:n];
                }

Again, not pretty, but we can clean it up later.
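
For the record, the standard library can already do this kind of conversion; something like the following sketch (my example, not the project's code) would work, at the cost of a copy to provide the NUL terminator that strtol() expects:


#include <stdlib.h>
#include <string.h>

// Sketch: convert a non-NUL-terminated digit span to a long via strtol().
static long longFromSpan( const char *start, long len )
{
    char buf[32];
    if ( len >= (long)sizeof buf ) {
        len = sizeof buf - 1;          // clamp; real code should report the error instead
    }
    memcpy( buf, start, len );
    buf[len] = '\0';
    return strtol( buf, NULL, 10 );
}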

Together with using direct instance variable access instead of properties to get to the accessorTable, this yields a very noticeable speed boost:

229 ms, or 195 MB/s.

Nice.

Discussion

What happened here? Just random hacking on the profile and replacing nice object-oriented programming with ugly but fast C?

Although there is obviously some truth in that (profiles were used, and more C primitive types appeared), I would contend that what happened was a move away from objects, and particularly away from generic and expensive Foundation objects ("Foundation oriented programming"?), towards message oriented programming.

As Alan Kay famously put it:

I'm sorry that I long ago coined the term "objects" for this topic because it gets many people to focus on the lesser idea.

The big idea is "messaging" -- that is what the kernal of Smalltalk/Squeak is all about (and it's something that was never quite completed in our Xerox PARC phase). The Japanese have a small word -- ma -- for "that which is in between" -- perhaps the nearest English equivalent is "interstitial". The key in making great and growable systems is much more to design how its modules communicate rather than what their internal properties and behaviors should be.

It turns out that message oriented programming (or should we call it Protocol Oriented Programming?) is where Objective-C shines: coarse-grained objects, implemented in C, that exchange messages, with the messages also as primitive as you can get away with. That was the idea, and when you follow that idea, Objective-C just hums: you get not just fast, but also flexible and architecturally nicely decoupled objects: elegance.

The combination of objects + primitive messages is very similar to another architecturally elegant and productive style: Unix pipes and filters. The components are in C and can have as rich an internal structure as you want, but they have to talk to each other via byte-streams. This can also be made very fast, and also prevents or at least reduces coupling between the components.

Another aspect is the tension between an API for use and an API for reuse, particularly within the constraints of call/return. When you get tasked with "Create a component + API for parsing JSON", something like NSJSONSerialization is something you almost have to come up with: feed it JSON, out comes parsed JSON. Nothing could be more convenient to use for "parsing JSON".

MPWMASONParser, on the other hand, is not convenient at all when viewed in isolation, but it's much more capable of being smoothly integrated into a larger processing chain. And most of the work that NSJSONSerialization did in the name of convenience is now just wasted: it doesn't make further processing any easier, but it does suck up enormous amounts of time.

Anyway, let's look at the current profile:

First, times are now small enough that high-resolution (100µs) sampling is necessary to get meaningful results. Second, the NSNumber/CFNumber and NSString packing and unpacking is gone, with an even bigger chunk of the remaining time now going to object creation. objc_msgSend() is starting to actually become noticeable, as is the (inefficient) character-level parsing. The accessors of our test objects start to appear, if barely.

With the work we've done so far, we've improved speed around 5x from where we started, and at 195 MB/s are almost 20x faster than Swift's JSONDecoder.

Can we do better? Stay tuned.

Note

I can help not just Apple, but also you and your company/team with performance and agile coaching, workshops and consulting. Contact me at info at metaobject.com.

TOC

Somewhat Less Lethargic JSON Support for iOS/macOS, Part 1: The Status Quo
Somewhat Less Lethargic JSON Support for iOS/macOS, Part 2: Analysis
Somewhat Less Lethargic JSON Support for iOS/macOS, Part 3: Dematerialization
Equally Lethargic JSON Support for iOS/macOS, Part 4: Our Keys are Small but Legion
Less Lethargic JSON Support for iOS/macOS, Part 5: Cutting out the Middleman
Somewhat Faster JSON Support for iOS/macOS, Part 6: Cutting KVC out of the Loop
Faster JSON Support for iOS/macOS, Part 7: Polishing the Parser
Faster JSON Support for iOS/macOS, Part 8: Dematerialize All the Things!
Beyond Faster JSON Support for iOS/macOS, Part 9: CSV and SQLite

Thursday, November 7, 2019

Instant Builds

One of the goals I am aiming for in Objective-Smalltalk is instant builds and effective live programming.

A month ago, I got a package from an old school friend: my old Apple ][+, which I thought I had given as a gift, but he insisted had been a long-term loan. That machine featured 48KB of DRAM and a 1 MHz, 8 bit 6502 processor that took multiple cycles for even the simplest instructions, had no multiply instructions and almost no registers. Yet, when I turn it on it becomes interactive faster than the CRT warms up, and the programming experience remains fully interactive after that. I type something in, it executes. I change the program, type "RUN" and off it goes.

Of course, you can also get that experience with more complex systems (Smalltalk comes to mind), but the point is that it doesn't take the most advanced technology or heroic effort to make systems interactive; what it takes is making it a priority.


But here we are indeed.

Now Swift is only one example of this; it's a current trend, and of course these systems do claim to provide benefits that are worth the wait, from optimizations to static type-checking with type inference, so that "once it compiles, it works". This is deemed to be (a) 100% worthwhile, despite the fact that there is no scientific evidence backing up these claims (a paper which claimed to have such evidence was just shredded at this year's OOPSLA), and (b) essentially cost-free. But of course it isn't cost free:

So when everyone zigs, I zag; it's my contrarian nature. Where Swift's message was, essentially, "there is too much Smalltalk in Objective-C", my contention is that there is too little Smalltalk in Objective-C (and also that there is too little "Objective" in Smalltalk, but that's a different topic).

Smalltalk was perfectly interactive in its own environment on high-end late-70s and early-80s hardware. With today's monsters of computation, there is no good reason, or excuse for that matter, not to be interactive, even when taken into the slightly more demanding Unix/macOS/iOS development world. That doesn't mean there aren't loads of reasons; they're just not any good.

So Objective-Smalltalk will be fast, it will be live or near-live at all times, and it will have instant builds. This isn't going to be rocket science; mostly, the ingredients are as follows:

  1. An interpreter
  2. Late binding
  3. Separate compilation
  4. A fast and simple native compiler

Let's look at these in detail.

An interpreter

The basic implementation of Objective-Smalltalk is an AST-walking interpreter. No JIT, not even a simple bytecode interpreter. Which is about as pessimal as possible, but our machines are so incredibly fast, and a lot of our tasks are either simple enough or enough like computational steering, that it actually does a decent enough job for many of those tasks. (For more on this dynamic, see The Death of Optimizing Compilers by Daniel J. Bernstein)

And because it is just an interpreter, it has no problems doing its thing on iOS:

(Yes, this is in the simulator, but it works the same on an actual device)

Late Binding

Late binding nicely decouples the parts of our software. This means that the compiler has very little information about what happens and can't help a lot in terms of optimization or checking, something that always drove the compiler folks a little nuts ("but we want to help and there's so much we could do"). It enables strong modularity and separate compilation. Objective-Smalltalk is as late-bound in its messaging as Objective-C or Smalltalk are, but goes beyond them by also late-binding identifiers, storage and dataflow with Polymorphic Identifiers (ACM, pdf), Storage Combinators (ACM, pdf) and Polymorphic Write Streams (ACM, pdf).

Allowing this level of flexibility while still not requiring a Graal-level Helden-JIT to burn away all the abstractions at runtime will require careful design of the meta-level boundaries, but I think the technically desirable boundaries align very well with the conceptually desirable boundaries: use meta-level facilities to define the language you want to program in, then write your program.

It's not making these boundaries clear, and freely mixing meta-level and base-level programming, that gets us into not just conceptual trouble, but also into the kinds of technical trouble that the Heldencompilers and Helden-JITs have to bail us out of.

Separate Compilation

When you have good module boundaries, you can get separate compilation, meaning a change in one file (or other code-containing entity, if you don't like files) does not require recompiling other files. Smalltalk had this. Unix-style C programming had this, with the concept of binary libraries (and their generalization to frameworks on macOS etc.). For some reason, this has taken more and more of a back seat in macOS and iOS development, with full source inclusion and full builds becoming the norm in the community (see CocoaPods) and for a long time being enforced by Apple by not allowing user-defined dynamic libraries on iOS.

While Swift allows separate compilation, this can have such severe negative effects on both performance and compile times that compiling everything on any change has become a "best practice". In fact, we now have a build option "whole module optimization with optimizations turned off" for debugging. I kid you not.

Objective-Smalltalk is designed to enable "Framework-oriented-programming", so separate compilation is and will remain a top priority.

A fast and simple native compiler

However, even with an interpreter for interactive adjustments, and separate compilation thanks to good modularity and late binding, you sometimes want to do a full build, or need to rebuild a large part of the codebase.

Even that shouldn't take forever, and in fact it doesn't need to. I am totally with Jonathan Blow on this subject when he says that compiling a medium-size project shouldn't really take more than a second or so.

My current approach for getting there is using TinyCC's backend as the starting point of the backend for Objective-Smalltalk. After all, the semantics are (mostly) Objective-C, and Objective-C's semantics are just C. What I really like about tcc is that it goes so brutally directly to outputting CPU opcodes as binary bytes.


static void gcall_or_jmp(int is_jmp)
{
    int r;
    if ((vtop->r & (VT_VALMASK | VT_LVAL)) == VT_CONST &&
	((vtop->r & VT_SYM) && (vtop->c.i-4) == (int)(vtop->c.i-4))) {
        /* constant symbolic case -> simple relocation */
        greloca(cur_text_section, vtop->sym, ind + 1, R_X86_64_PLT32, (int)(vtop->c.i-4));
        oad(0xe8 + is_jmp, 0); /* call/jmp im */
    } else {
        /* otherwise, indirect call */
        r = TREG_R11;
        load(r, vtop);
        o(0x41); /* REX */
        o(0xff); /* call/jmp *r */
        o(0xd0 + REG_VALUE(r) + (is_jmp << 4));
    }
}

No layers of malloc()ed intermediate representations here! This aligns very nicely with the streaming/messaging approach to high-performance I've taken elsewhere with Polymorphic Write Streams (see above), so I am pretty confident I can make this (a) work and (b) simple/elegant while keeping it (c) fast.

How fast? I obviously don't know yet, but tcc is a fantastic starting point. The following is the current (=wrong) ObjectiveTcc code to drive tcc to build a function that sends a single message:


-(void)generateMessageSendTestFunctionWithName:(char*)name
{
    SEL flagMsg=@selector(setMsgFlag);
    [self functionOnlyWithName:name returnType:VT_INT argTypes:"" body:^{
        [self pushFunctionPointer:objc_msgSend];
        [self pushObject:self];
        [self pushPointer:flagMsg];
        [self call:2];
    }];
}

How often can I do this in one second? On my 2018 high-spec but 13" MBP: 300,000 times. That includes in-memory linking (though not much of that is happening in this example); it does not include Mach-O generation, as that's not implemented yet, or writing the whole shebang to disk. I don't anticipate either of these taking appreciably additional time.
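
Measuring a number like that can be as simple as counting how many iterations fit into one wall-clock second. A hypothetical harness (this is not the actual benchmark code, which isn't shown here):


#import <Foundation/Foundation.h>

// Count how many times a piece of work can be run in one second of wall-clock time.
static int iterationsPerSecond( void (^work)(void) )
{
    NSDate *start = [NSDate date];
    int count = 0;
    while ( -[start timeIntervalSinceNow] < 1.0 ) {
        work();
        count++;
    }
    return count;
}

// e.g.: iterationsPerSecond( ^{ [compiler generateMessageSendTestFunctionWithName:"testFunction"]; } );
// (a real harness would vary the function name and reset the compiler state per iteration)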

If we consider this 2 "lines" of code, one for the function/method header and one for the message, then we can generate binary for 600KLOC/s. So having a medium size program compile and link in about a second or so seems eminently doable, even if I manage to slow the raw Tcc performance down by about an order of magnitude.

(For comparison: the Swift code base that motivated the Rome caching system for Carthage was clocking in at around 60 lines per second with the then Swift compiler. So even with an anticipated order of magnitude slowdown we'd still be 1000x faster. 1000x is good enough, it's the difference between 3 seconds and an hour.)

What's the downside? Tcc doesn't do a lot of optimization. But that's OK, as (a) the sorts of optimizations C compilers and backends like LLVM do aren't much use for highly polymorphic and late-bound code, (b) the basics get you around 80% of the way, (c) most code doesn't need that much optimization (see above), and (d) machines have become really fast.

And it helps that we aren't doing crazy things like initially allocating function-local variables on the heap, or doing function argument copying via vtables, that require leaning on the optimizer to get adequate performance (as in: not 100x slower...).

Defense in Depth

While any of these techniques might be adequate some of the time, it's the combination that I think will make the Objective-Smalltalk tooling a refreshing, pleasant and highly productive alternative to existing toolchains, because it will be reliably fast under all circumstances.

And it doesn't really take (much) rocket science, just a willingness to make this aspect a priority.

Wednesday, July 4, 2018

A one word change to the C standard to make undefined behavior sane again

A lot has been written on C undefined behavior, some of it by myself and a lot more by people who know a lot more about compilers than I do. However, I now believe that a seemingly innocuous but far-reaching change to the standard has given permission for the current craziness, and I think undoing that change could be a start in rectifying the situation.

Proposal

In section 3.4.3, change the word "possible" back to "permissible", the way it was in C89.

Background

In all versions of the standard I have checked, section 3.4.3 defines the term "undefined behavior".
undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

So that seems pretty clear, the compiler can do whatever it wants. But wait, there is a second paragraph that clarifies:

Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

So it's not a free-for-all; in fact, it is pretty clear about what the compiler is and is not allowed to do, as there are essentially three options:
  1. It "ignores" the situation completely, so if the CPU hardware produces an overflow or underflow on an arithmetic operation, well that's what you get. If you write to a string constant, the compiler emits the write and either the string constant might get changed if there is no memory protection for string constants or you might get a segfault if there is.
  2. It "behaves in a manner characteristic of the environment". So no "demons flying out of your nose" nonsense, and no arbitrary transformations of programs. And whatever you do, you have to document it, though you are not required to print a diagnostic.
  3. It can terminate with an error message.
I would suggest that current behavior is not one of these three, and it's not in the range bounded by these three either. It is clearly outside that defined range of "permissible" undefined behavior.
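
To make that concrete with an example of my own (not from the standard): signed integer overflow is undefined, and a modern optimizing compiler will typically use that to delete the very check a programmer wrote to guard against it, which is neither ignoring the situation, nor documented behavior characteristic of the environment, nor terminating translation.


/* Intended as an overflow check. With optimization enabled, a compiler may
   assume x + 1 cannot overflow (signed overflow is undefined behavior) and
   fold this function to "return 1;", silently removing the check. */
int incremented_is_larger(int x)
{
    return x + 1 > x;
}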

But of course compiler writers have an out, because more recent versions of the standard changed the word "permissible", which clearly restricts what you are allowed to do, to "possible", which means this is just an illustration of what might happen.

So let's change the word back to "permissible".