Showing posts with label performance. Show all posts
Showing posts with label performance. Show all posts

Friday, April 10, 2020

Somewhat Less Lethargic JSON Support for iOS/macOS, Part 1: The Status Quo

I just finished watching Daniel Lemire's talk on the current iteration of simdjson, a JSON parser that clocks in at 2.5GB/s! I've been following Daniel's work for some time now and can't really recommend it highly enough.

This reminded me of a recent twitter conversation where I had offered to contribute a fast, Swift-compatible JSON parser loosely based on MAX, my fast and convenient XML parser. Due to various factors most of which are not under my control, I can't really offer anything that's fast when compared to simdjson, but I can manage something quite a bit less lethargic than what's currently on offer in the Apple and particularly the Swift world.

Environmental assumptions and constraints

My first assumption is that we are going to operate in the Apple ecosystem, and for simplicity's sake I am going to use macOS. Next, I will assume that what we want from our parse(r) are domain objects for further processing within our application (or structs, the difference is not important in this context).

We are going to use the following class with a mix of integer and string instance variables, in Swift:


@objc class TestClass: NSObject, Codable {
    let hi:Int
    let there:Int
    let comment:String
...
}

and the same in Objective-C:


@interface TestClass : NSObject

@property (nonatomic) long hi,there;
@property (nonatomic,strong) NSString *comment;

@end

To make it all easy to measure, we are going to use one million objects, which we are going to initialise with increasing integers and the constant string "comment". This yields the same 44MB JSON file with different serialisation methods, which can be correctly parsed by all the parsers tested. This is obviously a very simple class an file structure, but I think it gives a reasonable approximation for real-world use.

The first thing to check is how quickly we can create these objects straight in code, without any parsing.

That should give us a good upper bound for the performance we can achieve when parsing to domain objects.


#define COUNT 1000000
-(void)createObjects
{
    NSMutableArray *objResult=[NSMutableArray arrayWithCapacity:COUNT+20];
    for ( int i=0;i<COUNT;i++ ) {
        TestClass *cur=[TestClass new];
        cur.hi=i;
        cur.there=i;
        cur.comment=@"comment";
        [objResult addObject:cur];
    }
    NSLog(@"Created objects in code w/o parsing %@ with %ld objects",objResult[0],[objResult count]);
}

On my Quad Core, 2.7Ghz MBP '18, this runs in 0.141 seconds. Although we aren't actually parsing, it would mean that just creating all the objects that would result from parsingg our 44MB JSON file would yield a rate of 312 MB/s.

Wait a second! 312MB/s is almost 10x slower than Daniel Lemire's parser, the one that actually parses JSON, and we are only creating the objects that would result if we were parsing, without doing any actual parsing.

This is one of the many unintuitive aspects of parsing performance: the actual low-level, character-level parsing is generally the least important part for overall performance. Unless you do something crazy like use NSScanner. Don't use NSScanner. Please.

One reason this is unintuitive is that we all learned that performance is dominated by the innermost loop, and character level processing is the innermost loop. But when you have magnitudes in performance differences and inner and outer loops differ by less than that amount, the stuff happennnig in the outer loop can dominate.

NSJSONSerialization

Apple's JSON story very much revolves around NSJSONSerialization, very much like most of the rest of its serialization story revolves around the very similar NSPropertyListSerialization class. It has a reasonable quick implementation, turning the 44 MB JSON file into an NSArrray of NSDictionary instances in 0.421 seconds when called from Objective-C, for a rate of 105 MB/s. From Swift, it takes 0.562 seconds, for 78 MB/s.

Of course, that gets us to a property list (array of dicts, in this case), not to the domain objects we actually want.

If you read my book (did I mention my book? Oh, I think I did), you will know that this type of dictonary representation is fairly expensive: expensive to create, expensive in terms of memory consumption and expensive to access. Just creating dictionaries equivalent to the objects we created before takes 0.321 seconds, so around 2.5x the time for creating the equivalent objects and a "rate" of 137 MB/s relative to our 44 MB JSON file.


-(void)createDicts
{
    NSMutableArray *objResult=[NSMutableArray arrayWithCapacity:COUNT+20];
    for ( int i=0;i<COUNT;i++ ) {
        NSMutableDictionary *cur=[NSMutableDictionary dictionary];
        cur[@"hi"]=@(i);
        cur[@"there"]=@(i);
        cur[@"comment"]=@"comment";
        [objResult addObject:cur];
    }
    NSLog(@"Created dicts in code w/o parsing %@ with %ld objects",objResult[0],[objResult count]);
}

Creating the dict in a single step using a dictionary literal is not significantly faster, but creating an immutable copy of the mutable dict after we're done filling brings the time to half a second.

Getting from dicts to objects is typically straightforward, if tedious: just fetch the entry of the dictionary and call the corresponding setter with the value thus retrieved from the dictionary. As this isn't production code and we're just trying to get some bounds of what is possible, there is an easier way: just use Key Value Coding with the keys found in the dictionary. The combined code, parsing and then creating the objects is shown below:


-(void)decodeNSJSONAndKVC:(NSData*)json
{
    NSArray *keys=@[ @"hi", @"there", @"comment"];
    NSArray *plistResult=[NSJSONSerialization JSONObjectWithData:json options:0 error:nil];
    NSMutableArray *objResult=[NSMutableArray arrayWithCapacity:plistResult.count+20];
    for ( NSDictionary *d in plistResult) {
        TestClass *cur=[TestClass new];
        for (NSString *key in keys) {
            [cur setValue:d[key] forKey:key];
        }
        [objResult addObject:cur];
    }
    NSLog(@"NSJSON+KVC %@ with %ld objects",objResult[0],[objResult count]);
}

Note that KVC is slow. Really slow. Order-of-magnitude slower than just sending messages kind of slow, and so it has significant impact on the total time, which comes to a total of 1.142 seconds including parsing and object creation, or just shy of 38 MB/s.

Swift JSON Coding

For the first couple of releases of Swift, JSON support by Apple was limited to a wrapped NSJSONSerialization, with the slight performance penalty already noted. As I write in my book (see sidebar), many JSON "parsers" were published, but none of these with the notable exception of the Big Nerd Ranch's Freddy were actual parses, they all just transformed the arrays and dictionaries returned by NSJSONSerialization into Swift objects. Performance was abysmal, with around 25x overhead in addition to the basic NSJSONSerialization parse.

Apple's Swift Codable promised to solve all that, and on the convenience front it certainly does a great job.


    func readJSONCoder(data:Data) -> [TestClass] {
        NSLog("Swift Decoding")
        let coder=JSONDecoder( )
        let array=try! coder.decode([TestClass].self, from: data)
        return array
    }

(All the forcing is because this is just test code, please don't do this in production!). Alas, performance is still not great: 4.39 seconds, or 10 MB/s. That's 10x slower than the basic NSJSONSerialization parse and 4x slower than our slow but simple complete parse via NSJSONSerialization and KVC.

However, it is significantly faster than the previous third-party JSON to Swift objects "parsers", to the tune of 3-4x. This is the old "first mark up 400% then discount 50%" sales trick applied to performance, except that the relative numbers are larger.

Third Party JSON Parsers

I looked a little at third party JSON parsers, particularly JASON, STJSON and ZippyJSON.

STTJSON does not make any claims to speed and manages to clock in at 5 seconds, or just under 10 MB/s. JASON bills itself as a "faster" JSON parser (they compare to SwiftyJSON), and does reasonably well at 0.75 seconds or 59 MB/s. However both of these parse to their own internal representation, not to domain objects (or structs), and so should be compared to NSJSONSerialization, at which point they both disappoint.

Probably the most interesting of these is ZippyJSON, as it uses Daniel Lemire's simdjson and is Codable compatible. Alas, I couldn't get ZippyJSON to compile, so I don't have numbers, but I will keep trying. They claim around 3x faster than Apple's JSONDecoder, which would make it the only parser to be at least in the same ballpark as the trivial NSJSONSerialization + KVC method I showed above.

Another interesting tidbit comes from ZippyJSON's README, under the heading "Why is it so much faster".

Apple's version first converts the JSON into an NSDictionary using NSJSONSerialization and then afterwards makes things Swifty. The creation of that intermediate dictionary is expensive.
This is true by itself: first converting to an intermediate representation is slow, particularly one that's as heavy-weight as property lists. However, it cannot be the primary reason, because creating that expensive representation only takes 1/8th of the total running time. The other 7/8ths is Codable apparently talking to itself. And speaking very s-l-o-w-l-y while doing that.

To corroborate, I also tried a the Flight-School implementation of Codable for MessagePack, which obviously does not use NSJSONSerialization. It makes no performance claims and takes 18 seconds to decode the same objects we used in the JSON files, of course with a different file that's 34 MB in size. Normalized to our 44 MB file that would be 2.4 MB/s.

MAX and MASON

So where does that leave us? Considering what simdjs shows is theoretically possible with JSON parsing, we are not in a good place, to put it mildly. 2.5 GB/s vs. 10 MB/s with Apple's JSONDecoder, several times slower than NSJSONSerialization, which isn't exactly a speed daemon and around 30x slower than pure object creation. Comically bad might be another way of putting it. At least we're being entertained.

What can I contribute? Well, I've been through most of this once before with XML and the result was/is MAX (Messaging API for XML), a parser that is not just super-fast itself (though no SIMD), but also presents APIs that make it both super-convenient and also super-fast to go directly from the XML to an object-representation, either as a tree or a stream of domain objects while using mostly constant memory. Have I mentioned my book? Yeah, it's in the book, in gory detail.

Anyway, XML has sorta faded, so the question was whether the same techniques would work for a JSON parser. The answer is yes, roughly, though with some added complexity and less convenience because JSON is a less informative file format than XML. Open- and close-tags really give you a good heads-up as to what's coming that "{" just does not.

The goal will be to produce domain objects at as close to the theoretical maximum of slightly more than 300 MB/s as possible, while at the same time making the parser convenient to use, close to Swift Codable in convenience. It won't support Codable per default, as the overheads seem to be too high, but ZippyJSON suggests that an adapter wouldn't be too hard.

That parser is MPWMASONParser, and no, it isn't done yet. In its initial state, it parses JSON to dictionaries in 0.58 seconds, or 76 MB/s and slightly slower than NSJSONSerialization.

So we have a bit of way to go, come join me on this little parsing performance journey!

TOC

Somewhat Less Lethargic JSON Support for iOS/macOS, Part 1: The Status Quo
Somewhat Less Lethargic JSON Support for iOS/macOS, Part 2: Analysis
Somewhat Less Lethargic JSON Support for iOS/macOS, Part 3: Dematerialization
Equally Lethargic JSON Support for iOS/macOS, Part 4: Our Keys are Small but Legion
Less Lethargic JSON Support for iOS/macOS, Part 5: Cutting out the Middleman
Somewhat Faster JSON Support for iOS/macOS, Part 6: Cutting KVC out of the Loop
Faster JSON Support for iOS/macOS, Part 7: Polishing the Parser
Faster JSON Support for iOS/macOS, Part 8: Dematerialize All the Things!
Beyond Faster JSON Support for iOS/macOS, Part 9: CSV and SQLite

Thursday, November 7, 2019

Instant Builds

One of the goals I am aiming for in Objective-Smalltalk is instant builds and effective live programming.

A month ago, I got a package from an old school friend: my old Apple ][+, which I thought I had given as a gift, but he insisted had been a long-term loan. That machine featured 48KB of DRAM and a 1 MHz, 8 bit 6502 processor that took multiple cycles for even the simplest instructions, had no multiply instructions and almost no registers. Yet, when I turn it on it becomes interactive faster than the CRT warms up, and the programming experience remains fully interactive after that. I type something in, it executes. I change the program, type "RUN" and off it goes.

Of course, you can also get that experience with more complex systems, Smalltalk comes to mind, but the point is that it doesn't take the most advanced technology or heroic effort to make systems interactive, what it takes is making it a priority.


But here we are indeed.

Now Swift is only one example of this, it's a current trend, and of course these systems do claim that they provide benefits that are worth the wait. From optimizations to static type-checking with type-inference, so that "once it compiles, it works". This is deemed to be (a) 100% worthwhile despite the fact that there is no scientific evidence backing up these claims (a paper which claimed that it had the evidence was just shredded at this year's OOPSLA) and (b) essentially cost-free. But of course it isn't cost free:

So when everyone zigs, I zag, it's my contrarian nature. Where Swift's message was, essentially "there is too much Smalltalk in Objective-C", my contention is that there is too little Smalltalk in Objective-C (and also that there is too little "Objective" in Smalltalk, but that's a different topic).

Smalltalk was perfectly interactive in its own environment on high end late 70s and early 80s hardware. With today's monsters of computation, there is no good reason, or excuse for that matter, to not be interactive even when taken into the slightly more demanding Unix/macOS/iOS development world. That doesn't mean there aren't loads of reasons, they're just not any good.

So Objective-Smalltalk will be fast, it will be live or near-live at all times, and it will have instant builds. This isn't going to be rocket science, mostly, the ingredients are as follows:

  1. An interpreter
  2. Late binding
  3. Separate compilation
  4. A fast and simple native compiler
Let's look at these in detail.

An interpreter

The basic implementation of Objective-Smalltalk is an AST-walking interpreter. No JIT, not even a simple bytecode interpreter. Which is about as pessimal as possible, but our machines are so incredibly fast, and a lot of our tasks simple enough or computational steering enough that it actually does a decent enough job for many of those tasks. (For more on this dynamic, see The Death of Optimizing Compilers by Daniel J. Bernstein)

And because it is just an interpreter, it has no problems doing its thing on iOS:

(Yes, this is in the simulator, but it works the same on an actual device)

Late Binding

Late binding nicely decouples the parts of our software. This means that the compiler has very little information about what happens and can't help a lot in terms of optimization or checking, something that always drove the compiler folks a little nuts ("but we want to help and there's so much we could do"). It enables strong modularity and separate compilation. Objective-Smalltalk is as late-bound in its messaging as Objective-C or Smalltalk are, but goes beyond them by also late-binding identifiers, storage and dataflow with Polymorphic Identifiers (ACM, pdf), Storage Combinators (ACM, pdf) and Polymorphic Write Streams (ACM, pdf).

Allowing this level of flexibility while still not requiring a Graal-level Helden-JIT to burn away all the abstractions at runtime will require careful design of the meta-level boundaries, but I think the technically desirable boundaries align very well with the conceptually desirable boundaries: use meta-level facilities to define the language you want to program in, then write your program.

It's not making these boundaries clear and freely mixing meta-level and base-level programming that gets us in not just conceptual trouble, but also into the kinds of technical trouble that the Heldencompilers and Helden-JITs have to bail us out of.

Separate Compilation

When you have good module boundaries, you can get separate compilation, meaning a change in file (or other code-containing entity if you don't like files) does not require changes to other files. Smalltalk had this. Unix-style C programming had this, and the concept of binary libraries (with the generalization to frameworks on macOS etc.). For some reason, this has taken more and more of a back-seat in macOS and iOS development, with full source inclusion and full builds becoming the norm in the community (see CocoaPods) and for a long time being enforced by Apple by not allowing user-define dynamic libraries on iOS.

While Swift allows separate compilation, this can have such severe negative effects on both performance and compile times that compiling everything on any change has become a "best practice". In fact, we now have a build option "whole module optimization with optimizations turned off" for debugging. I kid you not.

Objective-Smalltalk is designed to enable "Framework-oriented-programming", so separate compilation is and will remain a top priority.

A fast and simple native compiler

However, even with an interpreter for interactive adjustments, separate compilation due to good modularity and late binding, you sometimes want to do a full build, or need to rebuild a large part of the codebase.

Even that shouldn't take forever, and in fact it doesn't need to. I am totally with Jonathan Blow on this subject when he says that compiling a medium size project shouldn't really more than a second or so.

My current approach for getting there is using TinyCC's backend as the starting point of the backend for Objective-Smalltalk. After all, the semantics are (mostly) Objective-C and Objective-C's semantics are just C. What I really like about tcc is that it goes so brutally directly to outputting CPU opcode as binary bytes.


static void gcall_or_jmp(int is_jmp)
{
    int r;
    if ((vtop->r & (VT_VALMASK | VT_LVAL)) == VT_CONST &&
	((vtop->r & VT_SYM) && (vtop->c.i-4) == (int)(vtop->c.i-4))) {
        /* constant symbolic case -> simple relocation */
        greloca(cur_text_section, vtop->sym, ind + 1, R_X86_64_PLT32, (int)(vtop->c.i-4));
        oad(0xe8 + is_jmp, 0); /* call/jmp im */
    } else {
        /* otherwise, indirect call */
        r = TREG_R11;
        load(r, vtop);
        o(0x41); /* REX */
        o(0xff); /* call/jmp *r */
        o(0xd0 + REG_VALUE(r) + (is_jmp << 4));
    }
}

No layers of malloc()ed intermediate representations here! This aligns very nicely with the streaming/messaging approach to high-performance I've taken elsewhere with Polymorphic Write Streams (see above), so I am pretty confident I can make this (a) work and (b) simple/elegant while keeping it (c) fast.

How fast? I obviously don't know yet, but tcc is a fantastic starting point. The following is the current (=wrong) ObjectiveTcc code to drive tcc to build a function that sends a single message:


-(void)generateMessageSendTestFunctionWithName:(char*)name
{
    SEL flagMsg=@selector(setMsgFlag);
    [self functionOnlyWithName:name returnType:VT_INT argTypes:"" body:^{
        [self pushFunctionPointer:objc_msgSend];
        [self pushObject:self];
        [self pushPointer:flagMsg];
        [self call:2];
    }];
}

How often can I do this in one second? On my 2018 high spec but 13" MBP: 300,000 times. Including in-memory linking (though not much of that happening in this example), not including Mach-O generation as that's not implemented yet and writing the whole shebang to disk. I don't anticipate either of these taking appreciably additional time.

If we consider this 2 "lines" of code, one for the function/method header and one for the message, then we can generate binary for 600KLOC/s. So having a medium size program compile and link in about a second or so seems eminently doable, even if I manage to slow the raw Tcc performance down by about an order of magnitude.

(For comparison: the Swift code base that motivated the Rome caching system for Carthage was clocking in at around 60 lines per second with the then Swift compiler. So even with an anticipated order of magnitude slowdown we'd still be 1000x faster. 1000x is good enough, it's the difference between 3 seconds and an hour.)

What's the downside? Tcc doesn't do a lot of optimization. But that's OK as (a) the sorts of optimizations C compilers and backends like LLVM do aren't much use for highly polymorphic and late-bound code and (b) the basics get you around 80% of the way (c) most code doesn't need that much optimization (see above) and (d) machines have become really fast.

And it helps that we aren't doing crazy things like initially allocating function-local variables on the heap or doing function argument copying via vtables that require require leaning on the optimizer to get adequate performance (as in: not 100x slower..).

Defense in Depth

While any of these techniques might be adequate some of the time, it's the combination that I think will make the Objective-Smalltalk tooling a refreshing, pleasant and highly productive alternative to existing toolchains, because it will be reliably fast under all circumstances.

And it doesn't really take (much) rocket science, just a willingness to make this aspect a priority.

Wednesday, October 7, 2015

Jitterdämmerung

So, Windows 10 has just been released, and with it Ahead Of Time (AOT) compilation feature .NET native. Google also just recently introduced ART for Android, and I just discovered that Oracle is planning an AOT compiler for mainstream Java.

With Apple doggedly sticking to Ahead of Time Compilation for Objective-C and now their new Swift, JavaScript is pretty much the last mainstream hold-out for JIT technology. And even in JavaScript, the state-of-the-art for achieving maximum performance appears to be asm.js, which largely eschews JIT techniques by acting as object-code in the browser represented in JavaScript for other languages to be AOT-compiled into.

I think this shift away from JITs is not a fluke but was inevitable, in fact the big question is why it has taken so long (probably industry inertia). The benefits were always less than advertised, the costs higher than anticipated. More importantly though, the inherent performance characteristics of JIT compilers don't match up well with most real world systems, and the shift to mobile has only made that discrepancy worse. Although JITs are not going to go away completely, they are fading into the sunset of a well-deserved retirement.

Advantages of JITs less than promised

I remember reading the copy of the IBM Systems Journal on Java Technology back in 2000, I think. It had a bunch of research articles describing super amazing VM technology with world-beating performance numbers. It also had a single real-world report from IBM's San Francisco project. In the real world, it turned out, performance was a bit more "mixed" as they say. In other words: it was terrible and they had to do an incredible amount of work for the system be even remotely usable.

There was also the experience of the New Typesetting System (NTS), a rewrite of TeX in Java. Performance was atrocious, the team took it with humor and chose a snail as their logo.

Nts at full speed One of the reasons for this less than stellar performance was that JITs were invented for highly dynamic languages such as Smalltalk and Self. In fact, the Java Hotspot VM can be traced in a direct line to Self via the Strongtalk system, whose creator Animorphic Systems was purchased by Sun in order to acquire the VM technology.

However, it turns out that one of the biggest benefits of JIT compilers in dynamic languages is figuring out the actual types of variables. This is a problem that is theoretically intractable (equivalent to the halting problem) and practically fiendishly difficult to do at compile time for a dynamic language. It is trivial to do at runtime, all you need to do is record the actual types as they fly by. If you are doing Polymorphic Inline Caching, just look at the contents of the caches after a while. It is also largely trivial to do for a statically typed language at compile time, because the types are right there in the source code!

So gathering information at runtime simply isn't as much of a benefit for languages such as C# and Java as it was for Self and Smalltalk.

Significant Costs

The runtime costs of a JIT are significant. The obvious cost is that the compiler has to be run alongside the program to be executed, so time compiling is not available for executing. Apart from the direct costs, this also means that your compiler is limited in the types of analyses and optimizations it can do. The impact is particularly severe on startup, so short-lived programs like for example the TeX/NTS are severely impacted and can often run slower overall than interpreted byte-code.

In order to mitigate this, you start having to have multiple compilers and heuristics for when to use which compilers. In other words: complexity increases dramatically, and you have only mitigated the problem somewhat, not solved it.

A less obvious cost is an increase in VM pressure, because the code-pages created by the JIT are "dirty", whereas executables paged in from disk are clean. Dirty pages have to be written to disk when memory is required, clean pages can simply be unmapped. On devices without a swap file like most smartphones, dirty vs. clean can mean the difference between a few unmapped pages that can be swapped in later and a process getting killed by the OS.

VM and cache pressure is generally considered a much more severe performance problem than a little extra CPU use, and often even than a lot of extra CPU use. Most CPUs today can multiply numbers in a single cycle, yet a single main memory access has the CPU stalled for a hundred cycles or more.

In fact, it could very well be that keeping non-performance-critical code as compact interpreted byte-code may actually be better than turning it into native code, as long as the code-density is higher.

Security risks

Having memory that is both writable and executable is a security risk. And forbidden on iOS, for example. The only exception is Apple's own JavaScript engine, so on iOS you simply can't run your own JITs.

Machines got faster

On the low-end of performance, machines have gotten so fast that pure interpreters are often fast enough for many tasks. Python is used for many tasks as is and PyPy isn't really taking the Python world by storm. Why? I am guessing it's because on today's machines, plain old interpreted Python is often fast enough. Same goes for Ruby: it's almost comically slow (in my measurements, serving http via Sinatra was almost 100 times slower than using libµhttp), yet even that is still 400 requests per second, exceeding the needs of the vast majority of web-sites including my own blog, which until recently didn't see 400 visitors per day.

The first JIT I am aware of was Peter Deutsch's PS (Portable Smalltalk), but only about a decade later Smalltalk was fine doing multi-media with just a byte-code interpreter. And native primitives.

Successful hybrids

The technique used by Squeak: interpreter + C primitives for heavy lifting, for example for multi-media or cryptography has been applied successfully in many different cases. This hybrid approach was described in detail by John Ousterhout in Scripting: Higher-Level Programming for the 21st Century: high level "scripting" languages are used to glue together high performance code written in "systems" languages. Examples include Numpy, but the ones I found most impressive were "computational steering" systems apparently used in supercomputing facilities such as Oak Ridge National Laboratories. Written in Tcl.

What's interesting with these hybrids is that JITs are being squeezed out at both ends: at the "scripting" level they are superfluous, at the "systems" level they are not sufficient. And I don't believe that this idea is only applicable to specialized domains, though there it is most noticeable. In fact, it seems to be an almost direct manifestation of the observations in Knuth's famous(ly misquoted) quip about "Premature Optimization":

Experience has shown (see [46], [51]) that most of the running time in non-IO-bound programs is concentrated in about 3 % of the source text.

[..] The conventional wisdom shared by many of today's software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can't debug or maintain their "optimized" programs. In established engineering disciplines a 12 % improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in soft- ware engineering. Of course I wouldn't bother making such optimizations on a one-shot job, but when it's a question of preparing quality programs, I don't want to restrict myself to tools that deny me such efficiencies.

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3 %. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.

[..]

(Most programs are probably only run once; and I suppose in such cases we needn't be too fussy about even the structure, much less the efficiency, as long as we are happy with the answers.) When efficiencies do matter, however, the good news is that usually only a very small fraction of the code is significantly involved.

For the 97%, a scripting language is often sufficient, whereas the critical 3% are both critical enough as well as small and isolated enough that hand-tuning is possible and worthwhile.

I agree with Ousterhout's critics who say that the split into scripting languages and systems languages is arbitrary, Objective-C for example combines that approach into a single language, though one that is very much a hybrid itself. The "Objective" part is very similar to a scripting language, despite the fact that it is compiled ahead of time, in both performance and ease/speed of development, the C part does the heavy lifting of a systems language. Alas, Apple has worked continuously and fairly successfully at destroying both of these aspects and turning the language into a bad caricature of Java. However, although the split is arbitrary, the competing and diverging requirements are real, see Erlang's split into a functional language in the small and an object-oriented language in the large.

Unpredictable performance model

The biggest problem I have with JITs is that their performance model is extremely unpredictable. First, you don't know when optimizations are going to kick in, or when extra compilation is going to make you slower. Second, predicting which bits of code will actually be optimized well is also hard and a moving target. Combine these two factors, and you get a performance model that is somewhere between unpredictable and intractable, and therefore at best statistical: on average, your code will be faster. Probably.

While there may be domains where this is acceptable, most of the domains where performance matters at all are not of this kind, they tend to be (soft) real time. In real time systems average performance matters not at all, predictably meeting your deadline does. As an example, delivering 80 frames in 1 ms each and 20 frames in 20 ms means for 480ms total time means failure (you missed your 60 fps target 20% of the time) whereas delivering 100 frames in 10 ms each means success (you met your 60 fps target 100% of the time), despite the fact that the first scenario is more than twice as fast on average.

I really learned this in the 90ies, when I was doing pre-press work and delivering highly optimized RIP and Postscript processing software. I was stunned when I heard about daily newspapers switching to pre-rendered, pre-screened bitmap images for their workflows. This is the most inefficient format imaginable for pre-press work, with each page typically taking around 140 MB of storage uncompressed, whereas the Postscript source would typically be between 1/10th and 1/1000th of the size. (And at the time, 140MB was a lot even for disk storage, never mind RAM or network capacity.

The advantage of pre-rendered bitmaps is that your average case is also your worst case. Once you have provisioned your infrastructure to handle this case, you know that your tech stack will be able to deliver your newspaper on time, no matter what the content. With Postscript (and later PDF) workflows, you average case is much better (and your best case ridiculously so), but you simply don't get any bonus points for delivering your newspaper early. You just get problems if it's late, and you are not allowed to average the two.

Eve could survive and be useful even if it were never faster than, say, Excel. The Eve IDE, on the other hand, can't afford to miss a frame paint. That means Imp must be not just fast but predictable - the nemesis of the SufficientlySmartCompiler.
I also saw this effect in play with Objective-C and C++ projects: despite the fact that Objective-C's primitive operations are generally more expensive, projects written in Objective-C often had better performance than comparable C++ projects, because the Objective-C's performance model was so much more simple, obvious and predictable.

When Apple was still pushing the Java bridge, Sun engineers did a stint at a WWDC to explain how to optimize Java code for the Hotspot JIT. It was comical. In order to write fast Java code, you effectively had to think of the assembler code that you wanted to get, then write the Java code that you thought might net that particular bit of machine code, taking into account the various limitations of the JIT. At that point, it is a lot easier to just write the damn assembly code. And vastly more predictable, what you write is what you get.

Modern JITs are capable of much more sophisticated transformations, but what the creators of these advanced optimizers don't realize is that they are making the problem worse rather than better. The more they do, the less predictable the code becomes.

The same, incidentally, applies to SufficentlySmart AOT compilers such as the one for the Swift language, though the problem is not quite as severe as with JITs because you don't have the dynamic component. All these things are well-intentioned but all-in-all counter-productive.

Conclusion

Although the idea of Just in Time Compilers was very good, their area of applicablity, which was always smaller than imagined and/or claimed, has shrunk ever further due to advances in technology, changing performance requirements and the realization that for most performance critical tasks, predictability is more important than average speed. They are therefore slowly being phased out in favor of simpler, generally faster and more predictable AOT compilers. Although they are unlikely to go away completely, their significance will be drastically diminished.

Alas, the idea that writing high-level code without any concessions to performance (often justified by misinterpreting or simply just misquoting Knuth) and then letting a sufficiently smart compiler fix it lives on. I don't think this approach to performance is viable, more predictability is needed and a language with a hybrid nature and the ability for the programmer to specify behavior-preserving transformations that alter the performance characteristics of code is probably the way to go for high-performance, high-productivity systems. More on that another time.

What do you think? Are JITs on the way out or am I on crack? Should we have a more manual way of influencing performance without completely rewriting code or just trusting the SmartCompiler?

Update: Nov. 13th 2017

The Mono Project has just announced that they are adding a byte-code interpreter: "We found that certain programs can run faster by being interpreted than being executed with the JIT engine."

Update: Mar. 25th 2021

Facebook has created an AOT compiler for JavaScript called Hermes. They report significant performance improvements and reductions in memory consumption (with of course some trade-offs for specific benchmarks).

Speaking of JavaScript, Google launhed Ignition in 2017, a JS bytecode interpreter for faster startup and better performance on low-memory devices. It also did much better in terms of pure performance than they anticipated.

So in 2019, Google introduced their JIT-Less JS engine. Not even an AOT compiler, just a pure interpreter. While seeing slowdowns in the 20-80% range on synthetic benchmarks, they report real-world performance only a few percent lower than their high-end, full-throttle TurboFan JIT.

Oh, how could I forget WebAssembly? "One of the reasons applications target WebAssembly is to execute on the web with predictable high performance. ".

I've also been told by People Who Know™ that the super-amazing Graal JIT VM for Java is primarily used for its AOT capabilities by customers. This is ironic, because the aptly named Graal pretty much is the Sufficiently Smart Compiler that was always more of a joke than a real goal...but they achieved it nonetheless, and as a JIT no less! And now that we have it, it turns out customers mostly don't care. Instead, they care about predictability, start-up performance and memory consumption.

Last not least, Apple's Rosetta 2 translator for running Intel binaries on ARM is an AOT binary-to-binary translator. And it is much faster, relatively, than the original Rosetta or comparable JIT-based binary translators, with translated Intel binaries clocking in at around 80% the speed of native ARM binaries. Combined with the amazing performance of the M1, this makes M1 Macs frequently faster than Intel Macs at running Intel Mac binaries. (Of course it also includes a (much slower) JIT component as a fallback when the AOT compiler can't figure things out. Like when the target program is itself a JIT... )

Update: Oct. 3rd 2021

It appears that 45% of CVEs for V8 and more than half of "in the wild" Chrome exploits abused a JIT bug, which is why Microsoft is experimenting with what they call SDSM (Super Duper Secure Mode). What's SDSM? JavaScript without the JIT.

Unsurprisingly at this point, performance is affected far less than one might have guessed, even for the benchmarks, and even less in real-world tasks: "Anecdotally, we find that users with JIT disabled rarely notice a difference in their daily browsing.".

Update: Mar. 30th 2023

To my eyes, this looks like startup costs, much of which will be jit startup costs.

HN user jigawatts appears to confirm my suspicion:

Teams is an entire web server written in an interpreted language compiled on the fly using a "Just in Time" optimiser (node.js + V8) -- coupled to a web browser that is basically an entire operating system. You're not "starting Teams.exe", you are deploying a server and booting an operating system. Seen in those terms, 9 seconds is actually pretty good. Seen in more... sane terms, showing 1 KB of text after 9 seconds is about a billion CPU instructions needed per character.
. It also looks to be the root problem, well one of the root problems with Visual Studio's launch time.

Update: Apr. 1st 2023

"One of the most annoying features of Julia is its latency: The laggy unresponsiveness of Julia after starting up and loading packages." -- Julia's latency: Past, present and future

I find it interesting that the term "JIT" is never mentioned in the post, I guess it's just taken as a given.

I am pretty sure I've missed some developments.

Tuesday, September 9, 2014

No Virginia, Swift is not 10x faster than Objective-C

About a month ago, Jesse Squires published a post titled Apples to Apples, documenting benchmark results that he claims show Swift now with a roughly 10x performance advantage over Objective-C. Although completely bogus, the post was retweeted by Chris Lattner (who should know better, and was supposedly mostly interested in highlighting the improvements in the Swift optimizer, rather than the bogus comparison) and has now been referenced a number of times as background knowledge as to the state of Swift. More importantly, though the actual mistake Jesse makes is pretty basic and not that interesting, it does point to some deeper misunderstandings about performance and language that I at least do find interesting.

So what's the mistake? Ironically, given the post's title, is that he is comparing apples to oranges, so to speak. The following table, which shows the time to sort an array of 10000 numbers 10 times in millisecond, illustrates the problem:

NSNumbernative integer
Objective-C6.040.97
Swift11.920.8
Jesse compared the two versions highlighted, so native Swift integers with Objective-C NSNumber object wrappers. All times are for binaries with optimization enabled, the machine was a 13" MBR with 2.9 GHz Intel Core i7 and 8GB of RAM. The integer sort was done using a C integer array and the system qsort() function. When you compare apples to apples, Objective-C has a roughly 2x edge with NSNumbers and is around 18% slower for native integers, at least when using qsort()

Why the 18% disadvantage? The qsort() function is made generically applicable to different types of arrays using a function pointer parameter for the comparison function that itself is parametrized using pointers to the elements to be compared. This means there is a per-comparison overhead of one function call and two pointer dereferences per comparison. That overhead overwhelms the actual comparison operation, which is a single machine instruction on most processors.

Swift, on the other hand, appears to produce a version of the sort function that is specialized to the integer type, with the comparison function inlined to the generated function so there is no function call or pointer dereference overhead. That is very clever and a Good Thing™ for performance. Sort of. The drawback is that this breaks separate compilation, because the functions actually have to be combined during the compile/link process every time it is used (I assume there is caching going on so we only got one per type combination).

Apart from making the compiler/linker slower , possibly significantly so (like C++ headers, though I presume they use LLVM bitcode to optimize the process), it also likely bloats the executable, causing cache and memory pressure. So it's a tradeoff, as usual, and while I think having the ability to specialize at compile-time is good, not being able to control it is not.

Objective-C doesn't have this ability to automagically adapt a function or method to parameters, if you want inlining the relationship has to be known at definition not at point of use. However, if the benefit of inlining is only 21% for the most primitive type, a machine integer, then it is clear that that the set of types for which compile-time specialization is beneficial at all is small.

Cocoa of course already provides specialized collection classes for the byte and unichar types, NSData and NSString respectively. I never quite understood why this wasn't extended to the other primitive types, particularly integer and float/double. On the other hand, the omission never bothered me much, I just implemented those classes myself in MPWFoundation. MPWRealArray even has support for DisplayPostscript binary object sequences, it's that old!

Both MPWRealArray and the corresponding MPWIntArray classes are small and fairly trivial to implement, and once I have them, using a specialized integer or real array is at least as convenient as using an NSArray, just a lot faster. They could also be quite a bit smaller than they are, sharing code either via subclassing or poor-man's generic programming via include files. Once I have a nice OO interface, I can swap out the implementation for something really quick like a dual-pivot integer sort I found in Java-land and adapted to C. (It is surprising just how similar they are at that level). With that sort, the test time drops to 0.56 ms, so 42% faster than the Swift version and almost twice as fast as the system qsort() function.

So the takeaway is that if you are using NSNumber objects in performance-sensitive code: stop. This is always a mistake. The native number types for Objective-C are int, float, double and friends, not NSNumber. After all, how do you perform arithmetic? Either directly on a primitive or by unboxing the NSNumber and then performing arithmetic on the primitive and then reboxing. Use primitive scalar types as much as possible where they make sense.

A second takeaway is that the question "which language is faster" doesn't really make sense, a more relevant and interesting question is "does this language make it hard/possible/easy to write fast code". Objective-C lets you write really fast code, if you want to, because it has the low-level chops and an understandable performance model. Swift so far can achieve reasonable performance at times, ludicrously bad at other times (especially with the optimizer turned off, which hardly fazes Objective-C), with as far as I can tell fairly little predictability or control. Having 10% faster (or slower) performance for code I don't particularly care about is not worth nearly as much as the knowledge that I can get the 1-5% of code that I do care about in shape no matter what. Swift is definitely not there yet, and given the direction it is taking I am not sure whether it will allow that kind of control, at least in comprehensible ways.

A third point is something more general about language. The whole argument that NSNumber and NSArray are "built in" somehow and int is not displays a lack of understanding of Objective-C that to me seems staggering. Even more so, the whole idea that you must only use what comes provided with Cocoa and are not allowed to build your own flies in the face of modern language design, throwing us back to the times of BASIC (Arthur Luehrmann, in the comments):

I had added graphics primitives to Dartmouth Basic around 1976 and developed an X-Y pen-plotter to carry out graphics commands mixed in with the text being sent to Teletype terminals.
The idea is that is that a language is a bundle of features, or to put it linguistically, a language is a list of words to be used as is.

Both C and Pascal introduced me to a new notion: that languages are not lists of words, but means of constructing your own words. For example, C did/does not have I/O as a language feature. I/O was just another set of functions placed in a library that you included just like any of your own functions/libraries. And there were two sets of them, the stdio package and the raw Unix I/O.

At around the same time I was introduced to both top-down and bottom-up programming. Both assume there is a recursive de-composition of the problem at hand (assuming the problem sufficiently complex to warrant it).

In bottom-up programming, you build up the vocabulary (the procedures and functions) that are necessary to succinctly describe your top-level problem, and then you describe your program in terms of that vocabulary you created. In top-down programming, you start at the other end and write your top-level program in terms of the vocabulary you wish you had to optimally describe the problem. Then you fill in the blanks.

In both, you define your own language to fit the problem, then you solve the problem using the language you defined. You would not add plotting commands to the language, you would either add plotting commands as a library or, if that were not possible, a way of adding plotting commands as a library. You would not look at whether plotting comes with the "standard library" or not. To quote Guy Steele in Growing a Language:

This is the nub of what I want to say. A language design can no longer be a thing. It must be a pattern—a pattern for growth—a pattern for growing the pattern for defining the patterns that programmers can use for their real work and their main goal.
So build your own libraries, your own abstractions. It's easy, fun and useful. It's the heart of Domain Driven Design, probably the most productive and effective software construction technique we as an industry have come up with to date. See what abstractions you can build easily and which ones are hard. Analyze the latter and you have started on the road to modern language design.

CORRECTION (June 4th 2015): I misattributed the Dartmouth BASIC quote to Cathy Doser, when the comment line on the Macintosh folklore entry clearly said Arthur Luehrmann. (Cathy's comment was a bit earlier).

Tuesday, May 27, 2014

Live objects vs. static types for code completion in Objective-Smalltalk

Objective-Smalltalk is now getting into a very nice virtuous cycle of being more useful, therefore being used more and therefore motivating changes to make it even more useful. One of the recent additions was autocomplete, for both the tty-based and the GUI based REPLs.

I modeled the autocomplete after the one in bash and other Unix shells: it will insert partial completions without asking up the point that they become ambiguous. If there is no unambiguous partial completion, it displays the alternatives. So a usual sequence is <TAB> -> something is inserted <TAB> again -> list is displayed, type one character to disambiguate, <TAB> again and so on. I find that I get to my desired result much quicker and with fewer backtracks than with the mechanism Xcode uses.

Fortunately, I was able to wrestle NSTextView's completion mechanism (in ShellView borrowed from the excellent FSCript) to provide these semantics rather than the built in ones.

Another cool thing about the autocomplete is that it is very precise, unlike for example FScript which as far as I can tell just offers all possible symbols. How can this be, when Objective-Smalltalk is (currently) dynamically typed and we all know that good autocomplete requires static types? The reason is simply that there is one thing that's even better than having the static types available: having the actual objects themselves available!

The two REPLs aren't just syntax-aware, they also evaluate the expression as much as needed and possible to figure out what a good completion might be. So instead of having to figure out the type of the object, we can just ask the object what messages it understands. This was very easy to implement, almost comically trivial compared to a full blown static type-system.

So while static types are good for this purpose, live objects are even better! The Self team made a similar discovery when they were working on their optimizing compiler, trying both static type inference and dynamic type feedback. Type feedback was both simpler and performed vastly better and is currently used even for optimizing statically typed languages such as Java.

Finally, autocomplete also works with Polymorphic Identifiers, for example file:./a<TAB> will autocomplete files in the current directory starting with the letter 'a' (and just fi<TAB> will autocomplete to the file: scheme). Completion is scheme-specific, so any schemes you add can provide their own completion logic.

Like all of Objective-Smalltalk, this is still a work in progress: not all syntactic constructs support completions, for example Polymorphic Identifiers don't support complex paths and there is no bracket matching. However, just like Objective-Smalltalk, what is there is quite useful and often already better what else is out there in small areas.

HN

Saturday, May 3, 2014

The sp(id)y subset, or Avoiding Copeland 2010 with Objective-C 1984

In my recent post on Cargo Cult Typing, I mentioned a concept I called the id subset. Briefly, it is the subset of Objective-C that deals only with object pointers, or id's. There has been some misunderstanding that I am opposed to types. I am not, but more on that another time.

One of the many nice properties of the (transitive) id subset is that it is dynamically (memory) safe, just like Smalltalk. That is, as long as all arguments and return values of your message are objects, you can never dereference a pointer incorrectly, the worst that can happen is that you get a "Message not understood" that can be caught and handled by the object in question or raised as an exception. The reason this is safe is that objc_msgSend() will make sure that methods will only ever be invoked on objects of the correct class, no matter what the (possibly incorrect, or unavailable) static type says.

So no de-referencing an incorrect pointer, no scribbling over random bits of memory. In fact, this is the vaunted "pointer safety" that John Siracusa says requires ditching native compiled languages like Objective-C for VM based languages. The idea that a VM with an interpreter or a JIT was required for pointer safety was never true, of course, and it's interesting that both Google and Microsoft are turning to Ahead of Time (AOT) compilation in their newest SDKs, for performance reasons.

Did someone mention "performance"? :-)

Another nice aspect of the id subset is that it makes reflective code a lot simpler. And simplicity usually also translates to speed. How much speed? Apple's NSInvocation class has to deal with interpreting C type information at runtime to then construct proper stack frames dynamically for all possible C types. I think it uses libffi, though it may be some equivalent library. This is slow, around 340.1ns per message send on my 13" MBPR. By restricting itself to the id subset, my own MPWFastInvocation class's dispatch is much simpler, just a switch invoking objc_msgSend() with a different number of arguments.

The simplicity of MPWFastInvocation also pays off in speed: 6.2ns per message-send on the same machine. That's 50 times faster than NSInvocation and only 2-3x slower than a normal message send. In fact, once you're that close, things like IMP-caching (4 ns) start to make sense, especially since they can be hidden behind a nice interface. Using a C Macro and the IMP stashed in a public instance var takes the time down to 3 ns, making the reflective call via an object effectively as fast as the non-reflective code emitted by the compiler. Which is nice, because it makes reflective techniques much more feasible for wider varieties of code, which would be a good thing.

The speed improvement is not because MPWFastInvocation is better than NSInvocation, it is decidedly not, it is because it is solving a much, much simpler problem. By sticking to the safe id subset.

On HN.

Saturday, March 15, 2014

The Siren Call of KVO and (Cocoa) Bindings

The Call of the Cool

I like bindings. I also like Key Value Observing. What they do is undeniably cool: you do some initial setup, and presto: magic! You change a value over here, and another value over there changes as well. Action at a distance. Power.

What they do is also undeniably valuable. I'd venture that nobody actually likes writing state maintenance and update code such as the following: when the user clicks this button, or finishes entering text in that textfield, take the value and put it over here. If the underlying value changes, update the textfield. If I modify this value, notify these clients that the value has changed so they can update themselves accordingly. That's boring. There is no glory in state maintenance code, just potential for failure when you screw up something this simple.

Finally, their implementation is also undeniably cool: observing an attribute of a generic object creates a private subclass for that object (who says we can't do prototype-based programming in Objective-C?), swizzles the object's class pointer to that private subclass and then replaces the attribute's (KVO-compliant) accessor methods with new ones that hook into the KVO system.

Despite these positives, I have actively removed bindings code from projects I have worked on, don't use either KVO or bindings myself and generally recommend staying away from them. Why on earth would I do that?

Excursion: Constraint Solvers

Before I can answer that question, I have to go back a little and talk about constraint solvers.

The idea of setting up relationships once and then having the system maintain them without manually shoveling values back and forth is not exactly new, the first variant I am aware of was Sketchpad, Ivan Sutherland's PhD Thesis from 1961/63 (here with narration by Alan Kay):
I still love Ivan's answer to the question as to how he could invent computer graphics, object orientation and constraint solving in one fell swoop: "I didn't know it was hard".

The first system I am aware of that integrated constraint solving with an object-oriented programming language was ThingLab, implemented on top of Smalltalk by Alan Borning at Xerox PARC around 1978 (where else...):


I really recommend having a look at the ThingLab papers, for example The Programming Language Aspects of ThingLab, a Constraint-Oriented Simulation Laboratory (pdf). Among the features ThingLab adds to Smalltalk are Paths, symbolic references to parts of an object.

While the definition of a paths is simple, the idea behind it has proved quite powerful and has been essential in allowing constraint- and object-oriented metaphors to be integrated. [..] The notion of a path helps strengthen [the distinction between inside and outside of an object] by providing a protected way for an object to provide external reference to its parts and subparts.
Yes, that's a better version of KVC. From 1981. Alan Borning's group at the University of Washington continued working on constraint solvers for many years, with the final result being the Cassowary linear constraint solver (based on the simplex algorithm) that was picked up by Apple for Autolayout. The papers on Cassowary and constraint hierarchies should help with understanding why Autolayout does what it does.

A simpler form of constraints are one-way dataflow constraints.

A one-way, dataflow constraint is an equation of the form y = f(x1,...,xn) in which the formula on the right side is automatically re-evaluated and assigned to the variable y whenever any variable xi. If y is modified from outside the constraint, the equation is left temporarily unsatisfied, hence the attribute “one-way”. Dataflow constraints are recognized as a powerful programming methodology in a variety of contexts because of their versatility and simplicity. The most widespread application of dataflow constraints is perhaps embodied by spreadsheets.
A group at CMU built enough of these systems that after using them for 10-15 years they were able to publish experience reports that are very much worth reading: Lessons Learned About One-Way, Dataflow Constraints in the Garnet and Amulet Graphical Toolkits (pdf) or the slightly more comprehensive Postscript version.

The most important lessons they found were the following:

  1. constraints should be allowed to contain arbitrary code that is written in the underlying toolkit language and does not require any annotations, such as parameter declarations
  2. constraints are difficult to debug and better debugging tools are needed
  3. programmers will readily use one-way constraints to specify the graphical layout of an application, but must be carefully and time-consumingly trained to use them for other purposes.
However, these really are just the headlines, and particularly for Cocoa programmers the actual reports are well worth reading as they contain many useful pieces of information that aren't included in the summaries.

Back to KVO and Cocoa Bindings

So what does this history lesson about constraint programming have to do with KVO and Bindings? You probably already figured it out: bindings are one-way dataflow constraints, specifically with the equation limited to y = x1. more complex equations can be obtained by using NSValueTransformers. KVO is more of an implicit invocation mechanism that is used primarily to build ad-hoc dataflow constraints.

The specific problems of the API and the implementation have been documented elsewhere, for example by Soroush Khanlou and Mike Ash, who not only suggested and implemented improvements back in 2008, but even followed up on them in 2012. All these problems and workarounds demonstrate that KVO and Bindings are very sophisticated, complex and error prone technologies for solving what is a simple and straightforward task: keeping data in sync.

To these implementation problems, I would add performance: even just adding the willChangeValueForKey: and didChangeValueForKey: message sends in your setter (these are usually added automagically for you) without triggering any notifications makes that setter 30 times slower (from 5ns to 150ns on my computer) than a simple setter that just sets and retains the object.


-(void)setFoo:newFoo
{
    [newFoo retain];
    [foo release];
    foo=newFoo;
}

-(void)setFoo:newFoo
{
    [self willChangeValueForKey:@"foo"];
    [newFoo retain];
    [foo release];
    foo=newFoo;
    [self didChangeValueForKey:@"foo"];
}

One of these is 30 times slower than the other

Actually having that access trigger a notification takes the penalty to a factor of over 100 ( 5ns vs over 540ns), even when there is only a single observer. I am pretty sure it gets worse when there are lots of observers (there used to be an O(n^3) algorithm in there, that was fortunately fixed a while ago). While 500ns may not seem a lot when dealing with UI code, KVO tends to be implemented at the model layer in such a way that a significant number of model data accesses incur at least the base penalties. For example KVO notifications were one of the primary reasons for NSOperationQueue's somewhat anemic performance back when we measured it for the Leopard release.

Not only is the constraint graph not available at run time, there is also no direct representation at coding time. All there is either code or IB settings that construct such a graph indirectly, so the programmer has to infer the graph from what is there and keep it in her head. There are also no formulae, the best we can do are ValueTransformers and keyPathsForValuesAffectingValueForKey.

As best as I can tell, the reason for this state of affairs is that there simply wasn't any awareness of the decades of research and practical experience with constraint solvers at the time (How do I know? I asked, the answer was "Huh?").

Anyway, when you add it all up, my conclusion is that while I would really, really, really like a good constraint solving system (at least for spreadsheet constraints), KVO and Bindings are not it. They are too simplistic, too fragile and solve too little of the actual problem to be worth the trouble. It is easier to just write that damn state maintenance code, and infinitely easier to debug it.

I think one of the main communication problems between advocates for and critics of KVO/Bindings is that the advocates are advocating more for the concept of constraint solving, whereas critics are critical of the implementation. How can these critics not see that despite a few flaws, this approach is obviously The Right Thing™? How can the advocates not see the obvious flaws?

Functional Reactive Programming

As far as I can tell, Functional Reactive Programming (FRP) in general and Reactive Cocoa in particular are another way of scratching the same itch.

[..] is an integration of declarative [..] and imperative object-oriented programming. The primary goal of this integration is to use constraints to express relations among objects explicitly -- relations that were implicit in the code in previous languages.
Sounds like FRP, right? Well, the first "[..]" part is actually "Constraint Imperative Programming" and the second is "constraints", from the abstract of a 1994 paper. Similarly, I've seen it stated that FRP is like a spreadsheet. The connection between functional programming and constraint programming is also well known and documented in the literature, for example the experience report above states the following:
Since constraints are simply functional programming dressed up with syntactic sugar, it should not be surprising that 1) programmers do not think of using constraints for most programming tasks and, 2) programmers require extensive training to overcome their procedural instincts so that they will use constraints.
However, you wouldn't be able to tell that there's a relationship there from reading the FRP literature, which focuses exclusively on the connection to functional programming via functional reactive animations and Microsoft's Rx extensions. Explaining and particularly motivating FRP this way has the fundamental problem that whereas functional programming, which is per definition static/timeless/non-reactive, really needs something to become interactive, reactivity is already inherent in OO. In fact, reactivity is the quintessence of objects: all computation is modeled as objects reacting to messages.

So adding reactivity to an object-oriented language is, at first blush, non-sensical and certainly causes confusion when explained this way. I was certainly confused, because until I found this one paper on reactive imperative programming, which adds constraints to C++ in a very cool and general way, none of the documentation, references or papers made the connection that seemed so blindingly obvious to me. I was starting to question my own sanity.

Architecture

Additionally, one-way dataflow constraints creating relationships between program variables can, as far as I can tell, always be replaced by a formulation where the dependent variable is simply replaced by a method that computes the value on-demand. So instead of setting up a constraint between point1.x and point2.x, you implement point2.x as a method that uses point1.x to compute its value and never stores that value. Although this may evaluate more often than necessary rather than memoizing the value and computing just once, the additional cost of managing constraint evaluation is such that the two probably balance.

However, such an implementation creates permanent coupling and requires dedicated classes for each relationship. Constraints thus become more of an architectural feature, allowing existing, usually stateful components to be used together without having to adapt each component for each individual ensemble it is a part of.

Panta Rhei

Everything flows, so they say. As far as I can tell, two different communities, the F(R)P people and the OO people came up with very similar solutions based on data flow. The FP people wanted to become more reactive/interactive, and achieved this by modeling time as sequence numbers in streams of values, sort of like Lucid or other dataflow languages.

The OO people wanted to be able to specify relationships declaratively and have their system figure out the best way to satisfy those constraints, with a large and useful subset of those constraints falling into the category of the one-way dataflow constraints that, at least to my eye, are equivalent to FRP. In fact, this sort of state maintenance and update-propagation pops up in lots of different places, for example makefiles or other build systems, web-server generators, publication workflows etc. ("this OmniGraffle diagram embedded as a PDF into this LaTeX document that in turn becomes a PDF document" -> the final PDF should update automatically when I change the diagram, instead of me having to save the diagram, export it to PDF and then re-run LaTeX).

What's kind of funny is that these two groups seem to have converged in essentially the same space, but they seem to not be aware of each other, maybe they are phase-shifted with respect to each other? Part of that phase-shift is, again, communication. The FP guys couch everything in must destroy all humans er state rethoric, which doesn't do much to convince OO guys who know that for most of their programs, state isn't an implementation detail but fundamental to their applications. Also practical experience does not support the idea that the FP approach is obvious:

Unfortunately, given the considerable amount of time required to train students to use constraints in a non-graphical manner, it does not seem reasonable to expect that constraints will ever be widely used for purposes other than graphical layout. In retrospect this result should not have been surprising. Business people readily use constraints in spreadsheets because constraints match their mental model of the world. Similarly, we have found that students readily use constraints for graphical layout since constraints match their mental model of the world, both because they use constraints, such as left align or center, to align objects in drawing editors, and because they use constraints to specify the layout of objects in precision paper sketches, such as blueprints. However, in their everyday lives, students are much more accustomed to accomplishing tasks using an imperative set of actions rather than using a declarative set of actions.
Of course there are other groups hanging out in this convergence zone, for example the Unix folk with their pipes and filters. That is also not too surprising if you look at the history:
So, we were all ready. Because it was so easy to compose processes with shell scripts. We were already doing that. But, when you have to decorate or invent the name of intermediate files and every function has to say put your file there. And the next one say get your input from there. The clarity of composition of function, which you perceived in your mind when you wrote the program, is lost in the program. Whereas the piping symbol keeps it. It's the old thing about notations are important.
I think the familiarity with Unix pipes also increases the itch: why can't I have that sort of thing in my general purpose programming language? Especially when it can lead to very concise programs, such as the Quartz-like graphics subsystem Gezira written in under 400 lines of code using the Nile dataflow language.

Moving Forward

I too have heard the siren sing. I also think that a more spreadsheet-like programming model would not just make my life as a developer easier, it might also make software more approachable for end-user adaptation and tinkering, contributing to a more meaningful version of open source. But how do we get there? Apart from a reasonable implementation and better debuggingsupport, a new system would need much tighter language integration. Preferably there would be a direct syntax for expressing constraints such as that available in constraint imperative programming languages or constraint extensions to existing languages like Ruby or JavaScript. This language support should be unified as much as possible between different constraint systems, not one mechanism for Autolayout and a completely different one for Bindings.

Supporting constraint programming has always been one of the goals of my Objective-Smalltalk project, and so far that has informed the PolymorphicIdentifiers that support a uniform interface for data backed by different types of stores, including one or more constraint stores supporting cooperating solvers, filesystems or web-sites. More needs to be done, such as extending the data-flow connector hierarchy to conceptually integrate constraints. The idea is to create a language that does not actually include constraints in its core, but rather provides sufficient conceptual, expressive and implementation flexibility to allow users to add such a facility in a non-ad-hoc way so that it is fully integrated into the language once added. I am not there yet, but all the results so far are very promising. The architectural focus of Objective-Smalltalk also ties in well with the architectural interpretation of constraints.

There is a lot to do, but on the other hand I think the payback is huge, and there is also a large body of existing theoretical, practical and empirical groundwork to fall back on, so I think the task is doable. Your feedback, help and pull requests would be very much appreciated!

Discussion on Hacker News.

Update: I finally have some code and a brief article discussing it.

Tuesday, October 15, 2013

Should you use CoreData?

The answer, of course, is "it depends".

Now that we have that out of the way, I want to have a look at Drew Crawford's You should use Core Data, which manages to come up with a less nuanced answer in its 2943 words. It's an older article (2012), but recently came to my attention via Drew McCormack (@drewmccormack): "Great post", he wrote, and after reading the article I not just disagreed, but found that Twitter wasn't really adequate for writing up all the things wrong with that article.

Where to start? Maybe at the beginning, right in the first paragraph the following is called out as a "myth":

Among them, Core Data is designed for something–not really sure what–but whatever it is, it’s a lot more complicated than what I need to do in my project.
First here is a categorical mistake, because unless Drew knows the exact requirements and engineering trade-offs in every iOS application, he can't know whether this is true or false, fact or myth.

The second mistake is that the basic statement "CoreData was designed for something else" is actually true. CoreData's design dates back to NeXT's Enterprise Object Framework or EOF for short, and EOF was designed as an ORM for talking to corporate relational database servers, with a variety of alternate back-ends for non-relational DBs (including the way cool 3270 adapter!).

Obviously the implementation is different and the design has diverged by now, but that is the basic design, and yes, that does do something that is more complicated than what some (many?) developers need.

Details.

Next sentence, next problem:

I just want to save some entities to disk.
I may well be reading too much into this, but using the word entities already bakes so many assumptions into the problem statement. When I actually do want to save entities, databases in general and CoreData in particular are somewhat higher on my list of technologies, but quite often I don't start with ERM, and therefore just have objects, XML or other data. While these could be ER-modeled with varying degrees of difficulty/success, they certainly don't have to be, and it's often not the best choice.

In the enterprise system described in REST: Advanced Research Topics and Practical Applications, we removed the database from the rewrite, because the variety of the data meant converting the DB into a key-value store one way or another, or having a schema that's about an A0 page in small print (a standards body had developed such a schema and I think we even got a poster). We instead ended up converging on an EventPoster architecture that kept the original XML feed files around and parsed them into objects as necessary. No ERM here.

The next couple of paragraphs go off on an ad-hominem straw man tangent making WAG assumptions about the provenance of iOS developers and more WAGs about why that (assumed) provenance causes said developers to have these misconceptions. Those "misconceptions" that actually turn out to be true. Although largely irrelevant, it does contain some actual misinformation. For example the fact that categories don't get linked with static libraries is not an LLMV bug, it's a consequence of the combined semantics of static libraries and Objective-C categories irrespective of compiler/linker versions.

Details.

Joel

Then there's another tangent to Joel Spolsky's article on Things You Should Never Do (the link in Drew's article is dead), such as rewriting legacy code from scratch, which Joel describes categorically as "single worst strategic mistake" that any software company can make. In the words of the great Hans Bethe: "Ach, that isn't in wrong!":

  1. Just because Joel wrote something makes it right or applicable why?
  2. While Joel makes some good points and is right to counter a tendency to not want to deal with existing code, his claim is most certainly wrong in its absoluteness, both empirically (the system referenced above was a rewrite with tremendous benefits, then there's OS X vs. trying to keep fixing Classic Mac OS indefinitely etc.) and logically: it only holds true if all your accumulated complexity is of the essential kind, and none if of the accidental kind. That idea seems ludicrous to me, virtually all software is rushed, has shifting requirements that are only understood after the fact, has historical limitations (such as having to run in 128KB of RAM on a 7MHz CPU with no MMU) etc.

    Sometimes a rewrite is warranted, though you should obviously be wary and not undertake such a project lightly.

  3. Of course the biggest non-sequitur in the whole tangent is "How on earth does the problem of reading source code and rewriting legacy systems apply to this situation?". Apart from "not at all"? We don't have access to CoreData source code and there is no question about us rewriting it. Well, actually, I used to have such access, but even though my remit would have allowed some rewrites, I don't think I could have done a better job for the technology constraints chosen. The question is whether to use it or not, including whether the technology constraints are appropriate.
  4. And no, not every persistence solution ends up as effectively a rewrite of CoreData.

Details

After that ill-conceived tangent, the article goes right back to the next unsubstantiated ad-hominem:
Here are some things you probably haven’t thought about when architecting your so-called data stack:
Apart from the attitude somewhat unbefitting to someone who gets so much wrong, well, Nein, but lets's look at some of these in detail:
  • Handling future changes to the data schema: this just isn't hard if you're not using a relational database. In fact high variability of the schema was one of the reasons for ditching the DB in you know...
  • Bi-directional synchronization with servers...hah hah...!
  • Bi-directional synchronization with peers...see above
  • Undo support: gosh, I wish there were some sort of facility that would manage undo support on my model objects, if I had my way, Apple would add this and call it NSUndoManager. Without such a useful class, I'll just have to do complicated things like renaming old versions of my data file or storing deltas.
  • Multithreading. Really? Multithreading?? If you think CoreData makes multithreading easier rather than harder, I have both a nice bridge and some oceanfront property in Nevada to sell you.
  • Device/time constraints: the performance ca(na)rd. CoreData is slow and memory intensive. It makes up for this by adding the ability and the requirement for the developer to continuously tune the working set, if that is an option. If continuously minimizing/tuning the working set is not an option (i.e. you have some large datasets), you're hosed.

Details

Magic

Then we're back to the familiar ad-hominems/straw-men and non-sequitur analogies for a couple of paragraphs, nothing to see there. The whole analogy to Cocoa is useless: yes, Cocoa is good. How does this make CoreData good (it may or may not be, there simply is no connection) or appropriate for my tasks? Also, Cocoa in particular and Apple code in general is not "magic". I've been there, I've seen it and I fixed some of it. A lot of it is good, some excellent, but some not so much (sleep(5) anyone?).

The section entitled But, there are times not to use CoreData, right? could have been the saving grace, but alas the author blows it again. Having a "mark all as read" option in a NewsReader application is a "strange" "corner case"? Right. Let me turn this section around: there is a very narrow range of application areas where CoreData is or might be appropriate. Mostly read-only data, highly regular (table-like) structure, no need for bulk operations ever, preferably no trees either and no binary data. Fairly loose performance requirements.

What Apple says

The section on "what Apple says" is also gold, though I have to give credit to the Apple doc writers for managing to strongly suggest things without actually explicitly claiming them, strongly enough to fool this particular writer:
Apple’s high-level APIs can be much faster than ‘optimized’ code at lower levels.
This is obviously true, but completely meaningless. My tricycle can be faster than a Ferrari, for example if the Ferrari is out of gas or has a flat tire, or just parked. When we actually measured the performance of applications adopting CoreData at Apple we invariably got a significant performance regression. Then a lot of effort would be expended by all the teams involved in order to fix the regression, optimization effort that hadn't been expended on the original application, usually making up some of the shortfall.

I find the code reduction claim also specious, at least when stated as an unquestionable fact like this. In my experience, code size was reduced when removing CoreData and its predecessor EOF. Had we had exactly the requirements that CoreData/EOF were designed for (talking to existing relational enterprise databases), the result would almost certainly been different, but those were not our requirements, and I doubt that most iOS apps have those requirements (and in fact, CoreData didn't even support taking to external SQL databases, at all, a puzzling development for all the EOF veterans).

For managing small amounts of data inside in iOS application, CoreData is almost always overkill, because it effectively simulates an entire client/server enterprise application stack within your phone or desktop. The performance and complexity costs to pay for that overkill are substantial.

So Should You Use Core Data?

As I wrote: it depends.

The article in question at least adds no useful information to answering that question, only inappropriate analogies, repeated claims without any supporting evidence and lots of ad-hominems that are probably meant to be witty. If you believe its last section, the only reason developers don't use CoreData is because they are naive, lazy or ignorant.

This is evidently not true and ignores even the most basic concepts that one would apply to answering the question, such as, say, fitness for purpose. CoreData may well be the best implementation of an in-process-ORM simulating a client-server enterprise app there is, and there is good evidence that this is in fact true (certainly the people on it are top notch and some of the code is amazing). However, all that doesn't help if that's not what you need.

Considering the gleefully paraded ignorance of not just alternatives but various other programming aspects, I'd take the unconditional advice this article gives with a pinch of salt. Mountain-sized pinch, that is.