Saturday, June 28, 2014

Compiler Writers Gone Wild: ARC Madness

In this week's episode of CWGW: This can't possibly crash, yet crash it does.

In a project I am currently working on, the top crash for the last week or so has been the following NSOutlineView delegate method:

- (BOOL)outlineView:(NSOutlineView *)outlineView isGroupItem:(id)item
{
    return NO;
}
The team had been ignoring it because it just didn't make any sense, and they had other things to do (fortunately there weren't many other crashes; the app is pretty solid at this point). When they turned to me, I was initially puzzled as well, because all this method should do on x86 is stuff a zero into %eax and return. That cannot possibly crash[1], so everyone just assumed that the stack traces were off, as they frequently are.

Fortunately, I had just looked at the project settings and noticed that we were compiling with -O0, i.e. with optimizations disabled, and my suspicion was that ARC was doing some unnecessary retaining. That suspicion turned out to be on the money: otool -Vt revealed that ARC had turned our innocuous return NO; into the following monstrosity:

-[SomeOutlineViewDelegate outlineView:isGroupItem:]:
00000001001bfdb0        pushq   %rbp
00000001001bfdb1        movq    %rsp, %rbp
00000001001bfdb4        subq    $0x30, %rsp
00000001001bfdb8        leaq    -0x18(%rbp), %rax
00000001001bfdbc        movq    %rdi, -0x8(%rbp)
00000001001bfdc0        movq    %rsi, -0x10(%rbp)
00000001001bfdc4        movq    $0x0, -0x18(%rbp)
00000001001bfdcc        movq    %rax, %rdi
00000001001bfdcf        movq    %rdx, %rsi
00000001001bfdd2        movq    %rcx, -0x30(%rbp)
00000001001bfdd6        callq   0x10027dbaa             ## symbol stub for: _objc_storeStrong
00000001001bfddb        leaq    -0x20(%rbp), %rdi
00000001001bfddf        movq    $0x0, -0x20(%rbp)
00000001001bfde7        movq    -0x30(%rbp), %rsi
00000001001bfdeb        callq   0x10027dbaa             ## symbol stub for: _objc_storeStrong
00000001001bfdf0        leaq    -0x20(%rbp), %rdi
00000001001bfdf4        movabsq $0x0, %rsi
00000001001bfdfe        movl    $0x1, -0x24(%rbp)
00000001001bfe05        callq   0x10027dbaa             ## symbol stub for: _objc_storeStrong
00000001001bfe0a        movabsq $0x0, %rsi
00000001001bfe14        leaq    -0x18(%rbp), %rax
00000001001bfe18        movq    %rax, %rdi
00000001001bfe1b        callq   0x10027dbaa             ## symbol stub for: _objc_storeStrong
00000001001bfe20        movb    $0x0, %r8b
00000001001bfe23        movsbl  %r8b, %eax
00000001001bfe27        addq    $0x30, %rsp
00000001001bfe2b        popq    %rbp
00000001001bfe2c        retq
00000001001bfe2d        nopl    (%rax)
Yikes! Of course, this is how ARC works: it generates an insane number of retains and releases (hidden inside objc_storeStrong()), then relies on a special optimization pass to remove the insanity and leave behind only the necessary retains/releases. Turn on the "standard" optimization level -Os and we get the following, much more reasonable result:
-[WLTaskListsDataSource outlineView:isGroupItem:]:
00000001000e958a        pushq   %rbp
00000001000e958b        movq    %rsp, %rbp
00000001000e958e        xorl    %eax, %eax
00000001000e9590        popq    %rbp
00000001000e9591        retq
Much better!

It isn't clear why those retains/releases were crashing; all the objects involved looked OK in the debugger. But at least we will no longer be puzzled by code that can't possibly crash...crashing, and we therefore have a better chance of actually debugging it.

Another issue is performance. I just benchmarked the following equivalent program:

#import <Foundation/Foundation.h>

@interface Hi : NSObject {}
- (BOOL)doSomething:(id)arg1 with:(id)arg2;
@end

@implementation Hi
- (BOOL)doSomething:(id)arg1 with:(id)arg2
{
    return NO;
}
@end

int main(int argc, char *argv[])
{
    Hi *hi = [Hi new];
    for (int i = 0; i < 100000000; i++) {
        [hi doSomething:hi with:hi];
    }
    return 0;
}
On my 13" MBPR, it runs in roughly 0.5 seconds with ARC disabled and in 13 seconds with ARC enabled. That's 26 times slower, meaning we now have a highly non-obvious performance model in which performance is extremely hard to predict and control. The simple and obvious performance model was one of the main reasons Objective-C code tended to actually be quite fast when even minimal effort was spent on performance, despite the fact that some parts of Objective-C aren't all that fast.

I find the approach of handing off all control and responsibility to the optimizer writers worrying. My worries stem partly from the fact that I've never actually seen that approach work in the past. With ARC it also happens that the optimizer sometimes can't figure out that a retain/release pair isn't needed, so you need to sprinkle a few __unsafe_unretained annotations throughout your code (not many, but you need to figure out which ones).

Good optimization has always been something that needed a human touch (with automatic assistance); the message "just trust the compiler" doesn't resonate with me. Especially since, and this is the other part I am worried about, compiler optimizations have been getting crazier and crazier. clang, for example, thinks there is nothing wrong with producing two different values when dereferencing the same pointer, at the same time, with no stores in between (source: http://blog.regehr.org/archives/767):

#include <stdio.h>
#include <stdlib.h>
 
int main() {
  int *p = (int*)malloc(sizeof(int));
  int *q = (int*)realloc(p, sizeof(int));
  *p = 1;
  *q = 2;
  if (p == q)
    printf("%d %d\n", *p, *q);
}
I tested this with clang-600.0.34.4 on my machine and it also gives the nonsensical result 1 2. There are more examples, which I also wrote about in my post cc -Osmartass. Of course, Swift moves further in this direction, with expensive default semantics and reliance on the compiler to remove the resulting glaring inefficiencies.

In what I've seen reported and tested myself, this approach results in differences between normal builds and -Ofast-optimized builds of more than a factor of 100. That's not close to being OK, and it makes code much harder to understand and optimize. My guess is that we will be looking at assembly a lot more when optimizing Swift than we ever did in Objective-C, and then scratching our heads as to why the optimizer didn't manage to optimize that particular piece of code.

I fondly remember the "Java optimization" WWDC sessions back when we were supposed to rewrite all our code in that particular new hotness. In essence, we were given a model of the capabilities of HotSpot's JIT optimizer: in order to optimize code we had to know what the resulting generated code would be, what the optimizer could handle (not a lot), and then translate that information back into the original source code. At that point, it's simpler to just write the assembly you are trying to goad the JIT into emitting yourself. Or portable Macro Assembler. Or object-oriented portable Macro Assembler.

[1] Well, it could if the stack had previously reached its limit.

5 comments:

Anonymous said...

I would think that dereferencing "p" after the call to "realloc" (in particular, storing something into that location without first making sure that p == q) invokes Undefined Behaviour (TM), since "realloc" could just as well have moved the memory, invalidating the originally allocated address. So, technically, clang would have the right to make the snippet do whatever it feels like...

Marcel Weiher said...

That is the justification, yes. Doesn't change the fact that it is silly.

Matthew Fernandez said...

How is it silly? If you write something that invokes undefined behaviour, the compiler is free to do anything up to and including make demons fly out of your nose. You can't write broken code and complain that the compiler didn't agree with your invented semantics of the chosen language. Well, you can, but you won't get much sympathy.

Unknown said...

Best to assume that realloc always *re*-allocs. If the standard explicitly declared that realloc sometimes does not realloc (it declares no such thing: http://man.cx/realloc), then you could do what you did; otherwise, the previous pointer points to freed memory, and what you did was summon arbitrary demons.

Otherwise, I liked the post. Automatic optimization is always a bit hairy and very hard to lean on. For all the noise that gets made about optimization and smart compilers, there's really something to be said for a transparent execution/performance model.

Have you seen this?: http://prog21.dadgum.com/40.html

Why bother said...

I don't get the point of the benchmark program. The 13 seconds figure is definitely from a Debug build. Please use a Release configuration for any benchmark claims.