Friday, November 13, 2020

M1 Memory and Performance

The M1 Macs are out now, and not only does Apple claim they're absolutely smokin', early benchmarks seem to confirm those claims. I don't find this surprising: Apple has been highly focused on performance ever since Tiger, and as far as I can tell hasn't let up since.

One maybe somewhat surprising aspect of the M1s is the limitation to "only" 16 Gigabytes of memory. As someone who bought a 16 Kilobyte language card to run the Merlin 6502 assembler on his Apple ][+ and expanded his NeXT cube, which isn't that different from a modern Mac, to a whopping 16 Megabytes, this doesn't actually seem that much of a limitation, but it did cause a bit of consternation.

I have a bit of a theory as to how this "limitation" might tie in to how Apple's outside-the-box approach to memory and performance has contributed to the remarkable achievement that is the M1.

The M1 is apparently a multi-die package that contains both the actual processor die and the DRAM. As such, it has a very high-speed interface between the DRAM and the processors. This high-speed interface, in addition to the absolutely humongous caches, is key to keeping the various functional units fed. Memory bandwidth and latency are probably the determining factors for many of today's workloads, with a single access to main memory taking easily hundreds of clock cycles and the CPU capable of doing a good number of operations in each of these clock cycles. As Andrew Black wrote: "[..] computation is essentially free, because it happens 'in the cracks' between data fetch and data store; ..".
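To put rough numbers on "hundreds of clock cycles", here is a back-of-the-envelope sketch. The figures (300 cycles per main-memory access, 4 operations per cycle) are assumptions picked for illustration, not measurements of the M1, and the model pretends the CPU stalls completely on every miss, which real out-of-order cores partly avoid.

```python
def ops_lost_per_miss(latency_cycles, ops_per_cycle):
    """Operation slots that go unused while one main-memory access is outstanding."""
    return latency_cycles * ops_per_cycle


def compute_fraction(ops_between_misses, latency_cycles, ops_per_cycle):
    """Fraction of all cycles spent computing, assuming a full stall on each miss."""
    compute_cycles = ops_between_misses / ops_per_cycle
    return compute_cycles / (compute_cycles + latency_cycles)


# Assumed, illustrative figures: 300-cycle miss latency, 4 operations per cycle.
print(ops_lost_per_miss(300, 4))       # 1200
print(compute_fraction(1200, 300, 4))  # 0.5
```

Under these assumptions a program that manages 1200 operations between misses still spends half its time waiting on memory, which is the sense in which computation happens "in the cracks".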

The tradeoff is that you can only fit so much DRAM in that package for now, but if it fits, it's going to be super fast.

So how do we make sure it all fits? Well, where Apple might have been "focused" on performance for the last 15 years or so, they have been completely anal about memory consumption. When I was there, we were fixing 32 byte memory leaks. Leaks that happened once. So not an ongoing consumption of 32 bytes again and again, but a one-time leak of 32 bytes.

That dedication verging on the obsessive is one of the reasons iPhones have been besting top-of-the-line Android phones that have twice the memory. And not by a little, either.

Another reason is the iOS team's steadfast refusal to adopt tracing garbage collection as most of the rest of the industry did, and macOS's later abandonment of that technology in favor of the reference counting (RC) they've been using since NeXTStep 4.0. With increased automation of those reference counting operations and the addition of weak references, the convenience level for developers is essentially indistinguishable from a tracing GC now.
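For readers who haven't worked with weak references, a minimal sketch of the pattern follows. It is written in Python rather than Objective-C or Swift, only because CPython also happens to use reference counting; the `Node` class and its methods are hypothetical names for illustration. The tracing cycle collector is switched off during the demo so that only reference counting is at work.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []   # strong references: a parent owns its children
        self._parent = None  # weak back-reference, so no retain cycle forms

    @property
    def parent(self):
        # A weak reference yields None once its target has been freed.
        return self._parent() if self._parent is not None else None

    def add_child(self, child):
        self.children.append(child)
        child._parent = weakref.ref(self)


def demo():
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the tracing cycle collector
    try:
        root = Node("root")
        leaf = Node("leaf")
        root.add_child(leaf)
        probe = weakref.ref(root)
        linked_before = leaf.parent is root
        del root  # last strong reference goes away
        freed_immediately = probe() is None
        return linked_before, freed_immediately, leaf.parent
    finally:
        if was_enabled:
            gc.enable()


print(demo())  # (True, True, None)
```

The parent is reclaimed the moment its last strong reference disappears, and the child's back-reference quietly becomes empty instead of dangling.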

The benefit of sticking to RC is much-reduced memory consumption. It turns out that for a tracing GC to achieve performance comparable with manual allocation, it needs several times the memory (different studies find different overheads, but at least 4x is a conservative lower bound). While I haven't seen a study comparing RC, my personal experience is that the overhead is much lower, much more predictable, and can usually be driven down with little additional effort if needed.

So Apple can afford to live with more "limited" total memory because they need much less memory for the system to be fast. And so they can do a system design that imposes this limitation, but allows them to make that memory wicked fast. Nice.
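As a rough worked example of that argument, the sketch below divides physical memory by an overhead factor. The 4x factor is the lower bound quoted above; the 1x row is simply the manual-allocation baseline. No factor for RC is given here, so none is assumed, and the function name is made up for illustration.

```python
def live_data_budget_gb(physical_gb, overhead_factor):
    """Live data that fits if the memory manager needs overhead_factor times its size."""
    return physical_gb / overhead_factor


# 16 GB machine, tracing GC needing at least 4x to match manual allocation.
print(live_data_budget_gb(16, 4))  # 4.0
# Same machine with no memory-management overhead (manual baseline).
print(live_data_budget_gb(16, 1))  # 16.0
```

Seen this way, a 16 GB machine running a tracing GC at that overhead holds about as much live data as a 4 GB machine with none.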

Another "well-known" limitation of RC that has made it the second choice compared to tracing GC is the fact that updating those reference counts all the time is expensive, particularly in a multi-threaded environment where those updates need to be atomic. Well...

How? Problem solved. I guess it helps if you can make your own Silicon ;-)

So Apple's focus on keeping memory consumption under control, which includes but is not limited to going all-in on reference counting where pretty much the rest of the industry has adopted tracing garbage collection, is now paying off in a major way ("bigly"? Too soon?). They can get away with putting less memory in the system, which makes it possible to make that memory really fast. And that locks in an advantage that'll be hard to duplicate.

It also means that native development will have a bigger advantage compared to web technologies, because native apps benefit from the speed and don't have a problem with the memory limitations, whereas web-/electron apps will fill up that memory much more quickly.

6 comments:

Anonymous said...

"With increased automation of those reference counting operations and the addition of weak references, the convenience level for developers is essentially indistinguishable from a tracing GC now."

I cannot imagine what would lead someone to make such a claim. I've been writing Swift and Objective-C since the MRC days, and also various languages which have a full tracing GC. Apple's platforms have made it much easier to use than it used to be, but in the convenience department, it's never going to be a match for GC.

Defaults matter, and GC's default is "it just works". With a GC, you might have to spend some time tuning it, but I've never had to spend time tracking down a memory-related crash. That's not true of any real-world Cocoa program I've worked on. (Swift is so complex that for a while the compiler would happily generate a double-free for you in some cases!) Apple's own apps still crash pretty regularly.

Minimizing memory usage is great for the hardware vendor and for the user, but it sucks for the developer. These are not just theoretical concerns. I see people online struggling with [weak self] on trivial closures every day, and no other platform has this problem. ARC might be a worthwhile tradeoff, but it's definitely a tradeoff.

You may as well try to convince me that GCD is as good as Erlang for concurrency. They're both "not plain C threads", but that's where the similarities end. Nobody would ever mistake one for the other.

Marcel Weiher said...

@Anonymous

You seem to have a very different experience from mine, and from many if not most developers. I also have written production code in various full GC, RC and fully manual languages. The big difference is the step up from fully manual, because object lifetime is a global problem and you only have local information. Even "semi-manual" RC, incorrectly called MRC, solves that problem, and after that you're talking about very slight differences.

You annotate your property as "strong" (or use an objectAccessor() ) and you're done in terms of crashes. Then you go back and add "weak" to back-references and you've taken care of leaks, and quite frankly the amount of leakage tends to be less than GC overhead. Not sure how you're managing to get crashes attributable to memory at that point.

Swift falling over its own feet is a different matter, I would say.

GCD is actually worse than threads in most cases, the whole idea of using closures for concurrency is just nuts and using them for "callbacks" is also a fairly stupid idea, for a bunch of reasons. So yes, if you're using extremely closure-heavy code, then that might be a problem, but it won't be your only problem. Closure-heavy code is essentially non-comprehensible and non-debuggable.

And notice I wrote "the convenience level" is essentially indistinguishable, not that the overall experience is indistinguishable. What you casually dismiss as "might have to spend some time tuning it" can be a never-ending nightmare, which in the end you cannot actually solve. See the high-perf Java programs that have to roll their own memory management off the Java heap.

Compared to having to implement your own completely custom memory management, having to think for a second whether "strong" or "weak" is correct seems rather easier.



Anonymous said...

Anon., it seems the benefits you mention derive from automatic memory allocation, and they would therefore be there regardless of whether that automatic memory allocation is implemented using tracing garbage collection or reference counting.

I think reference counting was long considered too slow compared to tracing gc. But, reclaiming aggressively, it (generally) uses substantially less memory. It also does not stop unpredictably, pausing execution for an indeterminate time. With changes in the relative latencies of the memory hierarchy, reference counting could now outperform gc.

Weiher, annotating "strong" etc. is probably "essentially" impossible to get wrong, and you would never, I am sure, being a good, very thorough programmer ;) But if it is possible to get it wrong, someone (bad programmer!) is bound to. Also, you want to have as few chores of annotating this and that as possible when coding. Freeing up mental space is very important for productivity. I agree with Anon that automatic memory management should be used except where runtime performance dictates otherwise.

Mike Conte said...

Interesting to read this from 2013

https://sealedabstract.com/rants/why-mobile-web-apps-are-slow/index.html

Billy said...

One of the most loved and fastest growing programming languages outside of Apple is Rust, which also uses retain/release by default instead of a GC. That language also compiles to WebAssembly, so I wouldn't count the web out just yet in terms of performance.

Jon Harrop said...

"The benefit of sticking to RC is much-reduced memory consumption. It turns out that for a tracing GC to achieve performance comparable with manual allocation, it needs several times the memory (different studies find different overheads, but at least 4x is a conservative lower bound)."

All of the evidence I have seen shows precisely the opposite, e.g.: https://programming-language-benchmarks.vercel.app/ocaml-vs-swift