Tuesday, September 15, 2020

Pointers are Easy, Optimization is Complicated

Just recently came across Ralf Jung's 2018 post titled Pointers are Complicated. The central thesis is that the model that most C (and assembly language) programmers have that a pointer is just an integer that happens to be a machine address is wrong, in fact the author flat out states: "Pointers are definitely not integers."

That's a strong statement. I like strong statements, because they make a discussion possible. So let's respond in kind: the claim that pointers are definitely not integers is wrong.

The example

The example the author uses to show that pointers are definitely not integers is the following:
int test() {
    auto x = new int[8];
    auto y = new int[8];
    y[0] = 42;
    int i = /* some side-effect-free computation */;
    auto x_ptr = &x[i];
    *x_ptr = 23;
    return y[0];
}

And this is the crux of the reasoning:
It would be beneficial to be able to optimize the final read of y[0] to just return 42. The justification for this optimization is that writing to x_ptr, which points into x, cannot change y.
So pointers are "hard" and "not integers" because they conflict with this optimization that "would be beneficial".

I find this fascinating: a "nice to have" optimzation is so obviously more important than a simple and obvious pointer model that it doesn't even need to be explained as a possible tradeoff, never mind justified as to why the tradeoff is resolved in favor of the nice-to-have optimization.

I prefer the simple and obvious pointer model. Vastly.

This way of placing the optimizer's concerns far ahead of the programmer's is not unique, if you check out Chris Lattner's What Every C Programmer Should Know About Undefined Behavior, you will note the frequent occurrence of the phrase "enables ... optimizations". It's pretty much the only justification ever given.

I call this now industry-dominating style of programming Compiler Optimizer Creator Oriented Programming (COCOP). It was thoroughly critiqued in What every compiler writer should know about programmers or “Optimization” based on undefined behaviour hurts performance (pdf).

Pointers as Integers

There are certainly machines where pointers are not integers, the most prominent being 8086/80286 16 bit segmented mode, where a (far) pointer consists of a segment and an offset. On 8086, the segment is simply shifted left 4 bits and added to the offset, on 80286 the segment can be located anywhere in memory or not be resident, implementing a segmented virtual memory. AFAIK, these modes are simplified variants of the iAPX 432 object memory model.

What's important to note in this context is that the iAPX 432 and its memory model failed horribly, and industry actively and happily moved away from the x86 segmented model to what is called a "flat address space", common on other architectures and finally also adopted by Intel with the 386.

The salient feature of a "flat address space" is that a pointer is an integer, and in fact this eqivalence is also rather influential on CPU architecture, with address-space almost universally tied to the CPU's integer size. So although the 68K was billed as a 16 bit CPU (or 16/32), its registers were actually 32 bits, and IIRC its address ALUs were fully 32 bit, so if you wanted to do some kinds of 32 bit aritmetic, the LEA (Load Effective Address) instruction was your friend. The reason for the segmented architecturee on the 8086 was that it was a true 16 bit machine, with 16 bit registers, but Intel wanted to have a 20 bit address space.

So not only was and is there an equivalence of pointers and integers, this state of affairs was one that was actively sought and joyously received once we achieved it again. Giving it up for nice-to-have optimizations seems at best debatable, but at the very least it is something that should be discussed/debated, rather than simply assumed away.

1 comment:

Ralf said...

Author of the blog post you cite here. I noticed your post in the referrers when looking at my web server log files. ;)

You make a totally fair point! It would be very interesting to have a language that actually makes pointers simple, and then sticks with it in terms of the optimizations it performs. In my blog post I was describing the reality of LLVM and other C/C++ compiler backends today, which is why I assumed these optimizations as an immutable fact -- I consider the probability that we can change these existing backends to not do any of the provenance-based optimizations to be basically 0. So if I as a programming language researcher want to help make LLVM better and more correct, my only avenue is to find a way to do that while keeping the optimizations. I cannot just ignore reality and imagine the world would be different... or, well, I can, but then my research will be much less relevant to real compilers. Which is totally fine, it is just not what I was going for.

It's not just compilers though that make pointers complicated, it is also the C standard. You will likely respond that this is because compiler writers write the C standard, and I am not disagreeing, I am just pointing out, that in C, *normatively*, pointers are complicated the way I describe.

My hypothesis is that a language like you want it, a language where pointers are simple integers, is just not competitive. Any sizable codebase compiled in such a language will be so slow that nobody wants to use it -- you might as well use Java or OCaml or so which will likely have better performance and where pointers/references are simple again (they are not integers, but they are simple). But we won't know this for sure until someone actually does the experiment.

Lastly, let me add that I fully agree that every time a language designer adds UB, they should consider not just the optimizations this enables but also the programmer that has to make their program not be UB. C goes way too far by having UB in totally avoidable situations. I hope that for Rust, we can find a reasonable trade-off between compilers still being able to optimize code, and programmers being able to actually write correct code. Getting there does not just involve language design, it also involves tools like Miri (https://github.com/rust-lang/miri/).

Kind regards,
Ralf