Wednesday, August 26, 2015

What Happens to OO When Processors Are Free?

A while ago, I presented as a crazy thought experiment the idea of using Montecito's transistor budget to create a chip with tens of thousands of ARM cores. Well, it seems the idea wasn't so crazy after all: the SpiNNaker project is trying to build a system with a million ARM CPUs, and it is designing a custom chip with lots of ARM cores on it.

Of course, they only have 1/6th the die area of the Montecito and are using a conservative 130nm process rather than the 90nm of the Montecito or the 14nm that is state of the art, so they have a much lower transistor budget. They also use the later ARM 9 core and add 54 SRAM banks with 32KB each (from the die picture, 3 per core), so in the end they "only" put 18 cores on the chip, rather than many thousands. Using a state-of-the-art 14nm process would mean roughly 100 times more transistors, and a Montecito-sized die another factor of six. At that point, we would be at roughly 10000 cores per chip, rather than 18.

One of the many interesting features of the SpiNNaker project is that "the micro-architecture assumes that processors are ‘free’: the real cost of computing is energy." This has interesting consequences for potentially simplifying object- or actor-oriented programming. Alan Kay's original idea of objects was to scale down the concept of "computer", so every object is essentially a self-contained computer with CPU and storage, communicating with its peers via messages. (Erlang is probably the closest implementation of this concept).

In our core-scarce computing environments, this had to be simulated by multiplexing all (or most) of the objects onto a single von Neumann computer, usually with a shared address space. If cores are free and we have them in the tens of thousands, we can start entertaining the idea of no longer simulating object-oriented computing, but rather of implementing it directly by giving each object its own core and attached memory. Yes, utilization of these cores would probably be abysmal, but with free cores low utilization doesn't matter, and low utilization (hopefully) means low power consumption.
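To make the contrast concrete, here is a minimal Python sketch of the "one core per object" model, with a dedicated thread standing in for the core. The names `CoreObject`, `Counter`, `send` and the `handle_` prefix are made up for this illustration and do not come from SpiNNaker or any actor library. Each object keeps its state private and is reachable only through its mailbox.

```python
import queue
import threading


class CoreObject:
    """An object with its own worker thread (a stand-in for a dedicated core)."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, selector, *args):
        """Post a message; returns a one-slot queue that will hold the reply."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((selector, args, reply))
        return reply

    def stop(self):
        """Ask the worker to finish and wait for it."""
        self._mailbox.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()  # blocks while idle: no busy-waiting
            if message is None:
                break
            selector, args, reply = message
            handler = getattr(self, "handle_" + selector)
            reply.put(handler(*args))


class Counter(CoreObject):
    """Example object: its count is only ever touched by its own worker."""

    def __init__(self):
        self._count = 0       # private state, set before the worker starts
        super().__init__()

    def handle_add(self, amount):
        self._count += amount
        return self._count


if __name__ == "__main__":
    counter = Counter()
    counter.send("add", 2)
    print(counter.send("add", 3).get())
    counter.stop()
```

On today's hardware the operating system still multiplexes these threads onto a few cores, so this only simulates the model. The point is the shape of the program: no shared address space is needed, because messages are processed one at a time by the object's own worker.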

Even at 1% utilization, 10000 cores would still mean throughput equivalent to 100 ARM 9 cores running full tilt, and I am guessing pretty low power consumption if the transistors not being used are actually off. More important than 100 core-equivalents running is probably the equivalent of 100 bus interfaces running at full tilt. The aggregate on-chip memory bandwidth would be staggering.
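The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a sketch that uses the rounded factors from the text (18 cores, roughly 100 times more transistors, a die six times larger, 1% utilization) and assumes that core count scales linearly with transistor budget; the function names are made up for this illustration.

```python
def cores_per_chip(base_cores, process_gain, die_area_gain):
    """Scale a core count by the gain in transistor budget (assumed linear)."""
    return base_cores * process_gain * die_area_gain


def busy_core_equivalents(cores, utilization):
    """Number of fully busy cores that would deliver the same throughput."""
    return cores * utilization


# SpiNNaker chip as described: 18 cores sharing 54 SRAM banks of 32KB each.
memory_per_core_kb = 54 * 32 // 18                       # 96KB per core

scaled = cores_per_chip(18, 100, 6)                       # 18 * 100 * 6 = 10800
equivalents = busy_core_equivalents(10_000, 0.01)         # about 100

if __name__ == "__main__":
    print(memory_per_core_kb, scaled, round(equivalents))
```

The estimate of 10800 is what the text rounds down to 10000 cores per chip.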

You could probably also run the whole thing at lower clock frequencies, further reducing power. With each object having around 96KB of private memory to itself, we would probably be looking at coarser-grained objects, with pure data being passed between the objects (Objective-C or Erlang style) and possibly APL-like array extensions (see OOPAL). Overall, that would lead to a de-emphasis of expression-oriented programming models and a more architectural focus.

This sort of idea isn't new: the Transputer got there in the late '80s. But it was conceived when Moore's law increased not just transistor counts but also clock frequencies, so Intel could always bulldoze away more intelligent architectures with better fabs. That has stopped: clock frequencies have been stagnant for a while, and even geometries are starting to stutter. So maybe now the time for intelligent CPU architectures has finally come, and with it the impetus for examining our assumptions about programming models.

As always, comments welcome here or on Hacker News.

UPDATE: The kilo-cores are here:

  • Kilocore: 1000 processors, 1.78 Trillion ops/sec, and at 1.78pJ/Op super power-efficient, so at 150 GOps/s only uses 0.7 watts. On a 32nm process, so not yet maxed out.
  • GRVI Phalanx joins The Kilocore Club: 1680 cores.
No reports of any of them running actors, but ensembles might work :-)

1 comment:

Unknown said...

It's not that simple, naturally. Computation costs power, of course, but so does communication from object to object. And so does an idle processor (transistors leak), unless you've powered it down, and naturally simply powering them up and down costs time (a lot) and power. There is no free lunch here. Still, 1 processor per object is an interesting idea. There are obvious benefits (no communication costs to object memory, though powering those down would lose their state; time for non-volatile storage) and probably far more hidden costs than a simple analysis will reveal.