Tuesday, August 9, 2022

Native-GUI distributed system in a tweet

If I've been a bit quiet recently it's not due to lack of progress, but rather the very opposite: so much progress in Objective-S land hat my head is spinning and I am having a hard time both processing it all and seeing where it goes.

But sometimes you need to pause, reflect, and show your work, in whatever intermediate state it currently is. So without further ado, here is the distributed system, with GUI, in a tweet:

It pops up a window with a text field, and stores whatever the user enters in an S3 bucket. It continues to do this until the user closes the window, at which point the program exits.

Of course, it's not much of a distributed system, particularly because it doesn't actually include the code for the S3 simulator.

Anyway, despite fitting in a tweet, the Objective-S script is actually not code golf, although it may appear as such to someone not familiar with Objective-S.

Instead, it is a straightforward definition and composition of the elements required:

  1. A storage combinator for interacting with data in S3.
  2. A text field inside a window, defined as object literals.
  3. A connection between the text field and a specific S3 bucket.
That's it, and it is no coincidence that the structure of the system maps directly onto the structure of the code. Let's look at the parts in detail.

S3 via Storage Combinator

The first line of the script sets up an S3 scheme handler so we can interact with the S3 buckets almost as if they were local variables. For example the following assignment statement stores the text 'Hello World!' in the "msg.txt" file of "bucket1":

   s3:bucket1/msg.txt ← 'Hello World!'
Retrieving it works similarly:

   stdout println: s3:bucket1/msg.txt
The URL of our S3 simulator is http://defiant.local:2345/, so running on host defiant in the local network, addressed by Bonjour and listening on port 2345. As Objective-S supports Polymorphic Identifiers (pdf), this URL is a directly evaluable identifier in the language. Alas, that directness poses a problem, because writing down an identifier in most programming languages yields the value of the variable the identifier identifies, and Objective-S is no exception. In the case of http://defiant.local:2345/, that value is the directory listing of the root of the S3 server, encoded as the following XML response:

<?xml version="1.0" encoding="UTF-8"?>
<ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Owner><ID>123</ID><DisplayName>FakeS3</DisplayName></Owner>
<Buckets>
<Bucket>
<Name>bucket1</Name>
<CreationDate>2022-08-10T15:18:32.000Z</CreationDate>
</Bucket>
</Buckets>
</ListAllMyBucketsResult>
That's not really what we want, we want to refer to the URL itself. The ref: allows us to do this by preventing evaluation and thus returning the reference itself, very similar to the & operator that creates pointers in C.

Except that an Objective-S reference (or more precisely, a binding) is much richer than a C pointer. One of its many capabilities is that it can be turned into a store by sending it the -asScheme message. This new store uses the reference it was created from as its base URL, all the references it receives are evaluated relative to this base reference.

The upshot is that with the s3: scheme handler defined and installed as described, the expression s3:bucket1/msg.txt evaluates to http://defiant.local:2345/bucket1/msg.txt.

This way of defining shorthands has proven extremely useful for making complex references usable and modular, and is an extremely common pattern in Objective-S code.

Declarative GUI with object literals

Next, we need to define the GUI: a window with a text field. With object literals, this is pretty trivial. Object literals are similar to dictionary literals, except that you get to define the class of the instance defined by the key/value pairs, instead of it always being a dictionary.

For example, the following literal defines a text field with certain dimensions and assigns it to the text local variable:

   text ← #NSTextField{ #stringValue:'',#frame:(10@45 extent:180@24) }.
And a window that contains the text field we just defined:
   window ← #NSWindow{ #frame:(300@300 extent:200@105),#title:'S3', #views:#[text]}.
It would have been nice to define the text field inline in its window definition, but we currently still need a variable so we can connect the text field (see next section).

Connecting components

Now that we have a text field (in a window) and somewhere to store the data, we need to connect these two components. Typically, this would involve defining some procedure(s), callback(s) or some extra-linguistics mechanism to mediate or define that connection. In Objective-S, we just connect the components:

   text → ref:s3:bucket1/msg.txt.
That's it.

The right-arrow "→" is a polymorphic connection "operator". The complete connection is actually significantly more complex:

  1. From a port of the source component
  2. To a role of the mediating connector compatible with that source port
  3. To a role of the mediating connector compatible with the target object's port
  4. To that compatible port of the target component
If you want, you can actually specify all these intermediate steps, but most of the time you don't have to, as the machinery can figure out what ports and roles are compatible. In this case, even the actual connector was determined automatically.

If we didn't want a remote S3 bucket, we could also have stored the data in a local file, for example:

   text → ref:file:/tmp/msg.txt.
That treats the file like a variable, replacing the entire contents of the file with the text that was entered. Speaking of variables, we could of course also store the text in a local variable:

   text → ref:var:message.
In our simple example that doesn't make a lot of sense because the variable isn't visible anywhere and will disappear once the script terminates, but in a larger application it could then trigger further processing.

Alternatively, we could also append the individual messages to a stream, for example to stdout:

   text → stdout.
So every time the user hits return in the text field, the content of the text field is written to the console. Or appended to a file, by connecting to the stream associated with the file rather the file reference itself:
   text → ref:file:/tmp/msg.txt outputStream.
This doesn't have to be a single stream sink, it can be a complex processing pipeline.

I hope this makes it clear, or at least strongly hints, that this is not the usual low-code/no-code trick of achieving compact code by creating super-specialised components and mechanisms that work well for a specific application, but immediately break down when pushed beyond the demo.

What it is instead is a new way of creating components, defining their interfaces and then gluing them together in a very straightforward fashion.

Eval/apply vs. connect and run

Having constructed our system by configuring and connecting components, what's left is running it. CLIApp is a subclass of NSApplication that knows how to run without an associated app wrapper or Info.plist file. It is actually instantiated by the stui script runner before the script is started, with the instance dropped into the app variable for the script.

This is where we leave our brave new world of connected components and return (or connect with) the call/return world, similar to the way Cocoa's auto-generated main with call to NSApplicationMain() works.

The difference between eval/apply (call/return) and connect/run is actually quite profound, but more on that in another post.

Of course, we didn't leave call/return behind, it is still present and useful for certain tasks, such as transforming an element into something slightly different. However, for constructing systems, having components that can be defined, configured and connected directly ("declaratively") is far superior to doing so procedurally, even than the fluent APIs that have recently popped up and that have been mislabeled as "declarative".

This project is turning out even better than I expected. I am stoked.

Monday, June 20, 2022

Blackbird: A reference architecture for local-first connected mobile apps

Wow, what a mouthful! Although this architecture has featured in a number of my other writings, I haven't really described it in detail by itself. Which is a shame, because I think it works really well and is quite simple, a case of Sophisticated Simplicity.

Why a reference architecture?

The motivation for creating and now presenting this reference architecture is that the way we build connected mobile apps is broken, and none of the proposed solutions appear to help. How are they broken? They are overly complex, require way too much code, perform poorly and are unreliable.

Very broadly speaking, these problems can be traced to the misuse of procedural abstraction for a problem-space that is broadly state-based, and can be solved by adapting a state-based architectural style such as in-process REST and combining it with well-known styles such as MVC.

More specifically, MVC has been misapplied by combining UI updates with the model updates, a practice that becomes especially egregious with asynchronous call-backs. In addition, data is pushed to the UI, rather than having the UI pull data when and as needed. Asynchronous code is modelled using call/return and call-backs, leading to call-back hell, needless and arduous transformation of any dependent code into asynchronous code (see "what color is your function") that is also much harder to read, discouraging appropriate abstractions.

Backend communication is also an issue, with newer async/await implementations not really being much of an improvement over callback-based ones, and arguably worse in terms of actual readability. (They seem readable, but what actually happens is different  enough that the simplicity is deceptive).

Overview

The overall architecture has four fundamental components:
  1. The model
  2. The UI
  3. The backend
  4. The persistence
The main objective of the architecture is to keep these components in sync with each other, so the whole thing somewhat resembles a control loop architecture: something disturbs the system, for example the user did something in the UI, and the system responds by re-establishing equilibrium.

The model is the central component, it connects/coordinates all the pieces and is also the only one directly connected to more than one piece. In keeping with hexagonal architecture, the model is also supposed to be the only place with significant logic, the remainder of the system should be as minimal, transparent and dumb as possible.

memory-model := persistence.
persistence  |= memory-model.
ui          =|= memory-model. 
backend     =|= memory-model.

Graphically:

Elements

Blackbird depends crucially on a number of architectural elements: first are stores of the in-process REST architectural style. These can be thought of as in-process HTTP servers (without the HTTP, of course) or composable dictionaries. The core store protocol implements the GET, PUT and DELETE verbs as messages.

The role of URLs in REST is taken by Polymorphic Identifiers. These are objects that can reference identify values in the store, but are not direct pointers. For example, they need to be a able to reference objects that aren't there yet.

Polymorphic Identifiers can be application-specific, for example they might consist just of a numeric id,

MVC

For me, the key part of the MVC architectural style is the decoupling of input processing and resultant output processing. That is, under MVC, the view (or a controller) make some change to the model and then processing stops. At some undefined later time (could be synchronous, but does not have to be) the Model informs the UI that it has changed using some kind of notification mechanism.

In Smalltalk MVC, this is a dependents list maintained in the model that interested views register with. All these views are then sent a #changed message when the model has changed. In Cocoa, this can be accomplished using NSNotificationCenter, but really any kind of broadcast mechanism will do.

It is then the views' responsibility to update themselves by interrogating the model.

For views, Cocoa largely automates this: on receipt of the notification, the view just needs invalidate itself, the system then automatically schedules it for redrawing the next time through the event loop.

The reason the decoupling is important to maintain is that the update notification can come for any other reason, including a different user interaction, a backend request completing or even some sort of notification or push event coming in remotely.

With the decoupled M-V update mechanism, all these different kinds of events are handled identically, and thus the UI only ever needs to deal with the local model. The UI is therefore almost entirely decoupled from network communications, we thus have a local-first application that is also largely testable locally.

Blackbird refines the MVC view update mechanism by adding the polymorphic identifier of the modified item in question and placing those PIs in a queue. The queue decouples model and view even more than in the basic MVC model, for example it become fairly trivial to make the queue writable from any thread, but empty only onto the main thread for view updates. In addition, providing update notifications is no longer synchronous, the updater just writes an entry into the queue and can then continue, it doesn't wait for the UI to finish its update.

Decoupling via a queue in this way is almost sufficient for making sure that high-speed model updates don't overwhelm the UI or slow down the model. Both these performance problems are fairly rampant, as an example of the first, the Microsoft Office installer saturates both CPUs on a dual core machine just painting its progress bar, because it massively overdraws.

An example of the second was one of the real performance puzzlers of my career: an installer that was extremely slow, despite both CPU and disk being mostly idle. The problem turned out to be that the developers of that installer not only insisted on displaying every single file name the installer was writing (bad enough), but also flushing the window to screen to make sure the user got a chance to see it (worse). This then interacted with a behavior of Apple's CoreGraphics, which disallows screen flushes at a rate greater than the screen refresh rate, and will simply throttle such requests. You really want to decouple your UI from your model updates and let the UI process updates at its pace.

Having polymorphic identifiers in the queue makes it possible for the UI to catch up on its own terms, and also to remove updates that are no longer relevant, for example discarding duplicate updates of the same element.

The polymorphic identifier can also be used by views in order to determine whether they need to update themselves, by matching against the polymorphic identifier they are currently handling.

Backend communication

Almost every REST backend communication code I have seen in mobile applications has created "convenient" cover methods for every operation of every endpoint accessed by the application, possibly automatically generated.

This ignores the fact that REST only has a few verbs, combined with a great number of identifiers (URLs). In Blackbird, there is a single channel for backend communication: a queue that takes a polymorphic identifier and an http verb. The polymorphic identifier is translated to a URL of the target backend system, the resulting request executed and when the result returns it is placed in the central store using the provided polymorphic identifier.

After the item has been stored, an MVC notification with the polymorphic identifier in question is enqueued as per above.

The queue for backend operations is essentially the same one we described for model-view communication above, for example also with the ability to deduplicate requests correctly so only the final version of an object gets sent if there are multiple updates. The remainder of the processing is performed in pipes-and-filters architectural style using polymorphic write streams.

If the backend needs to communicate with the client, it can send URLs via a socket or other mechanism that tells the client to pull that data via its normal request channels, implementing the same pull-constraint as in the rest of the system.

One aspect of this part of the architecture is that backend requests are reified and explicit, rather than implicitly encoded on the call-stack and its potentially asynchronous continuations. This means it is straightforward for the UI to give the user appropriate feedback for communication failures on the slow or disrupted network connections that are the norm on mobile networks, as well as avoid accidental duplicate requests.

Despite this extra visibility and introspection, the code required to implement backend communications is drastically reduced. Last not least, the code is isolated: network code can operate independently of the UI just as well as the UI can operate independently of the network code.

Persistence

Persistence is handled by stacked stores (storage combinators).

The application is hooked up to the top of the storage stack, the CachingStore, which looks to the application exactly like the DictStore (an in-memory store). If a read request cannot be found in the cache, the data is instead read from disk, converted from JSON by a mapping store.

For testing the rest of the app (rather than the storage stack), it is perfectly fine to just use the in-memory store instead of the disk store, as it has the same interface and behaves the same, except being faster and non-persistent.

Writes use the same asynchronous queues as the rest of the system, with the writer getting the polymorphic identifiers of objects to write and then retrieving the relevant object(s) from the in-memory store before persisting. Since they use the same mechanism, they also benefit from the same uniquing properties, so when the I/O subsystem gets overloaded it will adapt by dropping redundant writes.

Consequences

With the Blackbird reference architecture, we not only replace complex, bulky code with much less and much simpler code, we also get to reuse that same code in all parts of the system while making the pieces of the system highly independent of each other and optimising performance.

In addition, the combination of REST-like stores that can be composed with constraint- and event-based communication patterns makes the architecture highly decoupled. In essence it allows the kind of decoupling we see in well-implemented microservices architectures, but on mobile apps without having to run multiple processes (which is often not allowed).

Thursday, July 29, 2021

Glue Code is the Success Condition

My previous post titled Glue: the Dark Matter of Software may have given the impression that I see glue code as exclusively a problem. And I have to admit that my follow-up (and reaction to Github's copilot) called Don't Generate Glue...Exterminate may not have done much to dissuade anyone of that impression, but I just couldn't resist the Dalek reference.

However, I think it is important to remember that the fact that we have so much glue is a symptom of one of our biggest successes in software technology. Even as recently as the late 80s and early 90s, we just didn't have all that much to glue together, and software reuse was the holy grail, the unobtainium of computing, both in its desirability and unobtainability.

Now we have reuse. Boy do we have reuse! We have so much reuse that we need tool support to manage all the reuse. As far as I can tell, all new programming languages now come with such tooling, and are considered incomplete until they have it.

The price of success is having a new set of problems, problems you never dreamed of before.

So how will we solve these problems?

Data format adaptation, as suggested by the O'Reilly article? Yes. Model-driven approaches that allow us or our tools and languages to generate a lot of the more obvious adapter code? Sounds good, why not?

This one neat trick (click here!) that will automatically solve all these problems? No.

Simpler components, written with composability and minimization of dependencies in mind? Surely. Education, so developers get better at writing code that composes well without turning into architecture astronauts? Very much yes.

However, my contention is that developers have a hard time with this in large part because our languages only support implementing such glue, which is a start, but do not support expressing it directly, or abstracting over it, encapsulating it, playing with it. So new linguistic mechanisms like Objective-S are needed to help developers write better and thus less glue code so we can better enjoy the fruits of our reusability success.

Sunday, July 25, 2021

Deleting Code to Double the Performance of my Trivial Objective-S Tasks Backend

About two months ago, I showed a trivial tasks backend for a hypothetical ToDoMVC app. At the time, I noted that the performance was pretty insane for something written in a (slow) scripting language: 7K requests per second when fetching a single task.

That was using an encoder method (that writes key/value pairs to the JSON encoder) written in Objective-S, and I wondered how much faster it would go if that was no longer the case. Twice as fast, it turns out.

Yesterday, I wrote about tuning the Objective-S's SQLite insert performance to around 130M rows/minute, coincidentally also for a simple tasks schema. One part of that performance story was the fact that the encoder method (writing key/value pairs to the SQLite encoder) was generated by pasting together Objective-C blocks and installing the whole thing as an Objective-C method. No interpretation, except for calling a series of blocks stored in an NSArray. I had completely forgotten about the hand-written Objective-S encoder method in the back-end's Task class! Since generation is automatic, but won't override an already existing method, all I had to do in order to get the better performance was delete the old method.


> wrk -c 1 -t 1 http://localhost:8082/tasks 
Running 10s test @ http://localhost:8082/tasks
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    66.60us    9.69us   1.08ms   96.72%
    Req/Sec    14.95k   405.55    15.18k    98.02%
  150275 requests in 10.10s, 30.67MB read
Requests/sec:  14879.22
Transfer/sec:      3.04MB
> curl  http://localhost:8082/tasks         
[{"id":1,"done":0,"title":"Clean Room"},{"id":2,"done":1,"title":"Check Twitter"}]%

More than twice the performance, and that while fetching two tasks instead of just one, so around 30K tasks/second! (And yes, I checked that I wasn't hitting a 404...).

So what's the performance if we actually fetch more than a minimal number of tasks? For 128 tasks, 64x more than before, it's still around 9K requests/s, so most of the time so far was per-request overhead. At this point we are serving a little over 1M tasks/s:


> wrk -c 1 -t 1 'http://localhost:8082/tasks/' 
Running 10s test @ http://localhost:8082/tasks/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   112.13us   76.17us   5.57ms   99.63%
    Req/Sec     9.05k   397.99     9.21k    97.03%
  90923 requests in 10.10s, 483.41MB read
Requests/sec:   9002.44
Transfer/sec:     47.86MB

If memory serves, that was around the rate we were seeing with the Wunderlist backend when we had a couple of million users, not that these are comparable in any meaningful way. For 1024 tasks there's a significant drop to slightly above 1.8K requests/s, with the task-rate almost doubling to 1.8M/s:


> wrk -c 1 -t 1 'http://localhost:8082/tasks/' 
Running 10s test @ http://localhost:8082/tasks/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   552.06us   62.77us   1.84ms   81.08%
    Req/Sec     1.82k    52.95     1.89k    90.10%
  18267 requests in 10.10s, 778.36MB read
Requests/sec:   1808.59
Transfer/sec:     77.06MB

UPDATE:
Of course, those larger request sizes also see a much larger increase in performance than 2x. With the old code, the 128 task case clocks in at 147 requests/s and the 1024 task case at 18 requests/s, at which point it's a 100x improvement. So gives you an idea just how slow my Objective-S interpreter is.

Saturday, July 24, 2021

Inserting 130M SQLite Rows per Minute...from a Scripting Language

The other week, I stumbled on the post Inserting One Billion Rows in SQLite Under A Minute, which was a funny coincidence, as I was just in the process of giving my own SQLite/Objective-S adapter a bit of tune-up. (The post's title later had "Towards" prepended, because the author wasn't close to hitting that goal).

This SQLite adapater was a spin-off of my earlier article series on optimizing JSON performance, itself triggered by the ludicrously bad performance of Swift Coding at this rather simple and relevant task. To recap: Swift's JSON coder clocked in at about 10MB/s. By using a streaming approach and a bit of tuning, we got that to around 200MB/s.

Since then, I have worked on making Objective-S much more useful for UI work, with the object-literal syntax making defining UIs as convenient as the various "declarative" functional approaches such as React or SwiftUI. Except it is still using the same AppKit or UIKit objects we know and love, and doesn't force us to embrace the silly notion that the UI is a pure function of the model. Oh, and you get live previews that actually work. But more on that later.

So I am slowly inching towards doing a ToDoMVC, a benchmark that feels rather natural to me. While I am still very partial to just dumping JSON files, and the previous article series hopefully showed that this approach is plenty fast enough, I realize that a lot of people prefer a "real" database, especially on the back-end, and I wanted to build that as well. One of the many benchmarks I have for Objective-S is that it should be possible to build a nicer Rails with it. (At this point in time I am pretty sure I will hit that benchmark).

One of the ways to figure out if you have a good design is to stress-test it. One very useful stress-test is seeing how fast it can go, because that will tell you if the thing you built is lean, or if you put in unnecessary layers and indirections.

This is particularly interesting in a Scripted Components (pdf) system that combines a relatively slow but flexible interactive scripting language with fast, optimized components. The question is whether you can actually combine the flexibility of the scripting language while reaping the benefits of the fast components, rather than having to dive into adapting and optimizing the components for each use case, or just getting slow performance despite the fast components. My hunch was that the streaming approach I have been using for a while now and that worked really well for JSON and Objective-C would also do well in this more challenging setting.

Spoiler alert: it did!

The benchmark

The benchmark was a slightly modified version of the script that serves as a tasks backend. Like said sample script it also creates a tasks database and inserts some example rows. Instead of inserting two rows, it inserts 10 million. Or a hundred million.


#!env stsh
#-taskbench:dbref
#

class Task {
	var  id.
	var  done.
	var  title.
	-description { "". }
	+sqlForCreate {
		'( [id] INTEGER PRIMARY KEY, [title] VARCHAR(220) NOT NULL, [done] INTEGER );'.
	}
}.

scheme todo : MPWAbstractStore {
	var db.
	var tasksTable.
	-initWithRef:ref {
		this:db := (MPWStreamQLite alloc initWithPath:ref path).
		this:tasksTable :=  #MPWSQLTable{ #db: this:db , #tableClass: Task, #name: 'tasks'  }.
		this:db open.
		self.
	}
	-createTable {
		this:tasksTable create.
	    this:tasksTable := this:db tables at:'tasks'.
        this:tasksTable createEncoderMethodForClass: Task.
	}
	-createTaskListToInsert:log10ofSize {
		baseList ← #( #Task{  #title: 'Clean Room', #done: false }, #Task{  #title: 'Check Twitter', #done: true } ).
		...replicate ...
		taskList.
	}
	-insertTasks {
	    taskList := self createTaskListToInsert:6.
		1 to:10 do: {
			this:tasksTable insert:taskList.
		}.
	}
}.
todo := todo alloc initWithRef:dbref.
todo createTable.
todo insertTasks.

(I have removed the body of the method that replicates the 2 tasks into the list of millions of tasks we need to insert. It was bulky and not relevant.)

In this sample we define the Task class and use that to create the SQL Table. We could also have simply created the table and generated a Tasks class from that.

Anyway, running this script yields the following result.


> time ./taskbench-sqlite.st /tmp/tasks1.db 
./taskbench-sqlite.st /tmp/tasks1.db  4.07s user 0.20s system 98% cpu 4.328 total
> ls -al  /tmp/tasks1.db* 
-rw-r--r--  1 marcel  wheel   214M Jul 24 20:11 /tmp/tasks1.db
> sqlite3 /tmp/tasks1.db 'select count(id) from tasks;' 
10000000

So we inserted 10M rows in 4.328 seconds, yielding several hundred megabytes of SQLite data. This would be 138M rows had we let it run for a minute. Nice. For comparison, the original article's numbers were 11M rows/minute for CPython, 40M rows/minute for PyPy and 181M rows/minute for Rust, though on a slower Intel MacBook Pro whereas I was running this on an M1 Air. I compiled and ran the Rust version on my M1 Air and it did 100M rows in 21 seconds, so just a smidgen over twice as fast as my Objective-S script, though with a simpler schema (CHAR(6) instead of VARCHAR(220)) and less data (1.5GB vs. 2.1GB for 100M rows).

Getting SQLite fast

The initial version of the script was far, far slower, and at first it was, er, "sub-optimal" use of SQLite that was the main culprit, mostly inserting every row by itself without batching. When SQLite sees an INSERT (or an UPDATE for that matter) that is not contained in a transaction, it will automatically wrap that INSERT inside a generated transaction and commit that transaction after the INSERT is processed. Since SQLite is very fastidious about ensuring that transactions get to disk atomically, this is slow. Very slow.

The class handling SQLite inserts is a Polymorphic Write Stream, so it knows what an array is. When it encounters one, it sends itself the beginArray message, writes the contents of the array and finishes by sending itself the endArray message. Since writing an array sort of implies that you want to write all of it, this was a good place to insert the transactions:


-(void)beginArray {
    sqlite3_step(begin_transaction);
    sqlite3_reset(begin_transaction);
}

-(void)endArray {
    sqlite3_step(end_transaction);
    sqlite3_reset(end_transaction);
}

So now, if you want to write a bunch of objects as a single transaction, just write them as an array, as the benchmark code does. There were some other minor issues, but after that less than 10% of the total time were spent in SQLite, so it was time to optimize the caller, my code.

Column keys and Cocoa Strings

At this point, my guess was that the biggest remaining slowdown would be my, er, "majestic" Objective-S interpreter. I was wrong, it was Cocoa string handling. Not only was I creating the SQLite parameter placeholder keys dynamically, so allocating new NSString objects for each column of each row, it also happens that getting character data from an NSString object nowadays involves some very complex and slow internal machinery using encoding conversion streams. -UTF8String is not your friend, and other methods appear to fairly consistently use the same slow mechanism. I guess making NSString horribly slow is one way to make other string handling look good in comparison.

After a few transformations, the code would just look up the incoming NSString key in a dictionary that mapped it to the SQLite parameter index. String-processing and character accessing averted.

Jitting the encoder method. Without a JIT

One thing you might have noticed about the class definition in the benchmark code is that there is no encoder method, it just defines its instance variables and some other utilities. So how is the class data encoded for the SQLTable? KVC? No, that would be a bit slow, though it might make a good fallback.

The magic is the createEncoderMethodForClass: method. This method, as the name suggests, creates an encoder method by pasting together a number of blocks, turns the top-level into a method using imp_implementationWithBlock(), and then finally adds that method to the class in question using class_addMethod().


-(void)createEncoderMethodForClass:(Class)theClass
{
    NSArray *ivars=[theClass allIvarNames];
    if ( [[ivars lastObject] hasPrefix:@"_"]) {
        ivars=(NSArray*)[[ivars collect] substringFromIndex:1];
    }
    
    NSMutableArray *copiers=[[NSMutableArray arrayWithCapacity:ivars.count] retain];
    for (NSString *ivar in ivars) {
        MPWPropertyBinding *accessor=[[MPWPropertyBinding valueForName:ivar] retain];
        [ivar retain];
        [accessor bindToClass:theClass];
        
        id objBlock=^(id object, MPWFlattenStream* stream){
            [stream writeObject:[accessor valueForTarget:object] forKey:ivar];
        };
        id intBlock=^(id object, MPWFlattenStream* stream){
            [stream writeInteger:[accessor integerValueForTarget:object] forKey:ivar];
        };
        int typeCode = [accessor typeCode];
        
        if ( typeCode == 'i' || typeCode == 'q' || typeCode == 'l' || typeCode == 'B' ) {
            [copiers addObject:Block_copy(intBlock)];
        } else {
            [copiers addObject:Block_copy(objBlock)];
        }
    }
    void (^encoder)( id object, MPWFlattenStream *writer) = Block_copy( ^void(id object, MPWFlattenStream *writer) {
        for  ( id block in copiers ) {
            void (^encodeIvar)(id object, MPWFlattenStream *writer)=block;
            encodeIvar(object, writer);
        }
    });
    void (^encoderMethod)( id blockself, MPWFlattenStream *writer) = ^void(id blockself, MPWFlattenStream *writer) {
        [writer writeDictionaryLikeObject:blockself withContentBlock:encoder];
    };
    IMP encoderMethodImp = imp_implementationWithBlock(encoderMethod);
    class_addMethod(theClass, [self streamWriterMessage], encoderMethodImp, "v@:@" );
}

What's kind of neat is that I didn't actually write that method for this particular use-case: I had already created it for JSON-coding. Due to the fact that the JSON-encoder and the SQLite writer are both Polymorphic Write Streams (as are the targets of the corresponding decoders/parsers), the same method worked out of the box for both.

(It should be noted that this encoder-generator currently does not handle all variety of data types; this is intentional).

Getting the data out of Objective-S objects

The encoder method uses MPWPropertyBinding objects to efficiently access the instance variables via the object's accessors, caching IMPs and converting data as necessary, so they are both efficient and flexible. However, the actual accessors that Objective-S generated for its instance variables were rather baroque, because they used the same basic mechanism used for Objective-S methods, which can only deal with objects, not with primitive data types.

In order to interoperate seamlessly with Objective-C, which expected methods that can take data types other than objects, all non-object method arguments are converted to objects on the way in, and return values are converted from objects to primitive values on the way out.

So even the accessors for primitive types such as the integer "id" or the boolean "done" would have their values converted to and from objects by the interface machinery. As I noted above, I was a bit surprised that this inefficiency was overshadowed by the NSString-based key handling.

In fact, one of the reason for pursuing the SQLite insert benchmark was to have a reason for finally tackling this Rube-Goldberg mechanism. In the end, actually addressing it turned out to be far less complex than I had feared, with the technique being very similar to that used for the encoder-generator above, just simpler.

Depending on the type, we use a different block that gets parameterised with the offset to the instance variable. I show the setter-generator below, because there the code for the object-case is actually different due to retain-count handling:


#define pointerToVarInObject( type, anObject ,offset)  ((type*)(((char*)anObject) + offset))

#ifndef __clang_analyzer__
// This leaks because we are installing into the runtime, can't remove after

-(void)installInClass:(Class)aClass
{
    SEL aSelector=NSSelectorFromString([self objcMessageName]);
    const char *typeCode=NULL;
    int ivarOffset = (int)[ivarDef offset];
    IMP getterImp=NULL;
    switch ( ivarDef.objcTypeCode ) {
        case 'd':
        case '@':
            typeCode = "v@:@";
            void (^objectSetterBlock)(id object,id arg) = ^void(id object,id arg) {
                id *p=pointerToVarInObject(id,object,ivarOffset);
                if ( *p != arg ) {
                    [*p release];
                    [arg retain];
                    *p=arg;
                }
            };
            getterImp=imp_implementationWithBlock(objectSetterBlock);
            break;
        case 'i':
        case 'l':
        case 'B':
            typeCode = "v@:l";
            void (^intSetterBlock)(id object,long arg) = ^void(id object,long arg) {
                *pointerToVarInObject(long,object,ivarOffset)=arg;
            };
            getterImp=imp_implementationWithBlock(intSetterBlock);
            break;
        default:
            [NSException raise:@"invalidtype" format:@"Don't know how to generate set accessor for type '%c'",ivarDef.objcTypeCode];
            break;
    }
    if ( getterImp && typeCode ) {
        class_addMethod(aClass, aSelector, getterImp, typeCode );
    }
    
}

At this point, profiles were starting to approach around two thirds of the time being spent in sqlite_ functions, so the optimisation efforts were starting to get into a region of diminishing returns.

Linear scan beats dictionary

One final noticeable point of obvious overhead was the (string) key to parameter index mapping, which the optimizations above had left at a NSDictionary mapping from NSString to NSNumber. As you probably know, NSDictionary isn't exactly the fastest. One idea was to replace that lookup with a MPWFastrStringTable, but that means either needing to solve the problem of fast access to NSString character data or changing the protocol.

So instead I decided to brute-force it: I store the actual pointers to the NSString objects in a C-Array indexed by the SQLite parameter index. Before I do the other lookup, which I keep to be safe, I do a linear scan in that table using the incoming string pointer. This little trick largely removed the parameter index lookup from my profiles.

Conclusion

With those final tweaks, the code is probably quite close to as fast as it is going to get. Its slower performance compared to the Rust code can be attributed to the fact that it is dealing with more data and a more complex schema, as well as having to actually obtain data from materialized objects, whereas the Rust code just generates the SQlite calls on-the-fly.

All this is achieved from a slow, interpreted scripting language, with all the variable parts (data class, steering code) defined in said slow scripting language. So while I look forward to the native compiler for Objective-S, it is good to know that it isn't absolutely necessary for excellent performance, and that the basic design of these APIs is sound.

Wednesday, June 30, 2021

Don't Generate Glue...Exterminate!!

Today I saw the news of github's release of the "AI Autopilot". As far as I can tell, it's an impressive piece of engineering that shouldn't exist. I mean, "Paste Code from Stack Overflow as a Service" was supposed to be a joke, not a product spec.

As of this writing, the first example given on the product page is some code to call a REST service that does sentiment analysis, which the AI helpfully completes. For reference, this is the code:


#!/usr/bin/env ts-node

import { fetch } from "fetch-h2";

// Determine whether the sentiment of text is positive
// Use a web service
async function isPositive(text: string): Promise<boolean> {
  const response = await fetch(`http://text-processing.com/api/sentiment/`, {
    method: "POST",
    body: `text=${text}`,
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
    },
  });
  const json = await response.json();
  return json.label === "pos";
}

Here is the same script in Objective-S:
#!env stsh
#-sentiment:text
((ref:http://text-processing.com/api/sentiment/ postForm:#{ #text: text }) at:'label') = 'pos'

And once you have it, reuse it. And keep those Daleks at bay.

Monday, June 28, 2021

Generating ARM Assembly: First Steps

Finally took the plunge to start generating ARM64 assembly. As expected, the actual coding was much easier than overcoming the barrier to just start doing it.

The following snippet generates a program that prints a message to stdout, so a classic "Hello World":


#!env stsh
#-gen:msg

messageLabel ← 'message'.
main ← '_main'.

framework:ObjSTNative load
arm := MPWARMAssemblyGenerator stream 
arm global: main;
    align:2;
    label:main;
    mov:0 value:1;
    adr:1 address:messageLabel;
    mov:2 value: msg length;
    mov:16 value:4;
    svc:128;
    mov:0 value:0;
    ret;
    label:messageLabel;
    asciiz:msg.

file:hello-main.s := arm target 

One little twist is that the message to print gets passed to the generator. I like how Smalltalk's keyword syntax keeps the code uncluttered, and often pretty close to the actual assembly that will be generated.

Of particular help here is message cascading using the semicolon. This means I don't have to repeat the receiver of the message, but can just keep sending it messages. Cascading works well together with streams, because there are no return values to contend with, we just keep appending to the stream.

When invoked using ./genhello-main.st 'Hi Marcel, finish that blog post and get on your bike!', the generated code is as follows:


.global	_main
.align	2
_main:
	mov	X0, #1
	adr	X1, message
	mov	X2, #54
	mov	X16, #4
	svc	#128
	mov	X0, #0
	ret
message:
	.asciz	"Hi Marcel, finish that blog post and get on your bike!"

And now I am going to do what it says :-)