Saturday, July 24, 2021

Inserting 130M SQLite Rows per Minute...from a Scripting Language

The other week, I stumbled on the post Inserting One Billion Rows in SQLite Under A Minute, which was a funny coincidence, as I was just in the process of giving my own SQLite/Objective-S adapter a bit of tune-up. (The post's title later had "Towards" prepended, because the author wasn't close to hitting that goal).

This SQLite adapater was a spin-off of my earlier article series on optimizing JSON performance, itself triggered by the ludicrously bad performance of Swift Coding at this rather simple and relevant task. To recap: Swift's JSON coder clocked in at about 10MB/s. By using a streaming approach and a bit of tuning, we got that to around 200MB/s.

Since then, I have worked on making Objective-S much more useful for UI work, with the object-literal syntax making defining UIs as convenient as the various "declarative" functional approaches such as React or SwiftUI. Except it is still using the same AppKit or UIKit objects we know and love, and doesn't force us to embrace the silly notion that the UI is a pure function of the model. Oh, and you get live previews that actually work. But more on that later.

So I am slowly inching towards doing a ToDoMVC, a benchmark that feels rather natural to me. While I am still very partial to just dumping JSON files, and the previous article series hopefully showed that this approach is plenty fast enough, I realize that a lot of people prefer a "real" database, especially on the back-end, and I wanted to build that as well. One of the many benchmarks I have for Objective-S is that it should be possible to build a nicer Rails with it. (At this point in time I am pretty sure I will hit that benchmark).

One of the ways to figure out if you have a good design is to stress-test it. One very useful stress-test is seeing how fast it can go, because that will tell you if the thing you built is lean, or if you put in unnecessary layers and indirections.

This is particularly interesting in a Scripted Components (pdf) system that combines a relatively slow but flexible interactive scripting language with fast, optimized components. The question is whether you can actually combine the flexibility of the scripting language while reaping the benefits of the fast components, rather than having to dive into adapting and optimizing the components for each use case, or just getting slow performance despite the fast components. My hunch was that the streaming approach I have been using for a while now and that worked really well for JSON and Objective-C would also do well in this more challenging setting.

Spoiler alert: it did!

The benchmark

The benchmark was a slightly modified version of the script that serves as a tasks backend. Like said sample script it also creates a tasks database and inserts some example rows. Instead of inserting two rows, it inserts 10 million. Or a hundred million.


#!env stsh
#-taskbench:dbref
#

class Task {
	var  id.
	var  done.
	var  title.
	-description { "". }
	+sqlForCreate {
		'( [id] INTEGER PRIMARY KEY, [title] VARCHAR(220) NOT NULL, [done] INTEGER );'.
	}
}.

scheme todo : MPWAbstractStore {
	var db.
	var tasksTable.
	-initWithRef:ref {
		this:db := (MPWStreamQLite alloc initWithPath:ref path).
		this:tasksTable :=  #MPWSQLTable{ #db: this:db , #tableClass: Task, #name: 'tasks'  }.
		this:db open.
		self.
	}
	-createTable {
		this:tasksTable create.
	    this:tasksTable := this:db tables at:'tasks'.
        this:tasksTable createEncoderMethodForClass: Task.
	}
	-createTaskListToInsert:log10ofSize {
		baseList ← #( #Task{  #title: 'Clean Room', #done: false }, #Task{  #title: 'Check Twitter', #done: true } ).
		...replicate ...
		taskList.
	}
	-insertTasks {
	    taskList := self createTaskListToInsert:6.
		1 to:10 do: {
			this:tasksTable insert:taskList.
		}.
	}
}.
todo := todo alloc initWithRef:dbref.
todo createTable.
todo insertTasks.

(I have removed the body of the method that replicates the 2 tasks into the list of millions of tasks we need to insert. It was bulky and not relevant.)

In this sample we define the Task class and use that to create the SQL Table. We could also have simply created the table and generated a Tasks class from that.

Anyway, running this script yields the following result.


> time ./taskbench-sqlite.st /tmp/tasks1.db 
./taskbench-sqlite.st /tmp/tasks1.db  4.07s user 0.20s system 98% cpu 4.328 total
> ls -al  /tmp/tasks1.db* 
-rw-r--r--  1 marcel  wheel   214M Jul 24 20:11 /tmp/tasks1.db
> sqlite3 /tmp/tasks1.db 'select count(id) from tasks;' 
10000000

So we inserted 10M rows in 4.328 seconds, yielding several hundred megabytes of SQLite data. This would be 138M rows had we let it run for a minute. Nice. For comparison, the original article's numbers were 11M rows/minute for CPython, 40M rows/minute for PyPy and 181M rows/minute for Rust, though on a slower Intel MacBook Pro whereas I was running this on an M1 Air. I compiled and ran the Rust version on my M1 Air and it did 100M rows in 21 seconds, so just a smidgen over twice as fast as my Objective-S script, though with a simpler schema (CHAR(6) instead of VARCHAR(220)) and less data (1.5GB vs. 2.1GB for 100M rows).

Getting SQLite fast

The initial version of the script was far, far slower, and at first it was, er, "sub-optimal" use of SQLite that was the main culprit, mostly inserting every row by itself without batching. When SQLite sees an INSERT (or an UPDATE for that matter) that is not contained in a transaction, it will automatically wrap that INSERT inside a generated transaction and commit that transaction after the INSERT is processed. Since SQLite is very fastidious about ensuring that transactions get to disk atomically, this is slow. Very slow.

The class handling SQLite inserts is a Polymorphic Write Stream, so it knows what an array is. When it encounters one, it sends itself the beginArray message, writes the contents of the array and finishes by sending itself the endArray message. Since writing an array sort of implies that you want to write all of it, this was a good place to insert the transactions:


-(void)beginArray {
    sqlite3_step(begin_transaction);
    sqlite3_reset(begin_transaction);
}

-(void)endArray {
    sqlite3_step(end_transaction);
    sqlite3_reset(end_transaction);
}

So now, if you want to write a bunch of objects as a single transaction, just write them as an array, as the benchmark code does. There were some other minor issues, but after that less than 10% of the total time were spent in SQLite, so it was time to optimize the caller, my code.

Column keys and Cocoa Strings

At this point, my guess was that the biggest remaining slowdown would be my, er, "majestic" Objective-S interpreter. I was wrong, it was Cocoa string handling. Not only was I creating the SQLite parameter placeholder keys dynamically, so allocating new NSString objects for each column of each row, it also happens that getting character data from an NSString object nowadays involves some very complex and slow internal machinery using encoding conversion streams. -UTF8String is not your friend, and other methods appear to fairly consistently use the same slow mechanism. I guess making NSString horribly slow is one way to make other string handling look good in comparison.

After a few transformations, the code would just look up the incoming NSString key in a dictionary that mapped it to the SQLite parameter index. String-processing and character accessing averted.

Jitting the encoder method. Without a JIT

One thing you might have noticed about the class definition in the benchmark code is that there is no encoder method, it just defines its instance variables and some other utilities. So how is the class data encoded for the SQLTable? KVC? No, that would be a bit slow, though it might make a good fallback.

The magic is the createEncoderMethodForClass: method. This method, as the name suggests, creates an encoder method by pasting together a number of blocks, turns the top-level into a method using imp_implementationWithBlock(), and then finally adds that method to the class in question using class_addMethod().


-(void)createEncoderMethodForClass:(Class)theClass
{
    NSArray *ivars=[theClass allIvarNames];
    if ( [[ivars lastObject] hasPrefix:@"_"]) {
        ivars=(NSArray*)[[ivars collect] substringFromIndex:1];
    }
    
    NSMutableArray *copiers=[[NSMutableArray arrayWithCapacity:ivars.count] retain];
    for (NSString *ivar in ivars) {
        MPWPropertyBinding *accessor=[[MPWPropertyBinding valueForName:ivar] retain];
        [ivar retain];
        [accessor bindToClass:theClass];
        
        id objBlock=^(id object, MPWFlattenStream* stream){
            [stream writeObject:[accessor valueForTarget:object] forKey:ivar];
        };
        id intBlock=^(id object, MPWFlattenStream* stream){
            [stream writeInteger:[accessor integerValueForTarget:object] forKey:ivar];
        };
        int typeCode = [accessor typeCode];
        
        if ( typeCode == 'i' || typeCode == 'q' || typeCode == 'l' || typeCode == 'B' ) {
            [copiers addObject:Block_copy(intBlock)];
        } else {
            [copiers addObject:Block_copy(objBlock)];
        }
    }
    void (^encoder)( id object, MPWFlattenStream *writer) = Block_copy( ^void(id object, MPWFlattenStream *writer) {
        for  ( id block in copiers ) {
            void (^encodeIvar)(id object, MPWFlattenStream *writer)=block;
            encodeIvar(object, writer);
        }
    });
    void (^encoderMethod)( id blockself, MPWFlattenStream *writer) = ^void(id blockself, MPWFlattenStream *writer) {
        [writer writeDictionaryLikeObject:blockself withContentBlock:encoder];
    };
    IMP encoderMethodImp = imp_implementationWithBlock(encoderMethod);
    class_addMethod(theClass, [self streamWriterMessage], encoderMethodImp, "v@:@" );
}

What's kind of neat is that I didn't actually write that method for this particular use-case: I had already created it for JSON-coding. Due to the fact that the JSON-encoder and the SQLite writer are both Polymorphic Write Streams (as are the targets of the corresponding decoders/parsers), the same method worked out of the box for both.

(It should be noted that this encoder-generator currently does not handle all variety of data types; this is intentional).

Getting the data out of Objective-S objects

The encoder method uses MPWPropertyBinding objects to efficiently access the instance variables via the object's accessors, caching IMPs and converting data as necessary, so they are both efficient and flexible. However, the actual accessors that Objective-S generated for its instance variables were rather baroque, because they used the same basic mechanism used for Objective-S methods, which can only deal with objects, not with primitive data types.

In order to interoperate seamlessly with Objective-C, which expected methods that can take data types other than objects, all non-object method arguments are converted to objects on the way in, and return values are converted from objects to primitive values on the way out.

So even the accessors for primitive types such as the integer "id" or the boolean "done" would have their values converted to and from objects by the interface machinery. As I noted above, I was a bit surprised that this inefficiency was overshadowed by the NSString-based key handling.

In fact, one of the reason for pursuing the SQLite insert benchmark was to have a reason for finally tackling this Rube-Goldberg mechanism. In the end, actually addressing it turned out to be far less complex than I had feared, with the technique being very similar to that used for the encoder-generator above, just simpler.

Depending on the type, we use a different block that gets parameterised with the offset to the instance variable. I show the setter-generator below, because there the code for the object-case is actually different due to retain-count handling:


#define pointerToVarInObject( type, anObject ,offset)  ((type*)(((char*)anObject) + offset))

#ifndef __clang_analyzer__
// This leaks because we are installing into the runtime, can't remove after

-(void)installInClass:(Class)aClass
{
    SEL aSelector=NSSelectorFromString([self objcMessageName]);
    const char *typeCode=NULL;
    int ivarOffset = (int)[ivarDef offset];
    IMP getterImp=NULL;
    switch ( ivarDef.objcTypeCode ) {
        case 'd':
        case '@':
            typeCode = "v@:@";
            void (^objectSetterBlock)(id object,id arg) = ^void(id object,id arg) {
                id *p=pointerToVarInObject(id,object,ivarOffset);
                if ( *p != arg ) {
                    [*p release];
                    [arg retain];
                    *p=arg;
                }
            };
            getterImp=imp_implementationWithBlock(objectSetterBlock);
            break;
        case 'i':
        case 'l':
        case 'B':
            typeCode = "v@:l";
            void (^intSetterBlock)(id object,long arg) = ^void(id object,long arg) {
                *pointerToVarInObject(long,object,ivarOffset)=arg;
            };
            getterImp=imp_implementationWithBlock(intSetterBlock);
            break;
        default:
            [NSException raise:@"invalidtype" format:@"Don't know how to generate set accessor for type '%c'",ivarDef.objcTypeCode];
            break;
    }
    if ( getterImp && typeCode ) {
        class_addMethod(aClass, aSelector, getterImp, typeCode );
    }
    
}

At this point, profiles were starting to approach around two thirds of the time being spent in sqlite_ functions, so the optimisation efforts were starting to get into a region of diminishing returns.

Linear scan beats dictionary

One final noticeable point of obvious overhead was the (string) key to parameter index mapping, which the optimizations above had left at a NSDictionary mapping from NSString to NSNumber. As you probably know, NSDictionary isn't exactly the fastest. One idea was to replace that lookup with a MPWFastrStringTable, but that means either needing to solve the problem of fast access to NSString character data or changing the protocol.

So instead I decided to brute-force it: I store the actual pointers to the NSString objects in a C-Array indexed by the SQLite parameter index. Before I do the other lookup, which I keep to be safe, I do a linear scan in that table using the incoming string pointer. This little trick largely removed the parameter index lookup from my profiles.

Conclusion

With those final tweaks, the code is probably quite close to as fast as it is going to get. Its slower performance compared to the Rust code can be attributed to the fact that it is dealing with more data and a more complex schema, as well as having to actually obtain data from materialized objects, whereas the Rust code just generates the SQlite calls on-the-fly.

All this is achieved from a slow, interpreted scripting language, with all the variable parts (data class, steering code) defined in said slow scripting language. So while I look forward to the native compiler for Objective-S, it is good to know that it isn't absolutely necessary for excellent performance, and that the basic design of these APIs is sound.

No comments:

Post a Comment