metablog: Beyond Faster JSON Support for iOS/macOS, Part 9: CSV and SQLite

When looking at the MPWPlistStreaming protocol that I've been using for my JSON parsing series, one thing that was probably noticeable is that it isn't particularly JSON-focused. In fact, it wasn't even initially designed for parsing, but for generating.

So could we use this for other de-serialization tasks? Glad you asked!

CSV parsing

One of the examples in my performance book involves parsing Comma Separated Values quickly, within the context of getting the time to convert a 139Mb GTFS file to something usable on the phone down from 20 minutes using using CoreData/SQLite to slightly less than a second using custom in-memory data structures that are also several orders of magnitude faster to query on-device.

The original project's CVS parser took around 18 seconds, which wasn't a significant part of the 20 minutes, but when the rest only took a couple of hundred milliseconds, it was time to make that part faster as well. The result, slightly generalized, is MPWDelimitedTable ( .h .m ).

The basic interface is block-based, with the block being called for every row in the table, called with a dictionary composed of the header row as keys and the contents of the row as values.


-(void)do:(void(^)(NSDictionary* theDict, int anIndex))block;

Adapting this to the MPWPlistStreaming protocol is straightforward:


-(void)writeOnBuilder:(id )builder
{
    [builder beginArray];
    [self do:^(NSDictionary* theDict, int anIndex){
        [builder beginDictionary];
        for (NSString *key in self.headerKeys) {
            [builder writeObject:theDict[key] forKey:key];
        }
        [builder endDictionary];
    }];
    [builder endArray];
}

This is a quick-and-dirty implementation based on the existing API that is clearly sub-optimal: the API we call first constructs a dictionary from the row and the header keys and then we iterate over it. However, it works with our existing set of builders and doesn't build an in-memory representation of the entire CSV.

It will also be relatively straightforward to invert this API usage, modifying the low-level API to use MPWPlistStreaming and then creating a higher-level block- and dictionay-based API on top of that, in a way that will also work with other MPWPlistStreaming clients.

SQLite

Another tabular data format is SQL data bases. On macOS/iOS, one very common database is SQLite, usually accessed via CoreData or the excellent and much more light-weight fmdb.

Having used fmdb myself before, and bing quite delighted with it, my first impulse was to write a MPWPlistStreaming adapter for it, but after looking at the code a bit more closely, it seemed that it was doing quite a bit that I would not need for MPWPlistStreaming.

I also think I saw the same trade-off between a convenient and slow convenience based on NSDictionary and a much more complex but potentially faster API based on pulling individual type values.

So Instead I decided to try and do something ultra simple that sits directly on top of the SQLite C-API, and the implementation is really quite simple and compact:


@interface MPWStreamQLite()

@property (nonatomic, strong) NSString *databasePath;

@end

@implementation MPWStreamQLite
{
    sqlite3 *db;
}

-(instancetype)initWithPath:(NSString*)newpath
{
    self=[super init];
    self.databasePath = newpath;
    return self;
}

-(int)exec:(NSString*)sql
{
    sqlite3_stmt *res;
    int rc = sqlite3_prepare_v2(db, [sql UTF8String], -1, &res, 0);
    @autoreleasepool {
        [self.builder beginArray];
        int step;
        int numCols=sqlite3_column_count(res);
        NSString* keys[numCols];
        for (int i=0;i < numCols; i++) {
            keys[i]=@(sqlite3_column_name(res, i));
        }
        while ( SQLITE_ROW == (step = sqlite3_step(res))) {
            @autoreleasepool {
                [self.builder beginDictionary];
                for (int i=0; i < numCols; i++) {
                    const char *text=(const char*)sqlite3_column_text(res, i);
                    if (text) {
                        [self.builder writeObject:@(text) forKey:keys[i]];
                    }
                }
                [self.builder endDictionary];
            }
        }
        sqlite3_finalize(res);
        [self.builder endArray];
    }
    return rc;
}

-(int)open
{
    return sqlite3_open([self.databasePath UTF8String], &db);
}

-(void)close
{
    if (db) {
        sqlite3_close(db);
        db=NULL;
    }
}

Of course, this doesn't do a lot, chiefly it only reads, no updates, inserts or deletes. However, the code is striking in its brevity and simplicity, while at the same time being both convenient and fast, though with still some room for improvement.

In my experience, you tend to not get all three of these properties at the same time: code that is simple and convenient tends to be slow, code that is convenient and fast tends to be rather tricky and code that's simple and fast tends to be inconvenient to use.

How easy to use is it? The following code turns a table into an array of dictionaries:


#import <MPWFoundation/MPWFoundation.h>

int main(int argc, char* argv[]) {
   MPWStreamQLite *db=[[MPWStreamQLite alloc] initWithPath:@"chinook.db"];
   db.builder = [MPWPListBuilder new];
   if( [db open] == 0 ) {
       [db exec:@"select * from artists;"];
       NSLog(@"results: %@",[db.builder result]);
       [db close];
   } else {
       NSLog(@"Can't open database: %s\n", [db error]);
   }
   return(0);
}

This is pretty good, but probably roughly par for the course for returning a generic data structure such as array of dictionaries, which is not going to be particularly efficient. (One of my first clues that CoreData's predecessor EOF wasn't particularly fast was when I read that fetching raw dictionaries was an optimization, much faster than fetching objects.)

What if we want to get objects instead? Easy, just replace the MPWPListBuilder with an MPWObjectBuilder, parametrized with the class to create. Well, and define the class, but presumably you already havee that if the task is to convert to objects of that class. And it cold obviously also be automated.



#import <MPWFoundation/MPWFoundation.h>

@interface Artist : NSObject { }

@property (assign) long ArtistId;
@property (nonatomic,strong) NSString *Name;

@end

@implementation Artist

-(NSString*)description
{
	return [NSString stringWithFormat:@"<%@:%p id: %ld name: %@>",[self class],self,self.ArtistId,self.Name];
}

@end

int main(int argc, char* argv[]) {
   MPWStreamQLite *db=[[MPWStreamQLite alloc] initWithPath:@"chinook.db"];
   db.builder = [[MPWObjectBuilder alloc] initWithClass:[Artist class]];
   if( [db open] == 0) {
       [db exec:@"select * from artists"];
       NSLog(@"results: %@",[db.builder result]);
       [db close];
   } else {
       NSLog(@"Can't open database: %s\n", [db error]);
   }
   return(0);
}

Note that this does not generate a plist representation as an intermediate step, it goes straight from database result sets to objects. The generic intermediate "format" is the MPWPlistStreaming protocol, which is a dematerialized representation, both plist and objects are peers.

metablog

Sunday, June 21, 2020

Beyond Faster JSON Support for iOS/macOS, Part 9: CSV and SQLite

CSV parsing

SQLite

TOC

No comments:

Post a Comment