Saturday, September 22, 2007

Message-oriented persistence

The good folks at Omni posted an interesting discussion of their persistence strategy for OmniFocus. In short, they found that using a database, specifically a CoreData data store, was not exactly ideal for their primary public data format.

Instead, they appear to be using a pattern that Martin Fowler calls EventPoster. After reading David Reed's thesis, I think I prefer to call it message-oriented persistence.

I first stumbled on this pattern when designing a replacement for a feed processor at the BBC. The basic task was to process a feed of information snippets encoded as XML and generate/update web and interactive TV (Ceefax) pages.

Like a good little enterprise architect, and similar to the existing system, I had originally planned to use a central SQL database for storage, though designing a data model for that was proving difficult due to the highly irregular nature of the feed data. As an auditing/logging measure, I also wanted to keep a copy of the incoming feed data, so when the time came to do the first implementation spikes, I decided we would implement the display, update and XML feed processing logic, but not the datastore. Instead, we would just replay the feed data from the log we had kept.

This worked so well that we never got around to implementing the central database.
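The replay idea can be sketched in a few lines of Python. This is not the BBC code, just a minimal illustration under stated assumptions: the `FeedStore` class, the JSON-lines log format and the `page`/`text` message fields are all made up for the example. Every incoming message is appended to a log file, the working state lives only in memory, and a restart rebuilds that state by re-reading the log.

```python
import json
from pathlib import Path


class FeedStore:
    """In-memory working state rebuilt by replaying a message log.

    Hypothetical sketch: the class name, the JSON-lines log format and
    the 'page'/'text' message fields are illustrative assumptions.
    """

    def __init__(self, log_path):
        self.log_path = Path(log_path)
        self.pages = {}  # working state lives entirely in memory
        self._replay()

    def _apply(self, message):
        # The only place state changes: interpreting one message.
        self.pages[message["page"]] = message["text"]

    def _replay(self):
        # Rebuild the state from scratch by re-reading every logged message.
        if not self.log_path.exists():
            return
        with self.log_path.open(encoding="utf-8") as log:
            for line in log:
                if line.strip():
                    self._apply(json.loads(line))

    def receive(self, message):
        # Log first, then apply, so the log always covers the state.
        with self.log_path.open("a", encoding="utf-8") as log:
            log.write(json.dumps(message) + "\n")
        self._apply(message)
```

Creating a second `FeedStore` on the same log file yields the same `pages` as the first one, which is all the "datastore" there is: a file and a working directory.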

Leaving out the database vastly simplified both our code-base and the deployed system, which I could run in its entirety on my 12" AlBook, whereas the system we were replacing ran on around a dozen networked machines. Apart from making us popular with our sysadmin team both in terms of reliability and deployment/maintenance complexity (essentially a jar and a working directory were all it needed), a fringe benefit was being able to work on the system on said AlBook while disconnected from the network, working from home or from a sunny patch of grass outside the office.

In addition to personal happiness, system performance was also positively affected: since we kept our working state completely in memory, the AlBook mentioned outperformed the original cluster by 2-3 orders of magnitude, producing hundreds of pages per second versus taking from several seconds to several minutes to produce a single page.

Performance and simplicity are exactly the benefits claimed for Prevayler, a persistence layer for Java based on the same principles.

TeaTime, the theoretical foundation and actual engine working underneath Croquet, takes this idea to a whole different level: objects do not have state, but are simply names for message histories. This is truly "objects [and messages] all the way down". Turtles need not apply.

8 comments:

Anonymous said...

The GOF "Command Pattern" is made partly unnecessary by Objective-C's built-in messaging capabilities. GOF describe the technique of storing commands to form a persistent log of changes. Indeed, Cocoa's built-in undo/redo uses the approach. This is actually a well-established use of the pattern.

Anonymous said...

In my experience, the pivot point between the two approaches comes down to whether or not the message history is overkill for encoding the state. I'm working on a game, for example, and started off simply storing the moves, but it's becoming intractable to run through thousands of them for large/long games just to get to a current state that could just as easily be represented in a few bytes. Compare things like chess notations for a whole game vs. a board setup.

Saving change messages can make a lot of sense if it has some use beyond just simple persistence, though. Storage has gotten so cheap that there really isn't any reason not to save the entire version history of an 8K text file. The trick becomes making retrieval more efficient than running through the entire edit process.

Also, it may be easier to maintain a persistent state than a persistent history. As much as I'd like to moan about the difficulties of versioning objects stored in a database, consider the flip side problem of trying to version all the methods in your program! Given all the changes and fixes that creep into a program over time, there is something to be said for persisting the state of what an object is.

Stefan Huy said...

Sounds like the technique the PICT format used long ago, and is basically a good solution once you have a use case for reusing and/or keeping together the associated snippets that lead to your published state.

I used a similar technique in a JS/HTML framework for storing layout, and the interesting part is how quickly you find yourself "optimizing" (or how consistently you have to keep yourself from "optimizing") and therefore limiting the messages to your use case again by structuring/streamlining them into an internal, use-case-specific format. Which, OTOH, allows you to cut versioning dependencies (e.g. on the external frameworks you are using).

Marcel Weiher said...

Yes, the approach is not new: David Reed's thesis was published in 1978. State is a cache for the message histories, allowing the computation to quickly resume at a specific point in time (especially: right now!). All the systems I mention have such caches, otherwise they would probably not be practical. However, they do regard that state as just that: a cache. The semantics are carried by the messages.
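To make "state is a cache" concrete, here is a minimal Python sketch. It is not taken from any of the systems mentioned; the `History` class and the `(piece, square)` message shape are assumptions chosen to match the game example above. The message list is the authoritative record, and the snapshot merely shortens replay.

```python
import copy


class History:
    """Message history with a snapshot that acts purely as a cache.

    Hypothetical sketch: class name and (piece, square) messages are
    illustrative assumptions.
    """

    def __init__(self):
        self.messages = []    # the authoritative record
        self.snapshot = {}    # cached state ...
        self.snapshot_at = 0  # ... valid after this many messages

    def append(self, message):
        self.messages.append(message)

    @staticmethod
    def apply(state, message):
        # Interpret one message: a piece moves to a square.
        piece, square = message
        state[piece] = square

    def state_at(self, n=None):
        """Return the state after the first n messages (default: all)."""
        n = len(self.messages) if n is None else n
        if n >= self.snapshot_at:
            # Resume from the cache and replay only the tail.
            state, start = copy.deepcopy(self.snapshot), self.snapshot_at
        else:
            # The cache is too recent for this point in time; replay from the start.
            state, start = {}, 0
        for message in self.messages[start:n]:
            self.apply(state, message)
        return state

    def checkpoint(self):
        """Refresh the cache; it can be thrown away and rebuilt at any time."""
        self.snapshot = self.state_at()
        self.snapshot_at = len(self.messages)
```

Resetting `snapshot` and `snapshot_at` to their initial values changes nothing about what `state_at()` returns, only how long it takes, which is the sense in which the semantics stay with the messages.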

This change in focus from state/objects to messages also implies that you need to design that messaging interface with great care: you can't just dump any random message sent to your objects. Well, you can, and Smalltalk and Objective-C even make it technically easy.

Essentially, you want to design the messaging API with as much care as if you were using the command pattern, though obviously you don't need to create all the nonsense command classes in a language like Smalltalk or Objective-C.

In the BBC system, the messages were defined externally by the XML feeds. In Croquet, only select messages are persisted and replicated. In OmniFocus, specific changes are encoded as XML.

Anonymous said...

Where I see problems creeping in is that, in fact, it is often not the case that "semantics are carried by the messages". As bugs get fixed or features added and dropped over time, the very nature of what a message means changes over time. Consider something as simple as a setName: method that had changing validation requirements; what do you do with messages that no longer make sense and what cascade effects result from that decision? There is only so much you can break out of code before turning the persistence history into a scripting language that does run-time modification itself! Shades of re-implementing LISP, for better or worse.

Marcel Weiher said...

Having the messages be semantically meaningful is as much a requirement as anything else, and they need to be semantically meaningful at a fairly high level. Something like -setName: is not a semantically meaningful message in this sense, as it is manipulating state directly (even if wrapped in an accessor method).

Anonymous said...

It sounds more impossibly high than "fairly high". I just don't see how a complex system evolving over time can be expected to maintain the exact same persistence engine forever. If you classify a simple setter as unmeaningful, it strikes me that a message to do something more complex is going to be less meaningful! Like I said, the more you move to offload semantics into the persistence engine, the more you move towards it being like a language/library/script unto itself. It's not necessarily a bad thing, but it is definitely a trade-off for people to consider before they abandon state-based persistence.

Marcel Weiher said...

When the going gets impossible, the impossible get going? Seriously, this isn't wild speculation: systems like this have been built, and there is solid theory to back it up (see the references).

There is no 'off-loading' of semantics to the 'persistence engine'. That is what OR mappers, for example, tend to do, whereas this is exactly the opposite: your semantics take center stage, whereas the 'persistence engine' almost vanishes.