Friday, March 30, 2012

Why Performance Matters for Testing

While speed certainly matters in a production/e-commerce environment, I think it is also a significant factor for unit testing and especially TDD. While my tests haven't been tuned and certainly aren't the fastest in the world, my frameworks do test themselves in around 3 seconds, using at present almost exactly 1000 tests.

That level of performance makes testing qualitatively different from having tests that run in 7 minutes (down from 15 after the serious performance tuning described in the article), or 5 minutes down from 10. It means running the unit tests can be a normal part of the edit-compile-run cycle, rather than a separate activity, supporting the idea that the tests are simply an intrinsic part of the code.

These are not lightweight tests: they set up and tear down Postscript interpreters, for example, and run PDF and Postscript documents through text-extraction engines. However, they run almost exclusively in memory and mostly in a compiled language.

However, even 3 seconds is still too long a delay; feedback should be instantaneous to give a true interactive programming experience, or at least not get in the way of that experience. I saw a nice approach to this at the Hasso Plattner Institute, which used code coverage analysis to interactively run tests sorted by relevance to the code being edited (in Smalltalk). A simpler approach might be to just run the unit tests in the background while editing.

Wednesday, March 28, 2012

Three Kinds of Agile

William Edwards writes that Agile Is a Sham. He got quite a few intelligent comments, including some pointing out that the values he espouses are exactly those of the Agile Manifesto:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

In my working life, I have encountered "Agile" three times. The last time was at a startup based in Oakland. It was Scrum, it was imposed from the top, a tool was introduced first in order to do agile tracking and planning, and none of the actual technical practices were ever seriously considered. In short, exactly the kind of sham that William complains about. It had zero benefit.

Before that, we had two groups at BBC News Interactive starting to do their respective interpretations of XP. One did the simplest thing that could possibly work, for example never getting around to putting in the database, and did so in a test-first style. It didn't do standup meetings, paired at times, and every once in a while did a planning game. This team replaced a system that failed almost every single day and had abysmal performance on a 12-computer setup with one that had two failures in 3 years and performed 100-1000 times better on a single machine. The second team did standup meetings and planning games, but never managed to implement the technical practices such as TDD or DTSTTCPW/YAGNI. This team failed to deliver and the project was canceled/reset.

Finally, I first encountered what I would now recognize as XP before I knew that XP existed. Hard core YAGNI/DTSTTCPW, mixing pairing and working alone both organically and usefully. We introduced unit testing later and I became an immediate convert, having been responsible for shipping this product for some time and feeling the uncertainty over the question "when will it be ready". I later discovered that a bundle of these practices that we as coders had discovered for ourselves were known as XP, and I eagerly learned what I could from their experience.

So is Agile a sham? Maybe the dadaist manifesto put it best: if you are against this manifesto, you are a dadaist!

Tuesday, March 27, 2012

CocoaHeads Berlin Performance Talk

Last Wednesday, March 21st 2012, I held a talk on Cocoa Performance at the monthly Berlin CocoaHeads meeting. Thanks to everyone for the kind reception and benignly overlooking the fact that the end of the talk was a bit hurried.

Although the slides were primarily supportive and do not necessarily stand on their own, they are now available online by popular request (2.3MB, Keynote).

And a PDF version (6.3MB).

30k requests/s, aka wrk is fast

I just discovered wrk, a small and very fast http load testing tool. My previous experiments with µhttp-based MPWSideWeb first maxed out at around 5K requests per second, and after switching to httperf, I got up to around 10K per second. Using wrk on the same machine as previously, I now get this:
marcel@nomad[~]wrk  -c 100 -r 300000 http://localhost:8082/hi
Making 300000 requests to http://localhost:8082/hi
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.60ms    2.51ms  11.47ms   62.40%
    Req/Sec    14.98k     0.99k   16.00k    52.80%
  300002 requests in 9.71s, 21.46MB read
Requests/sec:  30881.96
Transfer/sec:      2.21MB
marcel@nomad[~]curl http://localhost:8082/hi
Hello World!

So a nice 30K requests per second on a MacBook Air, with wrk running at only around 60% CPU, whereas httperf tended to be pegged at 100%. The web server in question is a minimal sinatra.rb-style server set up to return a simple "Hello World!".
marcel@localhost[scripts]cat memhttpserver.stsh 
context loadFramework:'MPWSideWeb'
server := MPWHTTPServer new.
server setPort: 8082.
stdout println: 'memhttpserver listening on port ',server port stringValue.
server start:nil.

scheme:base := MPWSiteMap scheme.
base:/hi := 'Hello World!'.
server setDelegate: scheme:base .

shell runInteractiveLoop

memhttpserver listening on port 8082

So I definitely have a new favorite http performance tester!

30k entries (II), aka computers have RAM, and they can do I/O, too...

Let's assume a document storage system with a maximum working set of 30K documents. Let's also assume we want to store some tags, maybe 10 per document, encoded as 32-bit integers (8 bits tag type, 24 bits tag value used as an index). That would be:
    30K documents x 10 tags/document x 4 bytes/tag = 
                           300K tags x 4 bytes/tag =
                                      1200 K bytes = 1.2 MB
Even assuming 2:1 bloat due to overhead gives us 2.4 MB, which should not just fit comfortably into the RAM of a modern computer or a cellphone; it actually fits comfortably into the 8-10MB L3 cache of an Intel Core i7, with room to spare.

What about getting that data into RAM? The slowest (non-SSD) hard drives I could find in a quick web search had a transfer rate of better than 48MB/s and a seek time of around 10ms, so the 2.4MB in question should be in memory in around:

 10ms + 2.4MB / (48MB/s) = 
           10ms + 0.05 s =
           10ms +  50 ms =  60 ms
So less than 1/10th of a second to read it in, and a moderately fast SSD reduces that to 10ms.

EDIT: fixed embarrassing typo (L1 -> L3 cache).

30k entries (I), aka computers are fast

Let's assume a document storage system with a maximum working set of 30K documents. Let's say we want to assign tags to documents and search based on those tags, with an average of 10 tags per document. Using the most brain-dead algorithm available, a linear scan of the document entries with string comparison on the tags, what would it take to search through those documents? Could we maintain immediate feedback?

Measuring quickly on my laptop reveals that strcmp() takes around 8ns for a long matching string and 2ns for a non-match in the first character (with first character optimization). Splitting the difference and thus not taking into account that non-matches tend to be more common than matches, let's assume 5ns to compare each tag.

 5ns /tag x 10 tags / document x 30k documents = 
               50ns / document x 30K documents = 
                                      1500K ns =  
                                       1500 µs = 1.5 ms
So an approach that takes longer than, say, 2ms to do such a search can probably be improved.

Of course, we could do something slightly less thoroughly braindead and represent tags using integer, er, tags. A simple integer comparison should take less than one nanosecond, which would drop the time to below 300 µs. With that, we could do over 3000 queries per second, or 300 queries every tenth of a second (the generally accepted threshold for interactive performance).

In theory, we could start optimizing ever so slightly by storing a list of document ids with each tag and then simply doing set operations on those lists. But we don't really have to.