Tuesday, March 27, 2012

30k entries (I), aka computers are fast

Let's assume a document storage system with a maximum working set of 30K documents. Let's say we want to assign tags to documents and search based on those tags, with an average of 10 tags per document. Using the most brain-dead algorithm available, a linear scan of the document entries with string comparison on the tags, what would it take to search through those documents? Could we maintain immediate feedback?
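To make that concrete, here is a minimal sketch of such a scan in C; the Document layout and function names are made up for illustration:

    #include <string.h>

    /* Hypothetical document record: an id plus its tag strings. */
    typedef struct {
        int         id;
        const char *tags[10];   /* up to 10 tags, matching the average above */
        int         num_tags;
    } Document;

    /* Brain-dead search: linear scan over all documents, strcmp() per tag.
       Fills 'results' with the ids of matching documents. */
    int search_by_tag(const Document *docs, int num_docs,
                      const char *query, int *results)
    {
        int found = 0;
        for (int i = 0; i < num_docs; i++) {
            for (int j = 0; j < docs[i].num_tags; j++) {
                if (strcmp(docs[i].tags[j], query) == 0) {
                    results[found++] = docs[i].id;
                    break;   /* one hit per document is enough */
                }
            }
        }
        return found;
    }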

Measuring quickly on my laptop reveals that strcmp() takes around 8ns for a long matching string and 2ns for a non-match in the first character (with the first-character optimization). Splitting the difference, and thus ignoring the fact that non-matches tend to be more common than matches, let's assume 5ns to compare each tag.
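For reference, a rough microbenchmark along these lines might look as follows; the tag strings and iteration count are arbitrary choices, and the numbers will of course vary with machine and compiler:

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        /* Hypothetical tag strings; real tags will differ. */
        const char *match_a = "a-reasonably-long-matching-tag";
        const char *match_b = "a-reasonably-long-matching-tag";
        const char *miss    = "x-differs-in-the-first-char";
        enum { N = 100000000 };

        /* volatile pointers and sink keep the compiler from
           folding the constant strcmp() calls away */
        const char * volatile pa = match_a;
        const char * volatile pb = match_b;
        const char * volatile pc = miss;
        volatile int sink = 0;

        clock_t t0 = clock();
        for (int i = 0; i < N; i++) sink += strcmp(pa, pb);
        clock_t t1 = clock();
        for (int i = 0; i < N; i++) sink += strcmp(pa, pc);
        clock_t t2 = clock();

        printf("match:    %.1f ns/call\n",
               (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / N);
        printf("mismatch: %.1f ns/call\n",
               (double)(t2 - t1) * 1e9 / CLOCKS_PER_SEC / N);
        (void)sink;
        return 0;
    }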

    5 ns/tag x 10 tags/document x 30K documents =
                 50 ns/document x 30K documents =
                                   1,500,000 ns =
                                        1500 µs = 1.5 ms
So an approach that takes longer than, say, 2ms to do such a search can probably be improved.

Of course, we could do something slightly less thoroughly brain-dead and represent tags using integer, er, tags. A simple integer comparison should take less than one nanosecond, so that would drop the time to below 300 µs. With that, we could do 3000 queries per second, or 300 queries every tenth of a second (the generally accepted threshold for interactive performance).
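In sketch form, again with made-up names, the scan with interned integer tags becomes:

    /* Same linear scan, but with tags interned as small integers. */
    typedef struct {
        int id;
        int tags[10];
        int num_tags;
    } IntDocument;

    int search_by_int_tag(const IntDocument *docs, int num_docs,
                          int query, int *results)
    {
        int found = 0;
        for (int i = 0; i < num_docs; i++) {
            for (int j = 0; j < docs[i].num_tags; j++) {
                if (docs[i].tags[j] == query) {   /* one integer compare */
                    results[found++] = docs[i].id;
                    break;
                }
            }
        }
        return found;
    }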

In theory, we could actually start optimizing ever so slightly by storing a list of document ids with each tag and then simply doing set operations on those lists, as sketched below. But we don't really have to.
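For completeness, such an inverted index could be as simple as the following; keeping the id lists sorted turns a multi-tag query into a linear-time list intersection (all names here are illustrative):

    /* Hypothetical inverted index entry: for each tag,
       a sorted list of the ids of documents carrying it. */
    typedef struct {
        int *doc_ids;   /* sorted ascending */
        int  count;
    } TagPostings;

    /* Intersect two sorted id lists; the result holds the ids
       of documents that carry both tags. */
    int intersect(const TagPostings *a, const TagPostings *b, int *result)
    {
        int i = 0, j = 0, n = 0;
        while (i < a->count && j < b->count) {
            if      (a->doc_ids[i] < b->doc_ids[j]) i++;
            else if (a->doc_ids[i] > b->doc_ids[j]) j++;
            else { result[n++] = a->doc_ids[i]; i++; j++; }
        }
        return n;
    }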
