That was using an encoder method (that writes key/value pairs to the JSON encoder) written in Objective-S, and I wondered how much faster it would go if that was no longer the case. Twice as fast, it turns out.
Yesterday, I wrote about tuning Objective-S's SQLite insert performance to around 130M rows/minute, coincidentally
also for a simple tasks schema. One part of that performance story was the fact that the encoder method (writing key/value
pairs to the SQLite encoder) was generated by pasting together Objective-C blocks and installing the whole thing
as an Objective-C method. No interpretation, except for calling a series of blocks stored in an NSArray.
I had completely forgotten about the hand-written Objective-S encoder method in the back-end's Task class!
Since generation is automatic, but won't override an already existing method, all I had to do in order to get the
better performance was delete the old method.
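The mechanism described above can be illustrated with a minimal sketch. It is written in Python as a language-neutral analogy, not in Objective-C, and it is not the actual Objective-S implementation: closures stand in for Objective-C blocks, a plain dict stands in for the encoder, and the names `make_step`, `install_generated_encoder` and `encode_into` are hypothetical. It shows the two properties that matter here: the generated method only walks a prebuilt list of steps, and generation leaves an existing hand-written method alone.

```python
from typing import Any, Callable, Dict, List

# One "step" plays the role of one block: it copies a single property
# from the object into the encoder (a dict in this sketch).
Step = Callable[[Any, Dict[str, Any]], None]


def make_step(key: str) -> Step:
    def step(obj: Any, encoder: Dict[str, Any]) -> None:
        encoder[key] = getattr(obj, key)
    return step


def install_generated_encoder(cls: type, keys: List[str],
                              name: str = "encode_into") -> bool:
    """Install a generated encoder on cls unless cls already defines one."""
    if name in vars(cls):
        # A hand-written method exists on this class: do not override it.
        return False
    steps = [make_step(key) for key in keys]

    def encode_into(self: Any, encoder: Dict[str, Any]) -> None:
        # No interpretation here, just a loop over the stored steps.
        for step in steps:
            step(self, encoder)

    setattr(cls, name, encode_into)
    return True


class Task:
    def __init__(self, id: int, done: int, title: str) -> None:
        self.id, self.done, self.title = id, done, title


class LegacyTask(Task):
    # Stands in for a forgotten hand-written encoder method.
    def encode_into(self, encoder: Dict[str, Any]) -> None:
        encoder["legacy"] = True
```

In this sketch, `install_generated_encoder(Task, ["id", "done", "title"])` installs the generated method, while the same call on `LegacyTask` returns `False` and keeps the hand-written one, which is why deleting the old method is enough to switch over.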
> wrk -c 1 -t 1 http://localhost:8082/tasks
Running 10s test @ http://localhost:8082/tasks
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    66.60us    9.69us   1.08ms   96.72%
    Req/Sec    14.95k   405.55    15.18k    98.02%
  150275 requests in 10.10s, 30.67MB read
Requests/sec:  14879.22
Transfer/sec:      3.04MB
> curl http://localhost:8082/tasks
[{"id":1,"done":0,"title":"Clean Room"},{"id":2,"done":1,"title":"Check Twitter"}]%
More than twice the performance, and that while fetching two tasks instead of just one, so around 30K tasks/second! (And yes, I checked that I wasn't hitting a 404...).
So what's the performance if we actually fetch more than a minimal number of tasks? For 128 tasks, 64x more than before, it's still around 9K requests/s, so most of the time so far was per-request overhead. At this point we are serving a little over 1M tasks/s:
> wrk -c 1 -t 1 'http://localhost:8082/tasks/'
Running 10s test @ http://localhost:8082/tasks/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   112.13us   76.17us   5.57ms   99.63%
    Req/Sec     9.05k   397.99     9.21k    97.03%
  90923 requests in 10.10s, 483.41MB read
Requests/sec:   9002.44
Transfer/sec:     47.86MB
If memory serves, that was around the rate we were seeing with the Wunderlist backend when we had a couple of million users, not that these are comparable in any meaningful way. For 1024 tasks there's a significant drop to slightly above 1.8K requests/s, with the task-rate almost doubling to 1.8M/s:
> wrk -c 1 -t 1 'http://localhost:8082/tasks/'
Running 10s test @ http://localhost:8082/tasks/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   552.06us   62.77us   1.84ms   81.08%
    Req/Sec     1.82k    52.95     1.89k    90.10%
  18267 requests in 10.10s, 778.36MB read
Requests/sec:   1808.59
Transfer/sec:     77.06MB
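The claim that per-request overhead dominates the small responses can be checked with some rough arithmetic on the three wrk runs. The sketch below is in Python and rests on two assumptions that are not from the measurements themselves: that with a single connection the time per request is about 1 / (requests/s), and that this time is a fixed overhead plus a per-task cost.

```python
# Requests/s reported by wrk in the three runs, keyed by tasks per response.
requests_per_s = {2: 14879.22, 128: 9002.44, 1024: 1808.59}

# Assumption: with a single connection, time per request ~ 1 / (requests/s).
time_us = {n: 1e6 / rps for n, rps in requests_per_s.items()}


def per_task_cost_us(n_small: int, n_large: int) -> float:
    """Slope of request time between two response sizes, in microseconds per task."""
    return (time_us[n_large] - time_us[n_small]) / (n_large - n_small)


# Assumed model: time per request = fixed overhead + tasks * per-task cost.
small_slope = per_task_cost_us(2, 128)
large_slope = per_task_cost_us(128, 1024)
overhead_us = time_us[2] - 2 * small_slope

for n, rps in requests_per_s.items():
    print(f"{n:5d} tasks/request: {time_us[n]:7.1f} us/request, {n * rps:12,.0f} tasks/s")
print(f"per-task cost: {small_slope:.2f} to {large_slope:.2f} us")
print(f"fixed per-request overhead: {overhead_us:.1f} us")
```

Under these assumptions the fixed cost comes out at roughly 66 microseconds per request, against well under a microsecond per task, which is consistent with the two-task case being almost entirely per-request overhead.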
UPDATE:
Of course, those larger request sizes also see a much larger increase in performance than 2x. With the old code, the 128 task case clocks in at 147 requests/s and the 1024 task case at 18 requests/s, at which point it's a 100x improvement. That gives you an idea of just how slow my Objective-S interpreter is.
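For reference, the speedups follow directly from the request rates quoted in this post. The short Python calculation below uses only those figures; the variable names are illustrative.

```python
# Requests/s with the generated encoder (from the wrk runs in this post).
new_rps = {128: 9002.44, 1024: 1808.59}
# Requests/s with the old hand-written Objective-S encoder (from the update).
old_rps = {128: 147.0, 1024: 18.0}

# Speedup per response size: new rate divided by old rate.
speedup = {n: new_rps[n] / old_rps[n] for n in new_rps}

for n, factor in speedup.items():
    print(f"{n} tasks/request: {factor:.0f}x faster")
```

The 128 task case works out to a bit over 60x and the 1024 task case to about 100x.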
if i understand correctly, switching from interpreted (tuned by hand) code to autogenerated (but native, not jit-ed) code was 2x faster?
and this is using obj-c…. im curious what is the overhead of objc_msg_send() for all of this?
is it a wash (you might have to do this kind of dispatching anyways?) or if eliminated it wouldnt make much difference (not in the hot path?)
The actual difference between the Objective-S interpreted code and the generated code was more around 100x (see the case with lots of tasks serialised). The Objective-S interpreter is very slow.
That turns into a 2x speed difference (in requests/s) for the case with only 2 tasks, which was the one I had done previously, hence the title. In that case all the other request processing overhead is more significant.
In the former case, objc_msgSend() is around 20%, in the latter 7% (more in select() and friends). String handling is still the bigger problem, and that also contributes to the message sending.
I tend not to worry too much about objc_msgSend(), since it is fairly easy (if arduous) to get rid of if it's actually monomorphic.