Benchmarking is an art

Prelude

A recent project of mine has to process a lot of data.
To give you some perspective on what I mean by "a lot": about 850GB of bzip2-compressed data, roughly 4.4TB uncompressed, 2.4 billion records.
Side note: bzip2 is amazing at shrinking this kind of data, so a very big thank you to the developers and mathematicians who came up with the algorithms.

So every clock cycle counts, since it is multiplied by a lot of records. Saving 1 µs per record saves about 40 minutes of total runtime.
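The arithmetic behind that figure, as a quick sanity check. The record count is the one quoted above; the variable names are just mine for illustration.

```javascript
// Back-of-the-envelope: what does a per-record saving add up to?
const records = 2.4e9;           // about 2.4 billion records (from above)
const savedPerRecordSec = 1e-6;  // 1 µs saved per record

const savedSeconds = records * savedPerRecordSec;
const savedMinutes = savedSeconds / 60;

console.log(`${savedSeconds.toFixed(0)} s = ${savedMinutes.toFixed(0)} min`);
```

The same two lines work in the other direction as well: a change that costs 1 µs per record adds the same 40 minutes.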

Setup

The setup is simple: I have a program that I want to optimize and a subset of the data to test it with, and I want to find the bottleneck and optimize it. Normally one program reads through the raw data, extracts the relevant parts and sends them over the network to the program under test.
To take the CPU load of the sender out of the picture, I used netcat to store the data from the network and to replay it without much CPU load. At the speed I'm working at, netcat still used 3-4% CPU on one core, but on a multi-core system this is no issue.
I'm using pipebench to measure the speed and the time it takes to send the data. To avoid the "abuse of cat" thoughts, the command is:

pipebench < data-21.json | nc 127.0.0.1 2000

The program to torture is a JavaScript/Node.js tool. No hate please, it is what the job needs, and to spoil a bit: the V8 engine is insanely fast.

Lesson 1

The first step to benchmark is consuming the data, splitting it into lines and then throwing it away. And boy, I was surprised by the speed of this. It consumed the 3GB of test data at around 1.7GB/s, quite consistently between runs (5 runs, all around 1.7 seconds ± 50 ms). Needless to say, the data came from an NVMe SSD, no spinning rust.
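I'm not showing the real tool here, but a minimal sketch of what "split into lines and throw away" can look like in Node.js is below. The name createLineCounter and its fields are made up for this example, and the port is simply the one from the nc command above.

```javascript
// Minimal sketch: split incoming chunks into lines and count them.
// createLineCounter is an illustrative name, not the real tool.
function createLineCounter() {
  let remainder = ""; // partial line left over from the previous chunk
  let lines = 0;

  return {
    // Feed one chunk (string) as it arrives from the socket.
    push(chunk) {
      const parts = (remainder + chunk).split("\n");
      remainder = parts.pop(); // last element is an incomplete line (or "")
      lines += parts.length;   // the complete lines are simply discarded
    },
    // Call once the stream has ended to count a final unterminated line.
    end() {
      if (remainder.length > 0) lines += 1;
      remainder = "";
      return lines;
    },
  };
}

// Wiring it to a TCP socket (left commented out so the sketch does not block):
// const net = require("net");
// net.createServer((socket) => {
//   const counter = createLineCounter();
//   socket.setEncoding("utf8");
//   socket.on("data", (chunk) => counter.push(chunk));
//   socket.on("end", () => console.log(counter.end(), "lines"));
// }).listen(2000);
```

The only state carried between chunks is the unfinished last line, because a chunk boundary can fall anywhere inside a record.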

The next step was to enable JSON parsing (the generated data is JSON).
Adding this one line slowed it down to roughly 32.49 seconds or 92.3MB/s, and this time the runs differed by ± 0.4 seconds.
So the issues start creeping in.

Now to add the final step, the actual use of the JSON data. This time the results were all over the place.
The average was 46 seconds, but the first run was the fastest at 45s, and the times increased steadily by about 0.5s per run up to 47s. This confused me,
so I ran the series again. This time I got almost the same times as in the first series.
This time I also noticed the CPU fan speed increasing steadily. What had happened?
Since I ran those tests back to back, the CPU got hotter and hotter, so it started to lower the turbo or boost clock (or whatever this is called now) a tiny bit. Enough to mess up the benchmarks.
So this is lesson number 1: Watch the environment!
For me this meant increasing the "idle" speed of the fan, and disabling the variable CPU clock frequency or adding a delay between the tests to let the system cool down again.
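The "delay between the tests" idea can be sketched like this. The function names, the 60 second default and the drift check are my own choices for this example, not something a tool prescribes.

```javascript
// Sketch: run a benchmark several times with a pause in between and check
// whether the timings drift upwards (a hint at thermal throttling).
// runWithCooldown, summarize and the 60 s default are illustrative choices.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runWithCooldown(benchmark, runs = 5, cooldownMs = 60000) {
  const seconds = [];
  for (let i = 0; i < runs; i++) {
    if (i > 0) await sleep(cooldownMs); // let the CPU cool down
    const start = process.hrtime.bigint();
    await benchmark();
    const elapsedNs = process.hrtime.bigint() - start;
    seconds.push(Number(elapsedNs) / 1e9);
  }
  return seconds;
}

function summarize(seconds) {
  const mean = seconds.reduce((a, b) => a + b, 0) / seconds.length;
  return {
    mean,
    min: Math.min(...seconds),
    max: Math.max(...seconds),
    // positive drift: the last run was slower than the first
    drift: seconds[seconds.length - 1] - seconds[0],
  };
}
```

Feeding it a series shaped like mine, for example summarize([45, 45.5, 46, 46.5, 47]), gives a mean of 46 and a drift of 2. A drift that is large compared to the spread between neighbouring runs is the pattern to look out for.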

Lesson 2

This is the old premature optimization trap.
Consider these JavaScript code snippets:

var variable1 = json[0];
var variable2 = json[1];
var variable3 = json[2];
var variable4 = json[3];
doSomething(variable1, variable2, variable3, variable4)

doSomething(json[0], json[1], json[2], json[3])

Assume variable1 to variable4 are each used only once, as above.
The second snippet is faster, right?
Nope, both run at the same speed within measuring tolerance. But the first one (if proper variable names are used) is way more readable.
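If you want to check this yourself, a rough sketch could look like the following. Note that doSomething here is only a stand-in (the real function is not part of this article), timeIt is a helper name I made up, and my own numbers came from the full pipeline, not from an isolated loop like this one, so treat the output as a hint at best.

```javascript
// Sketch: time both call styles on the same input.
// doSomething is a stand-in; the real function is not shown in this article.
function doSomething(a, b, c, d) {
  return a + b + c + d;
}

function withVariables(json) {
  const variable1 = json[0];
  const variable2 = json[1];
  const variable3 = json[2];
  const variable4 = json[3];
  return doSomething(variable1, variable2, variable3, variable4);
}

function inline(json) {
  return doSomething(json[0], json[1], json[2], json[3]);
}

// timeIt is a hypothetical helper: runs fn repeatedly and returns the time.
function timeIt(fn, input, iterations) {
  let sink = 0; // accumulate the results so the calls are not dead code
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) sink += fn(input);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  return { ms, sink };
}

const sample = [1, 2, 3, 4];
const resultVariables = timeIt(withVariables, sample, 1e6);
const resultInline = timeIt(inline, sample, 1e6);
console.log(
  `variables: ${resultVariables.ms.toFixed(1)} ms, inline: ${resultInline.ms.toFixed(1)} ms`
);
```

Run it a few times and swap the order of the two calls. If the difference changes sign between runs, it is below your measuring tolerance.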
What if doSomething is a bit more complicated:

var groupKey = [variable1, variable2, variable3.somefield === undefined && variable3.someotherfield === undefined];
var groupString = groupKey.join(",");

var groupString = json[0] + "," + json[1] + "," + (json[2].somefield === undefined && json[2].someotherfield === undefined);

In other words: creating a list and calling the join function vs. concatenating strings.
The second version is slightly faster: in my case 35.652s vs 36.058s, or 127005 records/s vs 125575 records/s.
Without the learnings from lesson 1, this difference would be smaller than the measuring tolerance, and I would not be able to see it properly.
But with the full remaining processing steps enabled, this difference can no longer be seen. So while version 2 is slightly faster on its own, readability is king here. If I look at the code again in a year or so, I have a better chance of understanding what the heck I was doing.
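To put the two effects side by side, here is the relative size of each, using only the times quoted in this article. The two series were measured with different processing steps enabled, so this is a rough comparison, not an exact one.

```javascript
// Relative gain of string concatenation over list + join()
const joinSeconds = 36.058;   // list + join()
const concatSeconds = 35.652; // string concatenation
const gainPercent = ((joinSeconds - concatSeconds) / joinSeconds) * 100;

// Thermal drift from lesson 1: first run 45 s, last run 47 s
const driftPercent = ((47 - 45) / 45) * 100;

console.log(`gain:  ${gainPercent.toFixed(1)} %`);  // gain:  1.1 %
console.log(`drift: ${driftPercent.toFixed(1)} %`); // drift: 4.4 %
```

A gain of about 1% is well below a drift of about 4%, which is why this difference was only visible once the environment was under control.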

Lesson 3

Don't trust other people's benchmarks, do your own.
In my tests, the biggest slow-down came from enabling the JSON parsing, which makes total sense. So I looked for optimized parsing functions.
I found some that claimed to be faster than JSON.parse(). So I installed them, ran the tests and nope, they were slower. One example:
JSON.parse() took 32.436s (139598 records/s) vs. 38.136s (118732 records/s) for the other implementation.

Maybe it was faster at the time of writing or with their data set, but not with my data and recent Node.js/V8 versions.
So find your bottlenecks, change the component and benchmark with your data.
Don't trust the stuff others claim (including this text, of course).
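A minimal sketch of such a comparison is below. benchmarkParsers is a name I made up, the generated lines are only stand-in data, and the commented-out alternative is hypothetical; plug in whatever parser you want to test and, more importantly, a slice of your real data.

```javascript
// Sketch: compare parsers on your own data.
// "candidates" maps a label to a parse function. Only JSON.parse is real here.
function benchmarkParsers(lines, candidates) {
  const results = {};
  for (const [name, parse] of Object.entries(candidates)) {
    let parsed = 0;
    const start = process.hrtime.bigint();
    for (const line of lines) {
      if (parse(line) !== undefined) parsed++; // use the result
    }
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    results[name] = { parsed, seconds, recordsPerSecond: parsed / seconds };
  }
  return results;
}

// Stand-in data: in practice, load a representative slice of your real input.
const lines = Array.from({ length: 100000 }, (_, i) =>
  JSON.stringify([i, "key" + i, { somefield: i }])
);

const results = benchmarkParsers(lines, {
  "JSON.parse": (line) => JSON.parse(line),
  // "alternative": (line) => someOtherParser(line), // hypothetical
});
console.log(results);
```

Every candidate sees exactly the same lines in the same process, so at least the data set is no longer a variable. The order of the candidates and the warm-up of the engine still are, so it is worth running each one more than once.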

Summary

Do your own benchmarks, don't rely on others or on your "intuition"; optimizers have become insanely good.
Check and control your environment, such as temperatures, to prevent performance losses from thermal effects or heat soak.
Sometimes readability carries much more weight than a tiny speed improvement. And this comes from a performance expert who, according to this article, should not be trusted without violating the first sentence in this paragraph ;)
