qtimeit
Simple performance profiling tool for both synchronous and asynchronous functions, suitable for micro-benchmarking nodejs language features.
- sensitive enough to time even really fast nodejs operations
- measures just the body of the test function, not the function call
- self-calibrating, does not include its own overhead in the results
- auto-calibrating, can run tests for a specified number of seconds
- works with both synchronous functions and functions taking a callback
- allows comparative benchmarking of a suite of test functions
Calibrates and measures by repeatedly running the test function, so avoid unintended side effects.
Overview
Examples of a one-off micro benchmark and a benchmark suite
var x, y;
var timeit = require('qtimeit');
timeit(10000000, function(){ x = [1, 2, 3]; });
// => 10000000 loops in 0.1095 of 0.27 sec: 91299026.24 / sec, 0.000011 ms each
timeit.bench([
    function() { x = [1, 2, 3]; },
    function() { x = [1, 2, 3]; y = [4, 5, 6]; }
]);
// #1 93,468,158 ops/sec (31 runs of 5m calls in 1.658 out of 4.373 sec, +/- 2.16%) 1000
// #2 47,414,812 ops/sec (22 runs of 5m calls in 2.320 out of 4.059 sec, +/- 1.94%) 507
More examples can be found in the qtimeit overview.
Api
timeit( countOrSeconds, testFunction(), [message] )
timeit( countOrSeconds, testFunction(cb), [message,] callback )
Call the testFunction count times, and report on its performance.
If the testFunction is a string it will be parsed into a function object.
If count is a decimal (has a fraction), the test will be looped for that many
seconds instead, and the report will include the actual number of loops run.
If a message is provided, it will be included at the start of the report line.
If a callback is provided, the user-provided callback will be called after the test
has been run count times. The testFunction itself will be invoked with a
callback that the test must call for timeit to finish.
var x, timeit = require('qtimeit');
timeit(10000000, function(){ x = [1, 2, 3]; });
// => 10000000 loops in 0.1095 of 0.27 sec: 91299026.24 / sec, 0.000011 ms each
timeit(10000000, function(cb) { x = [1, 2, 3]; cb() }, function(){});
// => 10000000 loops in 0.1107 of 0.52 sec: 90337752.78 / sec, 0.000011 ms each
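A seconds-based run, sketching the decimal-count behavior described above (the 2.5 is arbitrary):
var x, timeit = require('qtimeit');
timeit(2.5, function(){ x = [1, 2, 3]; });
// loops for about 2.5 seconds and reports the actual number of loops run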
timeit.bench( suite [,callback] )
Run each of the functions in the suite and report timings and relative throughput. The suite can be an array of functions, or an object where the properties are test functions and the property names are the test names to report on.
Bench works with both synchronous (no callback) and asynchronous (callback-taking) functions.
var x, y, timeit = require('qtimeit');
timeit.bench([
    function() { x = [1, 2, 3]; },
    function() { x = [1, 2, 3]; y = [4, 5, 6]; }
]);
// node=5.10.1 arch=ia32 mhz=3500 cpu="AMD Phenom(tm) II X4 B55 Processor" up_threshold=11
// name speed (stats) rate
// #1 93,468,158 / sec (31 runs of 5m in 1.658 over 4.373s, +/- 2.16%) 1000
// #2 47,414,812 / sec (22 runs of 5m in 2.320 over 4.059s, +/- 1.94%) 507
The reported fields are the test name (#1 etc, or the property name from the suite
object), the test speed in calls / second, statistics about the test runs (count of
timeit runs, timeit nloops, seconds used by the tests, total seconds elapsed, speed
run-to-run variability), and the normalized call rate. The normalized call rate is
the relative speed rank of each test. The first test is always 1000; the other tests
are proportionately higher if they ran more calls per second than the first test, or
lower if they ran fewer. (E.g. above: 47,415k / 93,468k = 0.5072, i.e. "507"
compared to the first test's "1000".)
timeit.bench.timeGoal
How long to loop each test before computing the average. Default 4.00 seconds.
timeit.bench.opsPerTest
How many operations are performed in each test function, for when the tests themselves loop. The number of ops/sec reported in the summary will be scaled up by this value. Default 1.
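For example, a sketch combining the two settings above (the inner loop count of 10 is arbitrary; the reported ops/sec is scaled up by opsPerTest so the per-operation rate stays comparable):
var x, timeit = require('qtimeit');
timeit.bench.timeGoal = 10;    // loop each test for ~10 seconds instead of the default 4
timeit.bench.opsPerTest = 10;  // each test function performs 10 operations per call
timeit.bench([
    function() { for (var i = 0; i < 10; i++) x = [1, 2, 3]; },
]);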
timeit.bench.cpuMhz
The processor speed to report in the platform summary line, in MHz. Qtimeit tries
to self-calibrate using /usr/bin/perf on linux systems, but calibration is not
perfect, and can under-report the speed. If calibration fails qtimeit normally
reports the unreliable figure included in os.cpus().
timeit.bench.visualize
If set, align the columns and append performance bars relative to the first result.
timeit.bench.baselineAvg
If set, the baseline ops / sec rate used to normalize the relative rankings. The default is to normalize to the first test's speed, and define that rate as the nominal 1000.
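A sketch, assuming baselineAvg is given in ops / sec as described (the 50m figure is arbitrary):
var x, timeit = require('qtimeit');
timeit.bench.baselineAvg = 50000000;  // a test running 50m ops/sec is ranked as the nominal 1000
timeit.bench([
    function() { x = [1, 2, 3]; },
]);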
timeit.bench.forkTests
If set, each test in the suite will be run in a separate process created with
child_process.fork
, with the results tabulated as usual.
Since fork re-runs the top level test script, any test preamble that should
only be output once can be placed into the timeit.bench.preRunMessage
field,
which will be written to the console before the tests are run.
timeit.bench.preRunMessage
The contents of this property are written to stdout before the tests are run.
See forkTests
for details.
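A minimal sketch using the two settings together (the message text is illustrative):
var x, timeit = require('qtimeit');
timeit.bench.forkTests = true;                          // run each test in its own child process
timeit.bench.preRunMessage = 'array allocation tests';  // printed once, even though fork re-runs the script
timeit.bench({
    'one array': function() { x = [1, 2, 3]; },
});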
timeit.bench.showRunDetails
Set to false
to omit the calls, runs and timing details from the results.
Only the name, ops/sec speed and relative rate are displayed.
timeit.bench.showTestInfo
Whether to include a line with the test configuration in the bench banner.
timeit.bench.bargraphScale
How many >>> to print per 1000 rank. Default 5.
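A sketch of the display settings together (note that per the changelog below, visualize=true and showRunDetails=false are already the defaults since 0.22.0):
var x, timeit = require('qtimeit');
timeit.bench.visualize = true;        // align columns and append performance bars
timeit.bench.showRunDetails = false;  // show only name, ops/sec and relative rate
timeit.bench.bargraphScale = 2;       // draw 2 bar characters per 1000 rank instead of the default 5
timeit.bench([
    function() { x = [1, 2, 3]; },
]);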
timeit.fptime( )
Nanosecond-resolution floating-point timestamp from process.hrtime(). The
timestamp returned does not have an absolute meaning (on Linux it's uptime(1),
the number of seconds since the last boot), but differences between timestamps
are accurate -- a difference of 1.00 is 1 elapsed second. The overhead is as
low as .6 microseconds per call, about 3x slower than Date.now().
var fptime = require('qtimeit').fptime
var t1 = fptime(); // 1809688.215437152
var t2 = fptime(); // 1809688.215462518
var t3 = fptime(); // 1809688.215466353
// 25.4 usec for the first call, 3.84 for the second
// uptime of 20 days, 22:40 hours
timeit.cpuMhz( )
Measure the speed of the processor using perf stat. Works on Linux, not sure
about other platforms. Returns a float, eg 4522.5421, else false if unable to
measure.
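A minimal usage sketch:
var timeit = require('qtimeit');
var mhz = timeit.cpuMhz();  // float eg 4522.5421, or false if measurement failed
if (mhz) console.log('cpu speed: ' + mhz + ' MHz');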
timeit.sysinfo( )
Return the information block that is also prepended to qtimeit.bench test runs.
This includes the node and v8 versions, the system architecture, and the cpu make,
model and speed in MHz.
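A usage sketch (assuming the returned block prints directly, like the bench banner above):
var timeit = require('qtimeit');
console.log(timeit.sysinfo());
// eg node=5.10.1 arch=ia32 mhz=3500 cpu="AMD Phenom(tm) II X4 B55 Processor" ...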
Comparisons
Testing with node-v5.10.1 (which on this test is 25% faster than node-v6.2.2):
Benchmarking with qtimeit (from above)
var x, y, timeit = require('qtimeit');
timeit.bench([
    function() { x = [1, 2, 3]; },
    function() { x = [1, 2, 3]; y = [4, 5, 6]; }
]);
// #1 93,468,158 / sec (31 runs of 5m in 1.658 over 4.373s, +/- 2.16%) 1000
// #2 47,414,812 / sec (22 runs of 5m in 2.320 over 4.059s, +/- 1.94%) 507
Benchmarking with benchmark
var x, y, benchmark = require('benchmark');
new benchmark.Suite()
    .add(function() { x = [1, 2, 3]; })
    .add(function() { x = [1, 2, 3]; y = [4, 5, 6]; })
    .on('cycle', function(ev) {
        console.log(ev.target.toString())
    })
    .run();
// <Test #1> x 38,281,069 ops/sec ±0.93% (92 runs sampled)
// <Test #2> x 25,866,187 ops/sec ±1.48% (93 runs sampled)
Benchmarking with bench
var x, y, bench = require('bench');
module.exports.compare = {
    'one array': function() { x = [1, 2, 3]; },
    'two arrays': function() { x = [1, 2, 3]; y = [4, 5, 6]; },
};
bench.runMain();
// one array
// Average (mean) 53302.44755244756
// two arrays
// Average (mean) 33285.46453546454
The last two sets of reported rates seem wrong: allocating two arrays is twice as
much work, thus should run at half the speed (take twice as long) as allocating just
one. The rates are also much lower than the qtimeit.bench-reported 93m and 47m
operations per second.
Sometimes it's possible to double-check the accuracy of the reported speeds from short scripts or even from the command line. So let's re-measure the rates with a barebones timed loop:
# node startup and loop overhead
% time node -p 'var x; for (var i=0; i<100000000; i++) ;'
0.208u 0.000s 0:00.21 95.2% 0+0k 0+0io 0pf+0w
# total time for [1,2,3]
% time node -p 'var x; for (var i=0; i<100000000; i++) x = [1,2,3];'
1.292u 0.000s 0:01.29 100.0% 0+0k 0+0io 0pf+0w
# total time for both [1,2,3] and [4,5,6]
% time node -p 'var x, y; for (var i=0; i<100000000; i++) { x = [1,2,3]; y = [4,5,6]; }'
2.348u 0.004s 0:02.35 99.5% 0+0k 0+0io 0pf+0w
# operations per second for the one array and two arrays
% echo '100000000 / (1.29 - .21)' | bc
92592592
% echo '100000000 / (2.35 - .21)' | bc
46728971
Raw: 92.59m/s. Benchmark: 38.28m/s, 142% off. Bench: 53.30m/s, 74% off. Qtimeit: 93.47m/s, 1% off.
Notes on Timing
Qtimeit tries to be careful about self-calibrating and subtracting its own overhead from the measured results. The time to invoke the test function is not included as part of the reported time, only the time to run its body. (For very fast-running functions this can result in absurd or even negative rates, because node timing is affected by the state of the heap and thus not overly deterministic. Sometimes the function body may have been optimized away, so make sure the test has a side effect so it cannot be skipped.)
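One way to guarantee a side effect (a sketch, not a qtimeit requirement beyond what is described above) is to assign the result to a variable outside the test function:
var x, timeit = require('qtimeit');
// risky: the body has no observable effect and could be optimized away
// timeit(10000000, function(){ [1, 2, 3]; });
// safer: the assignment to the outer x is a side effect the optimizer must keep
timeit(10000000, function(){ x = [1, 2, 3]; });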
Cpu Effects
To avoid potentially misleading timings, also run the test on just a single cpu.
Nodejs will at times run with multiple internal threads active that use more than
100% total cpu. This can be defeated by forcing the test to run on a single core.
On Linux this can be done with the taskset command.
When timing, be aware that modern cpus make performance profiling tricky, because the actual cpu speed can vary and may not be known.
- by default the cpu will be in power-save (slow) mode
- it takes some amount of computation to bring the cpu out of slow mode
- with just one core active, the cpu will run in turbo (extra-fast) mode
- with multiple cores active, the cpu can switch to a slower turbo mode
- some cpus can use super-turbo speeds faster than the preset turbo
- if the core temperature reaches an internal threshold, turbo mode might end
- very short one-liners that fit into the cpu cache can benchmark as much faster than when run in more realistic settings alongside other code
On Linux, these can be controlled somewhat by setting the scaling_governor to
performance (always fast) on one of the cores, then running the test on that
core. Eg, for core 3:
$ echo -n performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
$ taskset 4 node test.js
(/sys/devices/system/cpu numbers cpus starting at 0. The taskset core number is
a bitmask starting at 1, so core1 = 1, core2 = 2, core3 = 4, core4 = 8, etc. The
bitmask values can be added to specify any one of a set of cores.)
Nodejs Effects
Nodejs itself makes benchmarking less than straightforward:
- the first timeit() run often reports very different numbers than a rerun, for some tests higher, for some lower. The first run finds a clean heap, but has to allocate memory from the operating system
- the state of the heap affects timings, so changing the order of tests can change the results
- heap, garbage collection and function optimization/deoptimization effects can result in a large run-to-run variability
- the performance of the whole may not match that of the parts measured in isolation
- some language features (try/catch, eval, passing arguments) intentionally disable optimization of the immediately containing function (see the sketch after this list)
- some language quirks inadvertently turn off optimization (eg const in the middle of some functions)
- some latent language bugs can produce function optimize / deoptimize thrashing (eg in sometimes storing a constructor function arg into this vs setting it as a property on the object afterward)
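For instance, a sketch of working around the try/catch effect mentioned above: isolating the try/catch in a tiny wrapper confines the deoptimization to the wrapper and leaves the hot function optimizable (the function names are illustrative):
// only this small wrapper is deoptimized by the try/catch
function tryParse(str) { try { return JSON.parse(str) } catch (e) { return undefined } }

// the hot function contains no try/catch and stays optimizable
function parseAll(lines) {
    var results = [];
    for (var i = 0; i < lines.length; i++) results.push(tryParse(lines[i]));
    return results;
}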
Related Work
- qtimeit - this package
- benchmark - a popular benchmarking package, inaccurate
- bench - another benchmarking package
- qbson - BSON encode/decode functions whose timings prompted timeit.bench()
ChangeLog
- 0.22.2 - tone down noisy perf error, fix error test
- 0.22.1 - fix nextTick so can run with node-v0.8, report os speed if /usr/bin/perf fails
- 0.22.0 - make visualize=true, showRunDetails=false the default
- 0.21.4 - use nextTick if no setImmediate
- 0.21.3 - fix cpuMhz to fall back to os.cpus()[i].speed if /usr/bin/perf fails
- 0.21.2 - revert 0.21.1, repeatWhile must be a synchronous call
- 0.21.1 - break up repeatWhile call stack, fixes heap consumption and wrong fractional bench durations
- 0.21.0 - bench.bargraphScale control
- 0.20.0 - chop up repeatWhile call stack if used async, run test loops at least once, guard against NaN rank, do not draw overlong rank bars, show just 3 sig figs
- 0.19.0 - faster and more accurate timed loop calibration, better overhead calibration, more consistent results, much more accurate timed test durations
- 0.18.0 - bench.forkTests, bench.preRunMessage settings, back-filled this changelog
- 0.17.0 - bench.baselineAvg setting
- 0.16.0 - bench.visualize setting
- 0.15.0 - timeit.sysinfo() and timeit.cpuMhz()
- 0.14.0 - time and report actual cpu mhz
- 0.13.0 - bench.opsPerTest setting
- 0.12.1 - bench.timeGoal setting
- 0.11.5 - qtimeit.bench benchmark runner for sync and async tests
Todo
- tiny command line one-liners can run out of cache and give inflated results
- need a way to force deoptimization so can time deoptimized version too
- measure and report on mem usage (might be enough to check on each return?)
- if installed, measure max rss with qrusage, gc activity with gc-stats