What am I even measuring?
I've been working recently on interpreting the results of a benchmark, and, as is the nature of benchmark interpretation, there are many questions.
A benchmark itself, in my opinion, is simple. You get some computer, you run some program on it, and you get some result.
So here's the benchmark I ran:
On an 11th Gen Intel® Core™ i5-11500B @ 3.30GHz running Qubes, I made a VM with a single VCPU, running kernel 6.12.47-1.fc37 with a Debian userland, and built the following program with rustc 1.91.1 (ed61e7d7e 2025-11-07).
use std::time::Instant;
use std::hint::black_box;

fn thing() {
    ()
}

fn measure_once() {
    let before = Instant::now();
    for _ in 0..1_000_000 {
        black_box(thing());
    }
    let elapsed = before.elapsed();
    let nanos = elapsed.as_nanos();
    println!("measure once");
    println!("total {} ns, per {} ns", nanos, nanos / 1_000_000);
}

fn measure_a_million() {
    let mut sum: u128 = 0;
    for _ in 0..1_000_000 {
        let before = Instant::now();
        black_box(thing());
        let elapsed = before.elapsed();
        let nanos = elapsed.as_nanos();
        sum += nanos;
    }
    println!("measure every time");
    println!("total {} ns, per {} ns", sum, sum / 1_000_000);
}

fn main() {
    measure_once();
    measure_a_million();
}
The first iteration didn't use black_box, but I was cross-checking the results with a friend who expressed some concern about compiler optimizations. So I took a look at the assembly with objdump --disassemble --demangle (gotta use --demangle because Rust), and I saw a call instruction, which suggested the function call was still there. But he wasn't an assembly expert, and seeing as black_box is now in the Rust standard library, it seemed wise to use it (and more reliable than futzing about with volatile in C, which was an idea that got kicked around).
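For what it's worth, here's a minimal, self-contained sketch of what black_box is for. This is my own illustration, not the benchmark above: without the black_box calls, an optimizing build is free to fold the whole computation into a constant, and you'd be timing nothing.

```rust
use std::hint::black_box;

// A trivially foldable computation. In an optimized build without
// black_box, the compiler may compute the answer at compile time and
// delete the loop entirely.
fn sum_to(n: u64) -> u64 {
    let mut total = 0u64;
    for i in 0..n {
        // black_box makes the optimizer treat `i` as observed,
        // so the loop body can't be eliminated as dead code.
        total += black_box(i);
    }
    total
}

fn main() {
    // black_box on the argument and the result keeps the call itself
    // from being constant-folded at the call site.
    let result = black_box(sum_to(black_box(1_000)));
    println!("sum = {}", result); // prints "sum = 499500"
}
```

Note that black_box is a hint, not a guarantee; checking the disassembly, as above, is still the only way to be sure.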
I ran it several times and got these results:
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14161672 ns, per 14 ns
measure every time
total 21823193 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 12956590 ns, per 12 ns
measure every time
total 21358464 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13735194 ns, per 13 ns
measure every time
total 21926421 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14379680 ns, per 14 ns
measure every time
total 20454596 ns, per 20 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13689761 ns, per 13 ns
measure every time
total 18628914 ns, per 18 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14054976 ns, per 14 ns
measure every time
total 19446841 ns, per 19 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 7608650 ns, per 7 ns
measure every time
total 18857253 ns, per 18 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 6396186 ns, per 6 ns
measure every time
total 18305147 ns, per 18 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 4926804 ns, per 4 ns
measure every time
total 17800426 ns, per 17 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14690476 ns, per 14 ns
measure every time
total 19997896 ns, per 19 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 5699467 ns, per 5 ns
measure every time
total 20125913 ns, per 20 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14409033 ns, per 14 ns
measure every time
total 21461318 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13536455 ns, per 13 ns
measure every time
total 21876292 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 8005645 ns, per 8 ns
measure every time
total 19597872 ns, per 19 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 7991273 ns, per 7 ns
measure every time
total 20590361 ns, per 20 ns
And then I thought: so measuring each iteration costs about 20 ns? Oops, that's an average, not an upper bound. Let's check the max.
use std::time::Instant;
use std::hint::black_box;

fn thing() {
    ()
}

fn measure_once() {
    let before = Instant::now();
    for _ in 0..1_000_000 {
        black_box(thing());
    }
    let elapsed = before.elapsed();
    let nanos = elapsed.as_nanos();
    println!("measure once");
    println!("total {} ns, per {} ns", nanos, nanos / 1_000_000);
}

fn measure_a_million() {
    let mut sum: u128 = 0;
    let mut max: u128 = 0;
    for _ in 0..1_000_000 {
        let before = Instant::now();
        black_box(thing());
        let elapsed = before.elapsed();
        let nanos = elapsed.as_nanos();
        if nanos > max {
            max = nanos;
        }
        sum += nanos;
    }
    println!("measure every time");
    println!("total {} ns, avg {} ns, max {} ns", sum, sum / 1_000_000, max);
}

fn main() {
    measure_once();
    measure_a_million();
}
And what do we see?
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14657361 ns, per 14 ns
measure every time
total 18194896 ns, avg 18 ns, max 182951 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13054992 ns, per 13 ns
measure every time
total 20711272 ns, avg 20 ns, max 2024579 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13262997 ns, per 13 ns
measure every time
total 18100872 ns, avg 18 ns, max 1050399 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 6463544 ns, per 6 ns
measure every time
total 21107717 ns, avg 21 ns, max 2026746 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14400337 ns, per 14 ns
measure every time
total 18630821 ns, avg 18 ns, max 622644 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13925357 ns, per 13 ns
measure every time
total 22116502 ns, avg 22 ns, max 2186894 ns
Huh. That's interesting. The max is like 10% of the total. I wonder what those outliers are. Here are some possibilities I thought of:
- The process was scheduled out during the measurement window.
- There's a processor cache miss on the time store or the time load.
- There's a processor cache miss on an instruction.
- The page containing the instruction or data store has been dropped from RAM and needs to be restored from backing store.
- There's a processor cache miss on a page table lookup (or more than one).
- There are cache misses on the instruction stream.
- The hypervisor preempted the VM, so the guest kernel wasn't scheduled.
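A useful companion experiment (my own sketch, not part of the benchmark above) is to time an empty window: take two clock readings back to back, so whatever elapsed time shows up is pure measurement overhead, with no workload in between.

```rust
use std::time::Instant;

fn main() {
    let iters: u128 = 1_000_000;
    let mut sum: u128 = 0;
    let mut max: u128 = 0;
    for _ in 0..iters {
        // Nothing happens between the two readings, so the elapsed
        // time is the cost of the clock reads themselves (plus any
        // interruptions that land inside the window).
        let before = Instant::now();
        let nanos = before.elapsed().as_nanos();
        sum += nanos;
        if nanos > max {
            max = nanos;
        }
    }
    println!(
        "empty window: total {} ns, avg {} ns, max {} ns",
        sum,
        sum / iters,
        max
    );
}
```

If the average here comes out close to the ~20 ns per-iteration figure above, that would suggest most of what's being measured is the measurement itself; and if the max still shows huge outliers, those outliers can't be blamed on thing().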
I wonder what the latencies of cache misses all the way out to RAM look like. Here's some information: it seems the processor in question has a multi-layer cache.
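To get a rough empirical feel for those latencies, one common trick is a pointer chase: walk a cyclic permutation of an array so that each load's address depends on the previous load, which keeps the prefetcher from hiding the latency. This is my own sketch, and the working-set sizes are guesses at what might land in L1, a mid-level cache, and RAM on this machine, not measured cache sizes.

```rust
use std::hint::black_box;
use std::time::Instant;

// Walk a single-cycle permutation of `len` indices for `hops` steps.
// Each load depends on the previous one, so the average time per hop
// approximates the load latency at this working-set size.
fn chase(len: usize, hops: usize) -> f64 {
    // Build a single cycle: with `len` a power of two and an odd
    // stride, (idx + stride) % len visits every index exactly once.
    let stride = len / 2 + 1;
    let mut next = vec![0usize; len];
    let mut idx = 0;
    for _ in 0..len {
        let nxt = (idx + stride) % len;
        next[idx] = nxt;
        idx = nxt;
    }
    let before = Instant::now();
    let mut p = 0usize;
    for _ in 0..hops {
        p = next[p];
    }
    let nanos = before.elapsed().as_nanos();
    black_box(p); // keep the chase observable so it isn't optimized away
    nanos as f64 / hops as f64
}

fn main() {
    // 32 KiB, 2 MiB, and 64 MiB working sets (8 bytes per element).
    for &len in &[1 << 12, 1 << 18, 1 << 23] {
        let ns = chase(len, 1_000_000);
        println!("{:>9} elements ({:>6} KiB): {:.1} ns/load",
                 len, len * 8 / 1024, ns);
    }
}
```

Run with --release, or the per-hop numbers will mostly reflect debug-build overhead rather than memory latency.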
What can you think of? Tell me.