What am I even measuring?
I've been working recently on interpreting the results of a benchmark, and, as is the nature of benchmark interpretation, there are many questions.
A benchmark itself, in my opinion, is simple. You get some computer, you run some program on it, and you get some result.
So here's the benchmark I ran:
On an 11th Gen Intel® Core™ i5-11500B @ 3.30GHz running Qubes, I made a VM with a single VCPU, running kernel 6.12.47-1.fc37 with a Debian userland, and built the following program with rustc 1.91.1 (ed61e7d7e 2025-11-07).
use std::time::Instant;
use std::hint::black_box;

fn thing() {
    ()
}

fn measure_once() {
    let before = Instant::now();
    for _ in 0..1_000_000 {
        black_box(thing());
    }
    let elapsed = before.elapsed();
    let nanos = elapsed.as_nanos();
    println!("measure once");
    println!("total {} ns, per {} ns", nanos, nanos / 1_000_000);
}

fn measure_a_million() {
    let mut sum: u128 = 0;
    for _ in 0..1_000_000 {
        let before = Instant::now();
        black_box(thing());
        let elapsed = before.elapsed();
        let nanos = elapsed.as_nanos();
        sum += nanos;
    }
    println!("measure every time");
    println!("total {} ns, per {} ns", sum, sum / 1_000_000);
}

fn main() {
    measure_once();
    measure_a_million();
}
The first iteration didn't use black_box, but I was cross-checking the results with a friend who expressed some concern about compiler optimizations. So I took a look at the assembly with objdump --disassemble --demangle (gotta use --demangle because Rust), and I saw a call instruction, which suggested the function call was still there. But he wasn't an assembly expert, and seeing as black_box is now in the Rust standard library, it seemed wise to use it (and more reliable than futzing about with volatile in C, which was an idea that got kicked around).
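For what it's worth, here's a minimal, self-contained sketch of what black_box is for. This is my own illustration, not the benchmark above: without the black_box calls, an optimizing build is free to fold the whole computation into a constant, and you'd be timing nothing.

```rust
use std::hint::black_box;

// A trivially foldable computation. In an optimized build without
// black_box, the compiler may compute the answer at compile time and
// delete the loop entirely.
fn sum_to(n: u64) -> u64 {
    let mut total = 0u64;
    for i in 0..n {
        // black_box makes the optimizer treat `i` as observed,
        // so the loop body can't be eliminated as dead code.
        total += black_box(i);
    }
    total
}

fn main() {
    // black_box on the argument and the result keeps the call itself
    // from being constant-folded at the call site.
    let result = black_box(sum_to(black_box(1_000)));
    println!("sum = {}", result); // prints "sum = 499500"
}
```

Note that black_box is a hint, not a guarantee; checking the disassembly, as above, is still the only way to be sure.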
I ran it several times and got these results:
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14161672 ns, per 14 ns
measure every time
total 21823193 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 12956590 ns, per 12 ns
measure every time
total 21358464 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13735194 ns, per 13 ns
measure every time
total 21926421 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14379680 ns, per 14 ns
measure every time
total 20454596 ns, per 20 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13689761 ns, per 13 ns
measure every time
total 18628914 ns, per 18 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14054976 ns, per 14 ns
measure every time
total 19446841 ns, per 19 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 7608650 ns, per 7 ns
measure every time
total 18857253 ns, per 18 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 6396186 ns, per 6 ns
measure every time
total 18305147 ns, per 18 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 4926804 ns, per 4 ns
measure every time
total 17800426 ns, per 17 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14690476 ns, per 14 ns
measure every time
total 19997896 ns, per 19 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 5699467 ns, per 5 ns
measure every time
total 20125913 ns, per 20 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14409033 ns, per 14 ns
measure every time
total 21461318 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13536455 ns, per 13 ns
measure every time
total 21876292 ns, per 21 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 8005645 ns, per 8 ns
measure every time
total 19597872 ns, per 19 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 7991273 ns, per 7 ns
measure every time
total 20590361 ns, per 20 ns
And then I thought: so measuring each iteration costs about 20 ns? Oops, that's an average, not an upper bound. Let's check the max.
use std::time::Instant;
use std::hint::black_box;

fn thing() {
    ()
}

fn measure_once() {
    let before = Instant::now();
    for _ in 0..1_000_000 {
        black_box(thing());
    }
    let elapsed = before.elapsed();
    let nanos = elapsed.as_nanos();
    println!("measure once");
    println!("total {} ns, per {} ns", nanos, nanos / 1_000_000);
}

fn measure_a_million() {
    let mut sum: u128 = 0;
    let mut max: u128 = 0;
    for _ in 0..1_000_000 {
        let before = Instant::now();
        black_box(thing());
        let elapsed = before.elapsed();
        let nanos = elapsed.as_nanos();
        if nanos > max {
            max = nanos;
        }
        sum += nanos;
    }
    println!("measure every time");
    println!("total {} ns, avg {} ns, max {} ns", sum, sum / 1_000_000, max);
}

fn main() {
    measure_once();
    measure_a_million();
}
And what do we see?
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14657361 ns, per 14 ns
measure every time
total 18194896 ns, avg 18 ns, max 182951 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13054992 ns, per 13 ns
measure every time
total 20711272 ns, avg 20 ns, max 2024579 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13262997 ns, per 13 ns
measure every time
total 18100872 ns, avg 18 ns, max 1050399 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 6463544 ns, per 6 ns
measure every time
total 21107717 ns, avg 21 ns, max 2026746 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 14400337 ns, per 14 ns
measure every time
total 18630821 ns, avg 18 ns, max 622644 ns
user@opkg1:~/timebench$ ./target/debug/timebench
measure once
total 13925357 ns, per 13 ns
measure every time
total 22116502 ns, avg 22 ns, max 2186894 ns
Huh. That's interesting. The max is like 10% of the total. I wonder what those outliers are. Here are some possibilities I thought of:
- The process was scheduled out during the measurement window.
- There's a processor cache miss on the time store or the time load.
- There's a processor cache miss on an instruction.
- The page containing the instruction or data store has been dropped from RAM and needs to be restored from backing store.
- There's a processor cache miss on a page table lookup (or more than one).
- There are cache misses on the instruction stream.
- The hypervisor preempted the VM, so the guest kernel wasn't scheduled.
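A useful companion experiment (my own sketch, not part of the benchmark above) is to time an empty window: take two clock readings back to back, so whatever elapsed time shows up is pure measurement overhead, with no workload in between.

```rust
use std::time::Instant;

fn main() {
    let iters: u128 = 1_000_000;
    let mut sum: u128 = 0;
    let mut max: u128 = 0;
    for _ in 0..iters {
        // Nothing happens between the two readings, so the elapsed
        // time is the cost of the clock reads themselves (plus any
        // interruptions that land inside the window).
        let before = Instant::now();
        let nanos = before.elapsed().as_nanos();
        sum += nanos;
        if nanos > max {
            max = nanos;
        }
    }
    println!(
        "empty window: total {} ns, avg {} ns, max {} ns",
        sum,
        sum / iters,
        max
    );
}
```

If the average here comes out close to the ~20 ns per-iteration figure above, that would suggest most of what's being measured is the measurement itself; and if the max still shows huge outliers, those outliers can't be blamed on thing().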
I wonder what the latencies of cache misses all the way out to RAM look like. Here's some information: it seems the processor in question has a multi-layer cache.
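To get a rough empirical feel for those latencies, one common trick is a pointer chase: walk a cyclic permutation of an array so that each load's address depends on the previous load, which keeps the prefetcher from hiding the latency. This is my own sketch, and the working-set sizes are guesses at what might land in L1, a mid-level cache, and RAM on this machine, not measured cache sizes.

```rust
use std::hint::black_box;
use std::time::Instant;

// Walk a single-cycle permutation of `len` indices for `hops` steps.
// Each load depends on the previous one, so the average time per hop
// approximates the load latency at this working-set size.
fn chase(len: usize, hops: usize) -> f64 {
    // Build a single cycle: with `len` a power of two and an odd
    // stride, (idx + stride) % len visits every index exactly once.
    let stride = len / 2 + 1;
    let mut next = vec![0usize; len];
    let mut idx = 0;
    for _ in 0..len {
        let nxt = (idx + stride) % len;
        next[idx] = nxt;
        idx = nxt;
    }
    let before = Instant::now();
    let mut p = 0usize;
    for _ in 0..hops {
        p = next[p];
    }
    let nanos = before.elapsed().as_nanos();
    black_box(p); // keep the chase observable so it isn't optimized away
    nanos as f64 / hops as f64
}

fn main() {
    // 32 KiB, 2 MiB, and 64 MiB working sets (8 bytes per element).
    for &len in &[1 << 12, 1 << 18, 1 << 23] {
        let ns = chase(len, 1_000_000);
        println!("{:>9} elements ({:>6} KiB): {:.1} ns/load",
                 len, len * 8 / 1024, ns);
    }
}
```

Run with --release, or the per-hop numbers will mostly reflect debug-build overhead rather than memory latency.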
What can you think of? Tell me.