Benchee after_each hook impact on measurements

Hi,

The benchee documentation states:

Of course, hooks are not included in the measurements (…) Sadly there is the notable exception of too_fast_functions (…). As we need to measure their repeated invocations to get halfway good measurements before_each and after_each hooks are included there. However, to the best of our knowledge this should only ever happen on Windows (because of the bad run time measurement accuracy).

However, if I simply add a call to Logger.debug(...) to my after_each function, the measurements are greatly impacted.
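
A simplified sketch of my setup (MyApp.insert_agent/1 and build_input/0 are illustrative stand-ins, not my exact code):

  require Logger

  Benchee.run(
    %{"insert_agent" => fn input -> MyApp.insert_agent(input) end},
    inputs: %{"default" => build_input()},
    # the hook that makes the difference below
    after_each: fn _return_value -> Logger.debug("inserted") end,
    warmup: 2,
    time: 5
  )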

Without the Logger

Name                   ips        average  deviation         median         99th %
insert_agent       40.26 K       24.84 μs    ±36.17%       22.57 μs       49.31 μs
insert_agent       44.80 K       22.32 μs    ±26.08%       21.03 μs       41.00 μs
insert_agent       43.39 K       23.05 μs    ±31.96%       21.38 μs       42.45 μs

With the Logger

insert_agent       17.01 K       58.80 μs   ±353.56%       42.42 μs      292.90 μs
insert_agent       17.32 K       57.75 μs   ±249.72%       42.66 μs      321.68 μs
insert_agent       17.72 K       56.43 μs   ±246.62%       42.46 μs      284.70 μs

(3 repeated executions of each scenario, to show that the effect is reproducible)

Any idea what I might be missing?

I’m running benchee 1.0.1 and the setup is:

Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
Number of Available Cores: 4
Available memory: 7.68 GB
Elixir 1.12.2
Erlang 24.1.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: default
Estimated total run time: 7 s

Thanks for any clarification.

Given that this only takes a few μs, it looks like this falls into the documented exception of functions that are too fast to measure accurately without running them repeatedly. You should probably also see a warning in the output that this function is very fast, right?

That’s what I thought too, but no such warning appears, and I’m running on Ubuntu, not Windows. I didn’t mention it before, but I’m running the benchmark inside an ExUnit test. Not sure whether that has an impact, though.

Running in ExUnit shouldn’t make a difference. Do you have the fast function warning disabled? You can do that when setting up your benchmark.
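
For reference, turning that warning off looks something like this:

  # pass print options when setting up the benchmark
  Benchee.run(jobs, print: [fast_warning: false])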

Either way, I can pretty much guarantee that this happens because that function you’re benchmarking is extremely fast. Can you drop a sleep in there for like 50ms and see what happens?

Do you have the fast function warning disabled?

No, :fast_warning is enabled by default and I didn’t override it.

I can pretty much guarantee that this happens because that function you’re benchmarking is extremely fast

I’m benchmarking a function that (sketched below):

  • Calls DateTime.utc_now() to get a timestamp;
  • Creates a record with data from before_each (strings and a list of strings) plus the timestamp;
  • Inserts the record into a local in-memory mnesia table using :mnesia.write inside :mnesia.transaction;
  • Creates a struct with the data from the record and returns it.
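
In simplified form (table, field, and struct names are illustrative stand-ins, not my real code):

  def insert_agent(%{name: name, tags: tags}) do
    # data comes from before_each; the timestamp is taken inside the measured function
    timestamp = DateTime.utc_now()

    {:atomic, :ok} =
      :mnesia.transaction(fn ->
        :mnesia.write({:agent, name, tags, timestamp})
      end)

    %MyApp.Agent{name: name, tags: tags, inserted_at: timestamp}
  end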

As the metrics in my first post show, it is indeed fast. Do you think I should benchmark multiple calls to the function instead of individual ones?

Can you drop a sleep in there for like 50ms and see what happens?

I had done that even before starting this thread, but with 200 ms instead of 50 ms, and in the after_each rather than in the benchmarked function.
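
The hook itself was simply this (sketch):

  after_each: fn _ -> Process.sleep(200) end

The results: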

No logger and no sleep

Name                   ips        average  deviation         median         99th %
insert_agent       45.78 K       21.84 μs    ±27.15%       20.24 μs       42.27 μs
insert_agent       44.65 K       22.40 μs    ±33.99%       20.38 μs       48.54 μs
insert_agent       45.27 K       22.09 μs    ±31.76%       20.60 μs       40.96 μs

No logger but with Process.sleep(200) in the after_each

insert_agent        4.37 K      228.76 μs    ±21.05%      225.18 μs      386.85 μs

I’ll do it (add a sleep) in the function being benchmarked and post the results here.

Ok, that’s really clear now. The hook impacts your measurements not because the computation happening in the hook is being measured, but because what you’re doing in the hook creates work that the VM likely performs asynchronously during subsequent runs of the function being benchmarked. Logging fits that pattern, since Logger is asynchronous by default. If you want to confirm that, do something CPU-bound in the hook, like calculating a ton of Fibonacci numbers (since that does nothing outside the active process), and check that it doesn’t affect the benchmark.
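
For example, a deliberately naive Fibonacci keeps all the work inside the calling process (fib/1 here is just an illustrative helper):

  # CPU-bound and synchronous: nothing escapes the current process
  def fib(0), do: 0
  def fib(1), do: 1
  def fib(n) when n > 1, do: fib(n - 1) + fib(n - 2)

  # e.g. hook it up as: after_each: fn _ -> fib(40) end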

Unfortunately, benchmarking on a VM that is made for parallelism like the BEAM comes with caveats like this sometimes.

I’ll do it (add a sleep) in the function being benchmarked and post the results here.

original function

Name                   ips        average  deviation         median         99th %
insert_agent       41.26 K       24.24 μs    ±45.16%       22.18 μs       47.10 μs
insert_agent       40.71 K       24.57 μs    ±33.24%       22.46 μs       48.28 μs
insert_agent       40.64 K       24.61 μs    ±31.48%       22.35 μs       49.31 μs

with Process.sleep(50)

Name                   ips        average  deviation         median         99th %
insert_agent         19.66       50.86 ms     ±0.33%       50.83 ms       51.44 ms
insert_agent         19.67       50.83 ms     ±0.21%       50.82 ms       51.19 ms
insert_agent         19.67       50.83 ms     ±0.17%       50.83 ms       51.18 ms

with Process.sleep(50) + Logger in the after_each

Name                   ips        average  deviation         median         99th %
insert_agent         19.65       50.88 ms     ±0.62%       50.87 ms       52.02 ms
insert_agent         19.64       50.91 ms     ±0.80%       50.80 ms       53.10 ms
insert_agent         19.61       50.99 ms     ±0.87%       50.89 ms       53.19 ms

The results above show that when the measured function is orders of magnitude slower (ms instead of μs), a simple Logger call in the hook doesn’t make a big difference.

If you want to confirm that, do something CPU-bound in the hook, like calculating a ton of Fibonacci numbers (since that does nothing outside the active process), and check that it doesn’t affect the benchmark.

Done:

Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
Number of Available Cores: 4
Available memory: 7.68 GB
Elixir 1.12.2
Erlang 24.1.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: default
Estimated total run time: 14 s

Benchmarking no_after_each with input default...
Benchmarking with_cpu_heavy_after_each with input default...

##### With input default #####
Name                                ips        average  deviation         median         99th %
no_after_each                   41.38 K       24.17 μs    ±52.55%       21.61 μs       51.16 μs
with_cpu_heavy_after_each        9.78 K      102.23 μs     ±0.00%      102.23 μs      102.23 μs

Comparison: 
no_after_each                   41.38 K
with_cpu_heavy_after_each        9.78 K - 4.23x slower +78.07 μs

It does have some impact, but not a direct one; otherwise the slower scenario would have shown latencies on the order of seconds rather than microseconds. The “cpu_heavy” hook is my naive Fibonacci implementation calculating the first 45 numbers, which takes around 13 seconds.
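
Concretely, the hook was along these lines (reusing the naive fib/1 sketched earlier in the thread):

  # the first 45 Fibonacci numbers, i.e. fib(0) through fib(44)
  after_each: fn _ -> Enum.each(0..44, &fib/1) end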

So yeah, I guess we are all on the same page. Thanks a lot for your time in this thread, Devon, and for the Benchee library itself.

Cheers!