Sub-millisecond Timer Precision

I understand that the concepts of sleep or delay are, for good reasons, frowned upon by the Erlang community. Process.send_after/4 is an excellent alternative in most cases and offers at best 1 ms precision (as does the underlying erlang:send_after call). Yet 1 ms is an eternity for my application.

I am writing time-aware code where events must happen NOT before some point in time. The precise delay amount is not crucial; it just needs to be roughly consistent and in the range of tens of microseconds at most. The solution should also scale easily to 10k+ concurrent timers from the start.

There are a few options (that come to my mind) to achieve sub-millisecond timing:

  1. The naive one would be a dirty NIF (running on a dirty scheduler) in combination with POSIX nanosleep. There are two issues with this approach: no form of sleep is scalable, and when a context switch happens there is roughly a 30-microsecond lag, completely defeating the nano part of nanosleep.
  2. Use a standard NIF and POSIX timer_settime. The NIF is only called once to set up the timer. The Elixir process that started the NIF then receives messages at consistent time intervals (either from a SIGEV_SIGNAL signal handler or from another pthread within the NIF). With a 50-microsecond interval, however, this amounts to roughly 4M reductions on the timer process.
  3. Implement a native send_after_microseconds, this time utilizing a ring buffer. At this point this is the solution I am most inclined towards, as it would not spam nearly as many messages.
  4. Introduce yet another type of scheduler in Erlang, dedicated to time-critical operations.

Has anyone faced a similar problem? Any hints on efficiency or further options would be highly appreciated! Below is the preliminary code for option 2.

Cheers,

Martin

defmodule Clock do
  use GenServer
  require Logger

  @on_load :load_nifs

  def load_nifs() do
    :ok = :erlang.load_nif('priv/c/clock', 0)
  end

  def start_link(_arg) do
    GenServer.start_link(__MODULE__, :ok, name: Clock)
  end

  def send_after(pid, term, ticks) do
    GenServer.cast(Clock, {:send, pid, term, ticks})
  end

  def get_time() do
    GenServer.call(Clock, :get_time)
  end

  def init(_arg) do
    Logger.debug("starting clock")
    Process.flag(:priority, :high)
    send_every(:tick, 50)
    {:ok, {0, []}}
  end

  def handle_info(:tick, {tick, []}) do
    {:noreply, {tick + 1, []}}
  end

  def handle_info(:tick, {tick, [head | tail]}) do
    Enum.each(head, fn {pid, term} -> send(pid, {tick, term}) end)
    {:noreply, {tick + 1, tail}}
  end

  def handle_cast({:send, pid, term, ticks}, {tick, buffer}) do
    # Bucket i of the buffer is flushed on the (i + 1)-th upcoming tick,
    # so a message due in `ticks` ticks belongs in bucket `ticks - 1`.
    new_buffer =
      case length(buffer) - ticks do
        -1 ->
          buffer ++ [[{pid, term}]]

        rem when rem < 0 ->
          buffer ++ List.duplicate([], -1 - rem) ++ [[{pid, term}]]

        _ ->
          List.update_at(buffer, ticks - 1, &(&1 ++ [{pid, term}]))
      end

    {:noreply, {tick, new_buffer}}
  end

  def handle_call(:get_time, _from, {tick, _} = status) do
    {:reply, tick, status}
  end

  defp send_every(_term, _micros) do
    raise "clock NIF library not loaded"
  end
end
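
For completeness, the C half of the NIF is not shown here. A minimal, hypothetical sketch of what it could look like, assuming POSIX timer_create with a SIGEV_THREAD callback (the SIGEV_SIGNAL variant mentioned in option 2 would be similar), is below; the file name and error handling are illustrative only:

/* clock.c - hypothetical sketch of the NIF behind Clock.send_every/2.
 * It arms a POSIX interval timer that fires a SIGEV_THREAD callback every
 * `micros` microseconds and forwards the given term to the calling process. */
#include <erl_nif.h>
#include <signal.h>
#include <time.h>

typedef struct {
    ErlNifPid    pid;  /* the process that called send_every/2           */
    ErlNifEnv   *env;  /* process-independent env that keeps `msg` alive */
    ERL_NIF_TERM msg;  /* the term to deliver on every tick              */
} tick_state;

static void on_tick(union sigval sv)
{
    tick_state *st = sv.sival_ptr;
    /* enif_send consumes the message env, so build a fresh one per tick.
       The caller-env argument is NULL because this is not a scheduler thread. */
    ErlNifEnv *msg_env = enif_alloc_env();
    enif_send(NULL, &st->pid, msg_env, enif_make_copy(msg_env, st->msg));
    enif_free_env(msg_env);
}

static ERL_NIF_TERM send_every(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    unsigned long micros;
    if (argc != 2 || !enif_get_ulong(env, argv[1], &micros) ||
        micros == 0 || micros >= 1000000)           /* keep the interval below one second */
        return enif_make_badarg(env);

    tick_state *st = enif_alloc(sizeof(tick_state));
    st->env = enif_alloc_env();
    st->msg = enif_make_copy(st->env, argv[0]);
    enif_self(env, &st->pid);

    struct sigevent sev = {0};
    sev.sigev_notify          = SIGEV_THREAD;       /* callback on a helper thread */
    sev.sigev_notify_function = on_tick;
    sev.sigev_value.sival_ptr = st;

    timer_t tid;
    if (timer_create(CLOCK_MONOTONIC, &sev, &tid) != 0)
        return enif_make_badarg(env);

    struct itimerspec its = {0};
    its.it_value.tv_nsec    = micros * 1000;        /* first expiry               */
    its.it_interval.tv_nsec = micros * 1000;        /* then repeat every `micros` */
    timer_settime(tid, 0, &its, NULL);

    return enif_make_atom(env, "ok");
}

static ErlNifFunc nif_funcs[] = {{"send_every", 2, send_every}};

ERL_NIF_INIT(Elixir.Clock, nif_funcs, NULL, NULL, NULL, NULL)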

That doesn’t answer your question in a satisfying way, but at least it contains an explanation:

http://erlang.org/pipermail/erlang-questions/2007-March/025680.html

There’s Erlang module code proposed several answers later, but I don’t see it addressing microsecond delays directly. You can always use :timer.tc and implement your own loop, of course (using :erlang.yield). That’s probably your best bet for a non-NIF solution.

Outside of that, a NIF it is. But keep in mind that even the more real-time-inclined kernels don’t guarantee perfect accuracy, since several programs might request timers that fire at roughly the same time.

Finally, I am not very convinced you should even use Erlang / Elixir if your app has such needs.


+1

[erlang-questions] What does “soft” real-time mean?


As to the link: that would stand in 2007, but the POSIX implementation of the time module has been rewritten since and allows for much higher precision and fine-tuning.

You can always use :timer.tc and implement your own loop, of course (using :erlang.yield)

A busy loop is not an option once I have 10k+ processes waiting.

Finally, I am not very convinced you should even use Erlang / Elixir if your app has such needs.

No, on the contrary, it is a perfect choice. Time awareness is only one aspect of it. The ultimate reason for choosing the Erlang runtime is that there can eventually be billions of concurrent processes. I need a cheap source of synchronization that does not need to be extremely precise, yet must live in the sub-millisecond area.

I partially understand your motivation. I can’t speak for Erlang’s creators, but in my eyes they opted for the lesser evil: being realistic that they cannot offer those features given the variety of kernels the BEAM must work on.

If you work on a soft real-time system, then I would say Erlang/Elixir in combination with a good NIF is your best bet. It’s true that the BEAM seems unbeaten at doing a lot of stuff concurrently and reliably.

RE: Context switches and delays, they’re inevitable.

RE: Ring buffer solution, 50/50. Sounds good but the potential for unexpected problems is high.

RE: Custom scheduler, better don’t. In my opinion anyway, not a strictly factual advice.

Sorry, can’t think of anything good enough.


RE: Custom scheduler, better don’t. In my opinion anyway, not a strictly factual advice.

IMHO: I believe the 1 ms resolution stems from the above-mentioned soft real-time nature. To my knowledge, the scheduler does regular checks against system time (Erlang time with ns resolution) and forwards the messages that are due. I believe 1 ms was something they were able to, at least remotely, guarantee. My guess is the scheduler could go with a higher resolution without such guarantees…

RE: Ring buffer solution, 50/50. Sounds good but the potential for unexpected problems is high.

As I will most likely go down this path, I will post the sources for the NIF and the Elixir wrapper once they are ready.


I believe the 1 ms resolution is the lowest common (reliable) denominator of all the platforms the BEAM runs on. It’s possible to use a much finer resolution on most platforms, but not all of them, unfortunately.

shameless plug

I wrote a library with the same API as :timer but with a resolution in microseconds.

It’s called micro_timer.

I did some investigations before writing my own module, and I think the relevant snippet of the Erlang sleep implementation is this one:

int
erts_milli_sleep(long ms)
{
    if (ms > 0) {
#ifdef __WIN32__
        Sleep((DWORD) ms);
#else
        struct timeval tv;
        tv.tv_sec = ms / 1000;
        tv.tv_usec = (ms % 1000) * 1000;
        if (select(0, NULL, NULL, NULL, &tv) < 0)
            return errno == EINTR ? 1 : -1;
#endif
    }
    return 0;
}

Sleep for win32 is defined as

void Sleep(
  DWORD dwMilliseconds
);

It only accepts milliseconds.

There is a win32 implementation of the select function that takes microseconds, but it’s in the Winsock2 API, which is supported only from Windows Vista onward.

One could try to compile ERTS using a sleep function that supports microseconds, but I guess it would break a lot of existing software.

It should also be simple enough to write a NIF that supports sleeping for microseconds; once you have the sleep function, you can build all the other functionality around it (that’s exactly what I did in my library, except for the NIF part).
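
For illustration, a hypothetical sketch of such a sleep NIF (this is not micro_timer code; the module and function names are made up) could use clock_nanosleep on a dirty scheduler, so that the sleeping call does not block a normal scheduler thread:

/* usleep_nif.c - hypothetical microsecond-sleep NIF.
 * The dirty-scheduler flag keeps the sleeping call off the normal schedulers. */
#include <erl_nif.h>
#include <errno.h>
#include <time.h>

static ERL_NIF_TERM usleep_nif(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    unsigned long us;
    if (argc != 1 || !enif_get_ulong(env, argv[0], &us))
        return enif_make_badarg(env);

    struct timespec ts = { .tv_sec = us / 1000000, .tv_nsec = (us % 1000000) * 1000 };
    while (clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, &ts) == EINTR)
        ;  /* interrupted by a signal: retry with the remaining time */

    return enif_make_atom(env, "ok");
}

static ErlNifFunc nif_funcs[] = {
    {"usleep", 1, usleep_nif, ERL_NIF_DIRTY_JOB_IO_BOUND}
};

ERL_NIF_INIT(Elixir.MicroSleep, nif_funcs, NULL, NULL, NULL, NULL)

As discussed above, the actual accuracy of such a sleep will still be at the mercy of the OS scheduler and context switches.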

EDIT:
For clarity: sleep in Erlang is implemented using the timeout of receive

-spec sleep(Time) -> 'ok' when
      Time :: timeout().
sleep(T) ->
    receive
    after T -> ok
    end.

The snippet I was referring to is what I believe to be the low-level C implementation.


erts_milli_sleep is only used in testing and on operating systems without a monotonic time source.

What is used to sleep is either futex or WaitForSingleObject with some spinning done around it.

This is the relevant code for unix: https://github.com/erlang/otp/blob/master/erts/lib_src/pthread/ethr_event.c#L78-L174.

When sleeping in poll, timerfd_create (http://man7.org/linux/man-pages/man2/timerfd_create.2.html) is used to increase the resolution of the timer when triggered.
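
For anyone curious, basic timerfd usage on Linux looks roughly like this (a standalone sketch, not ERTS code; inside ERTS the descriptor would presumably sit in the poll set rather than being read directly):

/* Arm a 50 µs periodic timer and block on read(); each read returns the
 * number of expirations since the previous read. */
#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <unistd.h>

int main(void)
{
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);

    struct itimerspec its = {
        .it_value    = { .tv_sec = 0, .tv_nsec = 50 * 1000 },  /* first expiry     */
        .it_interval = { .tv_sec = 0, .tv_nsec = 50 * 1000 },  /* then every 50 µs */
    };
    timerfd_settime(fd, 0, &its, NULL);

    for (int i = 0; i < 10; i++) {
        uint64_t expirations;
        read(fd, &expirations, sizeof(expirations));  /* blocks until the timer fires */
        printf("fired, %llu expiration(s)\n", (unsigned long long)expirations);
    }
    close(fd);
    return 0;
}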


Thanks for the clarification!
I was looking exactly for that, you saved me a lot of work!

My use case was correctness: I needed to generate exactly 60 fps (or 90 fps, or 120 fps), and that can’t be accomplished with millisecond precision alone.

My implementation is very naive; it’s OK if you have few timers running and don’t care about wasting some CPU cycles, or if you use it as a source of time, as a clock, like in MIDI sync.


Kudos for sharing your expert knowledge with us @garazdawi

Is there an example of how to use timerfd_create to increase the resolution?

What would it take to implement an erlang:send_after(Time, Dest, Msg, Unit, Options) function where Unit is the Erlang time_unit()?

From what I gathered from ethr_event.c and erl_poll.c, it might be a matter of passing the desired ethr_sint64_t timeout value. I’m a bit confused by the usage of both timeval and timespec, supporting microseconds and nanoseconds respectively, though.

My last concern is context switches and whether timerfd_create is able to deal with the issue gracefully by somehow abstracting the timer into a file descriptor. My local benchmarks (along with my online research) suggest that the POSIX clock_gettime function is highly susceptible to context switches, which is the main reason functions such as nanosleep rarely reach anything near nanosecond precision.

But I have the feeling my understanding of the matter went in a completely wrong direction a long, long time ago… :smile:


Sleeping in poll or on a futex already supports nanosecond resolution if the platform supports it. We use it when a scheduler decides that it needs to sleep a fraction of a millisecond before a timer would fire, e.g. the timer should fire at 5 ms, but the scheduler decides to sleep at 2.545 ms.

What would need to be done is to change the resolution of the timer wheel that dispatches timeouts, and of course expose the APIs.

Regarding which API would be best to use, I haven’t really experimented all that much. clock_gettime in virtualized environments does have problems, but when running natively it usually works well enough.

I think in general, though, that if you need nanosecond accuracy from your timers, Linux may not be the operating system to use.


When I was writing micro_timer, the first thing I tried was a C implementation of the sleep function.

I’ve tried almost everything, from setting a timeout on select and kevent to the POSIX timer_* functions; none of them worked reliably.

From my measurements, without spinlocks you can’t get consistent sub-millisecond timings, to the point that waiting 1 million times for 1 microsecond (which should take a second) takes almost 2 seconds.

The graph below shows the average difference between the timeout and the time it actually took to return.
The results are computed from 50 thousand samples; the timeout was set randomly from 1 to 65,535 microseconds (about 65 ms).
The average deviation for the C implementation is 2.509 ms, or 11.4%.
But the worst case is 10x off.

The three implementations tested are:

  • C uses select with a timeout
  • Ex0 waits timeout - 1 ms and loops for a maximum of 1 ms
  • Ex1 waits timeout - 2 ms and loops for a maximum of 2 ms

If you don’t need consistent absolute precision, any of these implementations is good enough, and they roughly work the same way.
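
For reference, here is a sketch of the same hybrid idea in C (Ex0/Ex1 themselves are Elixir; the names below are made up): sleep coarsely until roughly a millisecond before the deadline, then busy-wait on the monotonic clock for the remainder.

/* Hybrid micro-sleep sketch: nanosleep for most of the interval, then spin on
 * clock_gettime for the last ~1 ms to hit the deadline more precisely. */
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

void sleep_us_hybrid(uint64_t us)
{
    const uint64_t spin_margin_ns = 1000000;           /* spin for the last ~1 ms */
    uint64_t deadline = now_ns() + us * 1000;

    if (us * 1000 > spin_margin_ns) {                  /* coarse part: real sleep */
        uint64_t coarse = us * 1000 - spin_margin_ns;
        struct timespec ts = { .tv_sec = coarse / 1000000000ull,
                               .tv_nsec = coarse % 1000000000ull };
        nanosleep(&ts, NULL);
    }
    while (now_ns() < deadline)                        /* fine part: busy-wait */
        ;
}

The trade-off is the one discussed above: the spinning part burns CPU cycles but removes most of the wake-up jitter.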


The timer.c is very interesting reading indeed. I am tempted to hack the timeout facility, bump the slot length to 1 µs and see what it does (besides breaking all time-related APIs :smile:). My guess is that the timer wheel would introduce quite some load on the CPU compared to 1 ms slots, but I might be wrong.

I have no need for ns precision; tens of microseconds are fine for my case. Staying at milliseconds would, however, introduce an inherent throughput issue, as timeouts are part of nearly every operation within the app.

Again, many thanks for great pointers!


This looks like a lot of context switching going on. It usually hides behind the clock_gettime call: before setting a timeout, the current time is needed to calculate the timeout timestamp (e.g. nanosleep will always call that function first). I have seen online benchmarks that revealed a strong correlation between context switching and the lag. If you can hardware-bind the timer process to a dedicated core, it should be possible to achieve outstanding results.
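
On Linux, pinning the timer thread to one core is straightforward (a sketch; pthread_setaffinity_np is a GNU extension, and the function name here is illustrative):

/* Pin the calling thread to a given CPU core so the timer loop is not
 * migrated between cores (Linux-specific, needs _GNU_SOURCE). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}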

I actually found a way around this by setting an interval on the timer and handling the SIGEV_SIGNAL interrupt. The clock_gettime calls are avoided and the performance is much more consistent. Yet such a repetitive timer is rarely useful, I’d guess…
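
A sketch of that repetitive-timer pattern in plain C (standalone, not the NIF itself; consuming the blocked signal with sigwaitinfo sidesteps the async-signal-safety restrictions a classic handler would have):

/* Periodic POSIX timer delivering SIGRTMIN, consumed with sigwaitinfo.
 * No per-iteration clock_gettime or deadline arithmetic is needed. */
#include <signal.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    sigprocmask(SIG_BLOCK, &set, NULL);        /* deliver via sigwaitinfo, not a handler */

    struct sigevent sev = {0};
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo  = SIGRTMIN;

    timer_t tid;
    timer_create(CLOCK_MONOTONIC, &sev, &tid);

    struct itimerspec its = {
        .it_value    = { .tv_nsec = 50 * 1000 },   /* first expiry after 50 µs */
        .it_interval = { .tv_nsec = 50 * 1000 },   /* then every 50 µs         */
    };
    timer_settime(tid, 0, &its, NULL);

    for (long tick = 0; tick < 10; tick++) {
        siginfo_t info;
        sigwaitinfo(&set, &info);              /* blocks until the next expiry */
        printf("tick %ld (overruns: %d)\n", tick, timer_getoverrun(tid));
    }
    return 0;
}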
