Elixir Crashes in Virtualbox with monotonic time message

dfalling · October 13, 2022, 6:38pm

I’m running Elixir/Phoenix in a VirtualBox instance. After a short period of time (typically 5 minutes or so), Elixir quits with this output:

Monotonic time stepped backwards!
Previous time: 2140883699436
Current time:  2140883692221
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed
zsh: IOT instruction (core dumped)  mix phx.server

I can’t find any sort of erlang crash log. What should I do to troubleshoot this crash? At the time of the crash I have plenty of memory and swap available.

VM OS: Lubuntu 22.04.1 LTS
VM Memory: 2GB
VM Swap: 2GB
Elixir: 1.14.0
Erlang: 25.0.4

kip · October 13, 2022, 6:47pm

I think you might need to report this upstream to the Erlang issue tracker. Monotonic time going backwards is definitely a case for shutting down! At a quick glance I didn’t see this issue already reported.

stevensonmt · October 13, 2022, 7:49pm

Are you able to try it in a different virtual environment to rule out this being a virtualbox time issue rather than an Erlang VM issue?
Based on this article

Erlang monotonic time. This is Erlang’s view of the OS monotonic time if available, or the VM’s own monotonic version of the system time if not available. This is the adjusted clock that is used for events, timers, and so on. Its stability makes it ideal to count a time interval. Do note that this time is monotonic, but not strictly monotonic, meaning that the clock can’t go backwards, but it can return the same value many times!

I don’t know if it is possible for VB to report a monotonic time that goes backwards, but if it can, then the VM that is setting its own monotonic time based on that “OS” monotonic time could crash if the time is not actually monotonic.

tcoopman · October 13, 2022, 7:59pm

When searching for virtualbox and monotonic time you find some issues. For example https://www.virtualbox.org/ticket/7915

It’s a very old issue. Although there might still be issues around?

dfalling · October 14, 2022, 12:46am

Switching VM’s would be a pretty big pain. I tried exporting my existing setup to VMWare but couldn’t get the image to work, and really don’t want to build this environment back up again.

I think you’re right that this is a VM issue, not Erlang. I’m guessing there isn’t a way to make Erlang more fault-tolerant to this issue? I’m not experiencing any other issues in my environment with other tooling or apps.

dfalling · October 14, 2022, 12:51am

One strange issue is I’ve found this doesn’t affect all Elixir instances. I have my server running and fired up my tests. The tests crashed during compilation due to the time, but the server kept on running no problem. Then I reran the tests and they compiled and ran ok, but the server crashed shortly afterwards. Guessing Elixir/Erlang have to be actively working on something to detect the time error?

stevensonmt · October 14, 2022, 1:32pm

Did you look at the Time Warp flags in that post I linked about Erlang time? In particular this seems maybe applicable:

+C single_time_warp This is a special hybrid mode to be used on embedded hardware when you know Erlang boots before the OS clock is synchronized. It works in two phases:

a.(with +c true) When the system boots, the monotonic clock is kept as stable as possible, but no system time adjustments are made
b. (with +c false) Same as no_time_warp

The user calls erlang:system_flag(time_offset, finalize), the Erlang system time warps once to match the OS system time, and then the clocks become equivalent to those under no_time_warp.

dfalling · October 14, 2022, 5:45pm

Aah, I’m sorry. I read your comment and quote but didn’t follow the link. That sounds exactly like what I want. I have it running right now and no crashes so far.

To summarize what I’m trying, it’s elixir --erl "+c false" -S mix phx.server. The description sounds like exactly what I want:

If the OS system time jumps backwards, the Erlang Monotonic clock stalls until the OS system time jumps back forward, which can take a while.

Erlang’s clock stalling for a bit is much better (in my dev environment) than the whole thing crashing.

Thank you! I’ll report back here if it ends up dying eventually, but it looks promising.

stevensonmt · October 14, 2022, 11:35pm

Just curious if the fix was persistent. If so, I’m wondering if someone smarter than me can explain what could be the potential pitfalls of having the BEAM clock stall at random times as dictated by the OS clock. I’m guessing log messages might end up with incorrect time stamps that could make debugging harder. Staying synchronous across multiple nodes I think might be another that is potentially only relevant for certain types of applications. Anything else?

dfalling · October 15, 2022, 9:49am

Yeah, still good. It looks like this solved it (usually it would have crashed a hundred times by now).

I’m curious about the implications as well. The clock rolling back is probably a huge red flag normally. I wonder what happens internally in Erlang’s VM when the clock is paused waiting for a successful forward tick.

stevensonmt · October 15, 2022, 2:34pm

Might be worth reporting to VirtualBox as a bug, assuming expected behavior for monotonic time in VB is similar to BEAM with no backwards ticks allowed.

derek-zhou · October 15, 2022, 3:19pm

At its core the BEAM is a preemptive multi-threaded system. So if the clock is funky then the scheduling will not be fair. If the clock warp too much then I guess some GenServer calls could even timeout.

itopiz · February 19, 2025, 11:57am

I am having the same issue.

My environment:

Host
- Windows 11 Pro 24H2
- VirtualBox 7.1.4
Guest
- NixOS 24.11.20250116.3355599 (Vicuna)
- Linux version 6.6.71
- Elixir 1.14.5 and Erlang/OTP 25 [erts-13.2.2.11]

I also tested on different VM environments (VMware Workstation 15 Pro, Proxmox) and never experienced this issue.

For reference, the issue happens even with all of the following conditions combined:

Disabled the Guest Additions Time Synchronization
Disabled Guest Additions, for good measure
Disabled NTP synchronization
Time Correction enabled with Time Warp Mode set to either No Time Warp Mode or Multi-Time Warp Mode

I enabled all the logs that I could, I cannot understand what causes this issue. For info:

Feb 19 10:42:43 fas myapp[1292]: OS monotonic time stepped backwards!
Feb 19 10:42:43 fas myapp[1292]: Previous time: 417898883425
Feb 19 10:42:43 fas myapp[1292]: Current time:  417898880929
Feb 19 10:42:43 fas systemd-coredump[1570]: Process 1292 (beam.smp) of user 399 terminated abnormally with signal 6/ABRT, processing...
Feb 19 10:42:43 fas systemd[1]: Created slice Slice /system/systemd-coredump.
Feb 19 10:42:43 fas systemd[1]: Started Process Core Dump (PID 1570/UID 0).
Feb 19 10:42:43 fas systemd-coredump[1571]: Removed old coredump core.beam\x2esmp.399.5f2f98555c0e485c811ad1241bae2bc8.3653.1739957532000000.zst.
Feb 19 10:42:43 fas systemd-coredump[1571]: Removed old coredump core.beam\x2esmp.399.5f2f98555c0e485c811ad1241bae2bc8.3903.1739957641000000.zst.
Feb 19 10:42:48 fas systemd-coredump[1571]: Process 1292 (beam.smp) of user 399 dumped core.
                                            
                                            Module crypto_callback.so without build-id.
                                            Module crypto.so without build-id.
                                            Module argon2_nif.so without build-id.
                                            Module asn1rt_nif.so without build-id.
                                            Module dyntrace.so without build-id.
                                            Module libgcc_s.so.1 without build-id.
                                            Module libstdc++.so.6 without build-id.
                                            Module libz.so.1 without build-id.
                                            Module libncursesw.so.6 without build-id.
                                            Module beam.smp without build-id.
                                            Stack trace of thread 1330:
                                            #0  0x00007fa6e969aa9c __pthread_kill_implementation (libc.so.6 + 0x92a9c)
                                            #1  0x00007fa6e9648576 raise (libc.so.6 + 0x40576)
                                            #2  0x00007fa6e9630935 abort (libc.so.6 + 0x28935)
                                            #3  0x0000000000455349 erts_exit_epilogue.cold (beam.smp + 0x55349)
                                            #4  0x0000000000539953 erts_exit_vv (beam.smp + 0x139953)
                                            #5  0x0000000000539aaf erts_exit (beam.smp + 0x139aaf)
                                            #6  0x00000000005ce0fe get_os_drift_corrected_time (beam.smp + 0x1ce0fe)
                                            #7  0x00000000005d0b39 erts_get_monotonic_time (beam.smp + 0x1d0b39)
                                            #8  0x0000000000460e40 scheduler_wait (beam.smp + 0x60e40)
                                            #9  0x0000000000470cf9 erts_schedule (beam.smp + 0x70cf9)
                                            #10 0x00007fa6a72445ad n/a (n/a + 0x0)
                                            ELF object binary architecture: AMD x86-64
Feb 19 10:42:48 fas systemd[1]: systemd-coredump@0-1570-0.service: Deactivated successfully.
Feb 19 10:42:48 fas systemd[1]: systemd-coredump@0-1570-0.service: Consumed 1.970s CPU time, 5M memory peak, 1000K read from disk.
Feb 19 10:42:48 fas systemd[1]: myapp.service: Main process exited, code=dumped, status=6/ABRT
Feb 19 10:42:48 fas systemd[1]: myapp.service: Failed with result 'core-dump'.
Feb 19 10:42:48 fas systemd[1]: myapp.service: Scheduled restart job, restart counter is at 1.
Feb 19 10:42:48 fas systemd[1]: Starting MyApp...

Considering the clear warning in the Erlang documentation (see below), I do not believe disabling time correction (i.e. +c false) is a solution.

Time correction

[…]
You typically never want to disable time correction. Previously a performance penalty was associated with time correction, but nowadays it is usually the other way around. If time correction is disabled, you probably get
bad scalability, bad performance, and bad time measurements.

Does anyone have more information on this? Has anyone reported this to Erlang and/or VirtualBox?

itopiz · February 24, 2025, 10:09am

The issue still persist when using Elixir (1.18.1) and Erlang/OTP 27 [erts-15.2].

itopiz · February 24, 2025, 11:00am

I reported the issue to the Erlang project: Elixir apps running on VirtualBox crash repetitively with `OS monotonic time stepped backwards!` · Issue #9479 · erlang/otp · GitHub

tcoopman · February 24, 2025, 11:07am

The issue is that your VM turned back in time, which erlang doesn’t like.

itopiz:

Feb 19 10:42:43 fas myapp[1292]: OS monotonic time stepped backwards!
Feb 19 10:42:43 fas myapp[1292]: Previous time: 417898883425
Feb 19 10:42:43 fas myapp[1292]: Current time:  417898880929

After reading a bit more on this, I also found this which says disabling time correction is the solution: Passing erl options to erlc? - #7 by rickard - Questions / Help - Erlang Programming Language Forum - Erlang Forums

Reading the time and correction module in Erlang was not entirely clear, but I think all the time warping is about the erlang monotonic time and not the OS monotonic time.

itopiz · February 28, 2025, 7:36am

Issue reported on VirtualBox Forums: Erlang crashes due to: OS monotonic time stepped backwards! - virtualbox.org

itopiz · March 12, 2025, 10:48am

Issue also reported on the VirtualBox bug tracker: #22334 (VM clock goes backwards) – Oracle VirtualBox

itopiz · March 20, 2025, 7:33am

This was indeed an issue with VirtualBox and it has been fixed (#22334 (VM clock goes backwards => fixed in svn) – Oracle VirtualBox). Waiting for a new VirtualBox release now.

dfalling · March 20, 2025, 9:57pm

Thank you for reporting this to them! I should have done that when I first hit this to save us all some grief.