I’m running Elixir/Phoenix in a VirtualBox instance. After a short period of time (typically 5 minutes or so), Elixir quits with this output:
Monotonic time stepped backwards!
Previous time: 2140883699436
Current time: 2140883692221
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed
zsh: IOT instruction (core dumped) mix phx.server
I can’t find any sort of erlang crash log. What should I do to troubleshoot this crash? At the time of the crash I have plenty of memory and swap available.
VM OS: Lubuntu 22.04.1 LTS
VM Memory: 2GB
VM Swap: 2GB
Elixir: 1.14.0
Erlang: 25.0.4
I think you might need to report this upstream to the Erlang issue tracker. Monotonic time going backwards is definitely a case for shutting down! At a quick glance I didn’t see this issue already reported.
Are you able to try it in a different virtual environment to rule out this being a virtualbox time issue rather than an Erlang VM issue?
Based on this article
Erlang monotonic time. This is Erlang’s view of the OS monotonic time if available, or the VM’s own monotonic version of the system time if not available. This is the adjusted clock that is used for events, timers, and so on. Its stability makes it ideal to count a time interval. Do note that this time is monotonic, but not strictly monotonic, meaning that the clock can’t go backwards, but it can return the same value many times!
I don’t know if it is possible for VB to report a monotonic time that goes backwards, but if it can, then the VM that is setting its own monotonic time based on that “OS” monotonic time could crash if the time is not actually monotonic.
Switching VM’s would be a pretty big pain. I tried exporting my existing setup to VMWare but couldn’t get the image to work, and really don’t want to build this environment back up again.
I think you’re right that this is a VM issue, not Erlang. I’m guessing there isn’t a way to make Erlang more fault-tolerant to this issue? I’m not experiencing any other issues in my environment with other tooling or apps.
One strange issue is I’ve found this doesn’t affect all Elixir instances. I have my server running and fired up my tests. The tests crashed during compilation due to the time, but the server kept on running no problem. Then I reran the tests and they compiled and ran ok, but the server crashed shortly afterwards. Guessing Elixir/Erlang have to be actively working on something to detect the time error?
Did you look at the Time Warp flags in that post I linked about Erlang time? In particular this seems maybe applicable:
+C single_time_warp This is a special hybrid mode to be used on embedded hardware when you know Erlang boots before the OS clock is synchronized. It works in two phases:
a.(with +c true) When the system boots, the monotonic clock is kept as stable as possible, but no system time adjustments are made
b. (with +c false) Same as no_time_warp
The user calls erlang:system_flag(time_offset, finalize), the Erlang system time warps once to match the OS system time, and then the clocks become equivalent to those under no_time_warp.
Aah, I’m sorry. I read your comment and quote but didn’t follow the link. That sounds exactly like what I want. I have it running right now and no crashes so far.
To summarize what I’m trying, it’s elixir --erl "+c false" -S mix phx.server. The description sounds like exactly what I want:
If the OS system time jumps backwards, the Erlang Monotonic clock stalls until the OS system time jumps back forward, which can take a while.
Erlang’s clock stalling for a bit is much better (in my dev environment) than the whole thing crashing.
Thank you! I’ll report back here if it ends up dying eventually, but it looks promising.
Just curious if the fix was persistent. If so, I’m wondering if someone smarter than me can explain what could be the potential pitfalls of having the BEAM clock stall at random times as dictated by the OS clock. I’m guessing log messages might end up with incorrect time stamps that could make debugging harder. Staying synchronous across multiple nodes I think might be another that is potentially only relevant for certain types of applications. Anything else?
Yeah, still good. It looks like this solved it (usually it would have crashed a hundred times by now).
I’m curious about the implications as well. The clock rolling back is probably a huge red flag normally. I wonder what happens internally in Erlang’s VM when the clock is paused waiting for a successful forward tick.
Might be worth reporting to VirtualBox as a bug, assuming expected behavior for monotonic time in VB is similar to BEAM with no backwards ticks allowed.
At its core the BEAM is a preemptive multi-threaded system. So if the clock is funky then the scheduling will not be fair. If the clock warp too much then I guess some GenServer calls could even timeout.
I enabled all the logs that I could, I cannot understand what causes this issue. For info:
Feb 19 10:42:43 fas myapp[1292]: OS monotonic time stepped backwards!
Feb 19 10:42:43 fas myapp[1292]: Previous time: 417898883425
Feb 19 10:42:43 fas myapp[1292]: Current time: 417898880929
Feb 19 10:42:43 fas systemd-coredump[1570]: Process 1292 (beam.smp) of user 399 terminated abnormally with signal 6/ABRT, processing...
Feb 19 10:42:43 fas systemd[1]: Created slice Slice /system/systemd-coredump.
Feb 19 10:42:43 fas systemd[1]: Started Process Core Dump (PID 1570/UID 0).
Feb 19 10:42:43 fas systemd-coredump[1571]: Removed old coredump core.beam\x2esmp.399.5f2f98555c0e485c811ad1241bae2bc8.3653.1739957532000000.zst.
Feb 19 10:42:43 fas systemd-coredump[1571]: Removed old coredump core.beam\x2esmp.399.5f2f98555c0e485c811ad1241bae2bc8.3903.1739957641000000.zst.
Feb 19 10:42:48 fas systemd-coredump[1571]: Process 1292 (beam.smp) of user 399 dumped core.
Module crypto_callback.so without build-id.
Module crypto.so without build-id.
Module argon2_nif.so without build-id.
Module asn1rt_nif.so without build-id.
Module dyntrace.so without build-id.
Module libgcc_s.so.1 without build-id.
Module libstdc++.so.6 without build-id.
Module libz.so.1 without build-id.
Module libncursesw.so.6 without build-id.
Module beam.smp without build-id.
Stack trace of thread 1330:
#0 0x00007fa6e969aa9c __pthread_kill_implementation (libc.so.6 + 0x92a9c)
#1 0x00007fa6e9648576 raise (libc.so.6 + 0x40576)
#2 0x00007fa6e9630935 abort (libc.so.6 + 0x28935)
#3 0x0000000000455349 erts_exit_epilogue.cold (beam.smp + 0x55349)
#4 0x0000000000539953 erts_exit_vv (beam.smp + 0x139953)
#5 0x0000000000539aaf erts_exit (beam.smp + 0x139aaf)
#6 0x00000000005ce0fe get_os_drift_corrected_time (beam.smp + 0x1ce0fe)
#7 0x00000000005d0b39 erts_get_monotonic_time (beam.smp + 0x1d0b39)
#8 0x0000000000460e40 scheduler_wait (beam.smp + 0x60e40)
#9 0x0000000000470cf9 erts_schedule (beam.smp + 0x70cf9)
#10 0x00007fa6a72445ad n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64
Feb 19 10:42:48 fas systemd[1]: systemd-coredump@0-1570-0.service: Deactivated successfully.
Feb 19 10:42:48 fas systemd[1]: systemd-coredump@0-1570-0.service: Consumed 1.970s CPU time, 5M memory peak, 1000K read from disk.
Feb 19 10:42:48 fas systemd[1]: myapp.service: Main process exited, code=dumped, status=6/ABRT
Feb 19 10:42:48 fas systemd[1]: myapp.service: Failed with result 'core-dump'.
Feb 19 10:42:48 fas systemd[1]: myapp.service: Scheduled restart job, restart counter is at 1.
Feb 19 10:42:48 fas systemd[1]: Starting MyApp...
Considering the clear warning in the Erlang documentation (see below), I do not believe disabling time correction (i.e. +c false) is a solution.
Time correction
[…] You typically never want to disable time correction. Previously a performance penalty was associated with time correction, but nowadays it is usually the other way around. If time correction is disabled, you probably get
bad scalability, bad performance, and bad time measurements.
Does anyone have more information on this? Has anyone reported this to Erlang and/or VirtualBox?
Reading the time and correction module in Erlang was not entirely clear, but I think all the time warping is about the erlang monotonic time and not the OS monotonic time.