Guardian generating an expired token because System.system_time != DateTime. Is this time warp?

I’m using Guardian to handle JWT tokens, and I noticed that after a given amount of time, the client app would no longer be able to exchange a refresh token for an access token. The request to refresh it would succeed, but the client app would report that the newly generated token was expired, which, sure enough, it was.
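
For context, the refresh flow on the server looks roughly like this (a sketch only; the module name, TTL, and variable names are illustrative, not my actual code):

{:ok, _old, {new_token, new_claims}} =
  MyApp.Guardian.exchange(refresh_token, "refresh", "access", ttl: {60, :minutes})

# The exchange itself succeeds, but the client then compares the "exp" claim against
# its own clock and treats the freshly issued access token as already expired:
new_claims["exp"] - DateTime.to_unix(DateTime.utc_now())
# => negative, i.e. expired, even though the token was issued seconds ago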

Digging into the Guardian source code, the timestamp is generated with System.system_time(:second), not DateTime.utc_now() |> DateTime.to_unix(), and sure enough, when I try both of those, they differ by 97 minutes (my access token TTL is 60 minutes).

According to https://www.unixtimestamp.com/, the timestamp right now is roughly:
1685035696

iex(77)> System.system_time(:second)                                            
1685029895
iex(78)> DateTime.to_unix(DateTime.utc_now())                                   
1685035749
iex(79)> 1685035749 - 1685029895
5854

So my System.system_time is nearly 6000 seconds, ~100 minutes, off the real time. I’ll admit I don’t have the clearest understanding of time warp, but this seems to happen consistently after about 6–8 hours of running the server. I have not changed timezone, and there has been no daylight saving change or manual clock change. Any idea what could be causing this?

EDIT: In a new IEx console, they are equal again:

iex(1)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
0

On the dev server that’s been running for 12 hours:

iex(96)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
5831

Here are the specifics of the current uptime:

iex(107)> :erlang.statistics(:wall_clock)
{31386845, 9639}
iex(108)> :erlang.statistics(:runtime)
{471130, 23}

.tool-versions

elixir 1.14.4-otp-25
erlang 25.3

This is specified in the documentation of System.system_time/0:

It is the VM view of the os_time/0. They may not match in case of time warps although the VM works towards aligning them. This time is not monotonic.
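
In other words, the two clocks are built differently (a rough sketch of how they relate, as I understand it):

# The VM's system time is its monotonic clock plus a slowly adjusted offset,
# while os_time reads the OS wall clock directly on every call.
vm_time = System.monotonic_time(:second) + System.time_offset(:second)
os_time = System.os_time(:second)

# On a healthy node these agree to within a second or so; the drift you are seeing is:
os_time - vm_time
# => ~5800 on your long-running server, 0 in a fresh IEx session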

DateTime.utc_now/0 uses System.os_time under the hood:

@spec utc_now(Calendar.calendar()) :: t
def utc_now(calendar \\ Calendar.ISO) do
  System.os_time() |> from_unix!(:native, calendar)
end

My guess is that the VM schedulers are getting messed up, most probably because some NIFs don’t yield back to the scheduler in time, but this is just a thought.

System.os_time does indeed have the correct time, which confuses me for two reasons: 1. why would System.system_time have any value if its timestamp is not reliable, and 2. why would Guardian opt to use an unreliable timestamp, given that this is for generating a token to send to another system that’s not subject to the same time warp? I can’t think of any reason an inconsistent timestamp would be preferred, especially when a consistent one is available through System.os_time directly or via DateTime. Maybe there’s another factor I’m not understanding.

I think the issue is somewhere deeper in the VM or on your machine; generally speaking this should never happen, or in the worst case the difference should be a few seconds at most.

I understand that there is some deeper cause. You may be right about the NIF claim: I do have some NIFs, and I’m running some ML models via Nx/EXLA, which uses a lot of NIFs. But I guess my other question still remains. If we have the following two options:

  1. Timestamp 1: usually reliable, but for certain reasons can drift from the real time, sometimes by a few seconds, sometimes by 97 minutes

  2. Timestamp 2: reliable, and doesn’t seem to drift from the real time

What would be the reason to ever use timestamp 1?

1 Like

Here is some material of interest: https://medium.com/@jlouis666/erlang-dirty-scheduler-overhead-6e1219dcc7. Try enabling dirty schedulers and see if there’s a difference.

Generally speaking, I have no idea why a timestamp managed by the VM exists. As for why Guardian uses that option, I think they simply weren’t aware of this issue with NIFs.

I have updated my NIF to use a dirty scheduler:

use tokenizations::get_alignments;

// Schedule this NIF on a dirty CPU scheduler so it can't hold up a regular scheduler thread
#[rustler::nif(schedule = "DirtyCpu")]
fn align_tokens(toks_a: Vec<String>, toks_b: Vec<String>) -> (Vec<Vec<usize>>, Vec<Vec<usize>>) {
    get_alignments(&toks_a, &toks_b)
}
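
For completeness, the Elixir side is just the usual Rustler stub module, roughly like this (the module, OTP app, and crate names here are illustrative):

defmodule MyApp.Tokenizations do
  # Loads the compiled Rust crate; whether the call runs on a dirty CPU scheduler or a
  # regular one is decided by the `schedule` attribute on the Rust side, not here.
  use Rustler, otp_app: :my_app, crate: "tokenizations"

  # Stub that is replaced when the NIF is loaded.
  def align_tokens(_toks_a, _toks_b), do: :erlang.nif_error(:nif_not_loaded)
end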

But this NIF can’t have been called more than 20 times since the server was started (most of the time I was sleeping, apart from a bit of work last night). And when I run this NIF, it’s not exactly slow.
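
Both benchmark runs below were invoked with Benchee roughly like this (a sketch; the real inputs are app-specific token lists, so the placeholder values here are just for illustration):

# Placeholder inputs standing in for a typical pair of token lists from the app.
toks_a = ~w(hello world this is a test)
toks_b = ~w(hello world this is a test)

Benchee.run(
  %{"nif" => fn -> MyApp.Tokenizations.align_tokens(toks_a, toks_b) end},
  warmup: 1,
  time: 5,
  memory_time: 2,
  reduction_time: 2
)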

On the out-of-sync server, running the NIF on a regular scheduler with a typical input, via Benchee:

Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.14.4
Erlang 25.3

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 5 s
memory time: 2 s
reduction time: 2 s
parallel: 1
inputs: none specified
Estimated total run time: 10 s

Benchmarking nif ...

Name           ips        average  deviation         median         99th %
nif       132.61 K        7.54 μs   ±117.07%        7.39 μs        8.66 μs

Memory usage statistics:

Name    Memory usage
nif          1.52 KB

**All measurements for memory usage were the same**

Reduction count statistics:

Name Reduction count
nif              157

And on a freshly started server, running on a dirty CPU scheduler (the code pasted above):

Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.14.4
Erlang 25.3

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 5 s
memory time: 2 s
reduction time: 2 s
parallel: 1
inputs: none specified
Estimated total run time: 10 s

Benchmarking nif ...

Name           ips        average  deviation         median         99th %
nif        54.39 K       18.38 μs   ±303.40%       13.79 μs       57.21 μs

Memory usage statistics:

Name    Memory usage
nif          1.78 KB

**All measurements for memory usage were the same**

Reduction count statistics:

Name         average  deviation         median         99th %
nif           132.78   ±201.61%            118            238

It’s a fairly mature lib, with 3.3k stars and nearly 9 million downloads, which always leads me to believe it’s something I’m not understanding rather than the other way around. But from what you’ve said, and what I’ve seen, I can’t think of any reason one would prefer System.system_time to System.os_time in this scenario (or any scenario, for that matter).

I appreciate your help and insight. Thank you.

Does this happen only on your M1? There have already been some topics here about strange overheads when calling OS binaries.

I’ve only seen it on my M1, but as the project is still in development, it hasn’t had any real usage/runtime outside of the M1. I do have a server it’ll be running on, though it’s not running right now. I’ll start that up, maybe with a task that trickles calls to this NIF and the ML models for 24 hours, and see what the server reports. I’ve had this issue happen fairly consistently locally, though. It seems to be related to how long the server has been running rather than usage: if I start the server and then ~12 hours later make a request from the client, I’ll see the “expired JWT” issue (not because the JWT is 12 hours old, but because the newly created access token is already expired) until I restart the server. But I won’t see it after, say, 4–6 hours of development, where the NIF would be getting called more often.

It is also a fairly atypical setup, because the project relies on a few ML models: it’s running Bumblebee, which starts up a couple of models via some NIFs and EXLA, erlport, which is running some Python code, and the Rust NIF from above. That certainly isn’t the most common use case, but if a change from system_time to os_time can fix the issue for even a few people, it seems useful. I’ll open an issue on GitHub to see if there’s a reason for the use of one over the other.

1 Like

Judging from the fact that the previous topic observed some kind of CPU optimisation under the hood causing those overheads, I have a feeling the schedulers start to lag because of some CPU throttling. Maybe with time we will find the actual problem.

I might also just start a plain IEx console, leave it running, and see if the time drifts, as that would point to it being M1/OS related. I wonder if the laptop going to sleep could be causing the issue; that’s almost certainly been the case each time I’ve noticed this. Of course I wasn’t taking detailed notes, but I can’t think of a time where I’ve started the server and the laptop wasn’t asleep for a few hours at some point, and it’s usually after that that I notice it. Again, though, this is just from vague recollection.
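
To take the vague recollection out of it, I could have that idle session log the drift periodically, something like this (a quick sketch, not production code):

# Prints the os_time vs system_time delta every 60 seconds, with a wall-clock stamp,
# so sleep/wake events can be correlated with when the drift appears.
spawn(fn ->
  Stream.interval(60_000)
  |> Enum.each(fn _ ->
    drift = System.os_time(:second) - System.system_time(:second)
    IO.puts("#{DateTime.utc_now()} drift: #{drift}s")
  end)
end)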

2 Likes

FYI here’s a link to the Github issue I created on Guardian to see if a change to os_time would be reasonable or not: system_time vs os_time and time drift · Issue #721 · ueberauth/guardian · GitHub

Well, 7 hours later, I don’t think my laptop has been in sleep mode at all, as I’ve been on it all day, and even the plain IEx shell, with nothing in it, no code loaded, nothing executed, is ~3.5 minutes out of sync :frowning:

iex(6)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
197
1 Like

I have been running my machine with iex open for about 4 hours now (ubuntu 22.04 x86_64):

iex(1)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
0

I would suggest opening a ticket on the OTP repository instead of Guardian, as this is clearly an issue on M1 devices.

1 Like

Agreed. The underlying issue is almost certainly something in OTP rather than Guardian, but I still think there’s value in Guardian switching from system_time to os_time, no?

Also, interestingly, my IEx session has drifted back to 0.

iex(1)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
0
iex(2)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
196
iex(3)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
196
iex(4)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
197
iex(5)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
196
iex(6)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
197
iex(7)> 197 / 60
3.283333333333333
iex(8)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
195
iex(9)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
17
iex(10)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
17
iex(11)> DateTime.to_unix(DateTime.utc_now()) - System.system_time(:second)
0
2 Likes

You are 100% right here; using a reliable timestamp is essential.

Hey folks, a Guardian maintainer here :wave:t2:

First, I would like to clarify that I was not the one who implemented this feature, nor do I consider myself an expert in any way.

Reading https://github.com/elixir-lang/elixir/blob/9fd97c466301baa74b309f3ea62215a12911c4d8/lib/elixir/lib/system.ex#L39-L50

It is a matter of deciding “who” defines time. Using the OS-level time could introduce problems, since the OS clock can be changed back and forth.

Reading more about the topic at System.system_time vs System.os_time - #4 by josevalim

From @josevalim’s perspective (which I agree with), it does not matter since, in practice, we may be using UTC anyway.

I think it is prudent to fix the situation and use DateTime for this case!
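
Concretely, the change being discussed is small: derive the token timestamp from the OS clock instead of the VM clock, something along these lines (a sketch of the idea, not an actual patch):

# Current behaviour: the VM's view of time, subject to the drift described above.
timestamp = System.system_time(:second)

# Proposed behaviour: the OS wall clock, which tracked "true time" in every test above.
timestamp = DateTime.utc_now() |> DateTime.to_unix()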

I agree that there are many situations where system_time is probably fine. The reason it seems more important here for the time used by Guardian to closely match the true time is that clients consume this timestamp. It’s not merely an internal thing where drift doesn’t matter because an hour is still an hour. If a client reads the timestamp and sees that it’s valid against the current true time, the server can still reject the token because the server’s system_time has drifted. So clients have to code in: “check the timestamp; if it’s valid, use it; but if the server rejects it anyway, remember that the server sometimes makes up its own time that we can’t see, so reauthenticate”. It may just be my computer, but I doubt it. I’ve seen countless times where system_time is hours off of the true time; I’ve never seen os_time be any noticeable amount off. “True time” here means time that at least two other sources agree upon (usually my phone, my computer, and a UTC atomic clock website).

The problem compounds further when system_time drifts by more than the token TTL, because the server has then issued either a token that’s already expired as far as the client is concerned, or a token that’s not yet valid. At that point the timestamp becomes irrelevant: it could say “3 weeks last Tuesday”, and that may still be valid if the system_time has drifted that much. I’ve seen it drift up to 6 hours in 24 hours.

1 Like