linusdm

How to investigate Oban jobs stuck in 'available' state, across all queues?

Hi! :wave:

This weekend I experienced jobs accumulating in the available state across all queues. We run a two-node setup. One of the nodes was the leader while jobs were accumulating (checked with Oban.Peer.leader?/1), and none of the queues were paused (checked with Oban.check_queue/2). All Oban config is fairly default (the :peer option is not set, and there are two plugins: Cron and Pruner).

Unfortunately I had no time to investigate. Respawning two new nodes (we’re on k8s) solved the issue and jobs were being processed again.

I want to prevent this in the future, but I’m not sure what I can do next time to get to the bottom of this. I checked the troubleshooting guide, but I don’t think there is something in there for this situation (I’m not using PgBouncer for example). I read something about the Stager modes (local vs global), but I’m not sure how to query for that (and if it’s useful information at all).

Most Liked

sorentwo

Oban Core Team

The peer, notifier, and stager events are all helpful for diagnosing whatever happened in your use-case.
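For anyone wanting to capture those events, here is a minimal sketch of a telemetry handler that logs them. ObanDiagnostics is a hypothetical module name, and the event names and metadata keys (:leader, :mode, :status) are what I understand from the Oban.Telemetry docs; please check them against the Oban version you run, since some of these events only exist in newer releases.

```elixir
defmodule ObanDiagnostics do
  @moduledoc """
  Hypothetical helper (not part of Oban) that logs peer, stager, and
  notifier telemetry events so a stall can be reconstructed from logs.
  """

  require Logger

  # Assumption: these event names match your installed Oban version.
  @events [
    [:oban, :peer, :election, :stop],
    [:oban, :stager, :switch],
    [:oban, :notifier, :switch]
  ]

  # Call once at application start, before Oban is started.
  def attach do
    :telemetry.attach_many("oban-diagnostics", @events, &__MODULE__.handle_event/4, nil)
  end

  # Telemetry handler callback: logs one line per event.
  def handle_event(event, _measurements, metadata, _config) do
    Logger.info(describe(event, metadata))
  end

  # Turns an event and its metadata into a log line.
  def describe([:oban, :peer, :election, :stop], meta),
    do: "oban peer election: leader=#{inspect(meta[:leader])}"

  def describe([:oban, :stager, :switch], meta),
    do: "oban stager switch: mode=#{inspect(meta[:mode])}"

  def describe([:oban, :notifier, :switch], meta),
    do: "oban notifier switch: status=#{inspect(meta[:status])}"
end
```

With that attached, a stager switching from global to local mode, or a notifier changing status, shows up in the logs with a timestamp you can line up against the moment jobs started piling up.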

sorentwo

Oban Core Team

This seems like an issue with Postgres pubsub. A disconnect error in versions prior to v0.20 would leave the Oban.Notifier in an undetectable disconnected state. The node would keep emitting notifications about available jobs, but they would never actually be broadcast.

There are several ways you could monitor this:

  1. Use check-ins in a tracker like Sentry or Honeybadger. Send a check-in from a cron job every minute and get alerted if they stop arriving.
  2. Use cron monitors in Sentry, which have native support for Oban in recent versions.
  3. Poll job counts and send them to an external tool like Grafana, Datadog, etc., and set up alerts on those. If you’re using the PG notifier, then Oban.Met.latest/2 is a good option. Otherwise, query the actual database every minute or so.
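As a rough illustration of the "query the actual database" option, the sketch below counts available jobs per queue with plain SQL and flags queues over a threshold. ObanBacklog, MyApp.Repo, and the threshold are all hypothetical, and it assumes the default oban_jobs table name in the default schema; how the result gets shipped to your metrics tool is left out.

```elixir
defmodule ObanBacklog do
  @moduledoc """
  Hypothetical helper (not part of Oban) for polling the backlog of
  available jobs straight from the database.
  """

  # Assumption: jobs live in the default "oban_jobs" table.
  @sql """
  SELECT queue, count(*)
  FROM oban_jobs
  WHERE state = 'available'
  GROUP BY queue
  """

  # Assumption: MyApp.Repo is your application's Ecto repo.
  def available_counts(repo \\ MyApp.Repo) do
    repo
    |> Ecto.Adapters.SQL.query!(@sql, [])
    |> Map.fetch!(:rows)
    |> rows_to_counts()
  end

  # Converts [[queue, count], ...] rows into a %{queue => count} map.
  def rows_to_counts(rows) do
    Map.new(rows, fn [queue, count] -> {queue, count} end)
  end

  # Returns the names of queues whose backlog exceeds the threshold.
  def over_threshold(counts, threshold) do
    for {queue, count} <- counts, count > threshold, do: queue
  end
end
```

Running available_counts/1 from a periodic process on each node (rather than from an Oban job, which would stall along with everything else) and alerting on over_threshold/2 is one way to catch this kind of accumulation early.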

sorentwo

Oban Core Team

Without a running instance it’s difficult to say what the problem may have been. The fact that restarting got things running again points to an ephemeral issue, rather than something with the jobs table itself.

The stager mode would be useful information, because it would indicate whether the nodes were able to communicate with each other and whether the stager “thought” it was notifying queues about available jobs. Staging events are included in the default logger, which is recommended as part of the Ready for Production guide.
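For reference, attaching the default logger is typically a one-liner in the application's start callback. This is only a sketch: MyApp.Application, MyApp.Repo, and the :my_app config key are placeholder names, and which events the default logger covers depends on your Oban version.

```elixir
defmodule MyApp.Application do
  use Application

  @impl Application
  def start(_type, _args) do
    # Attach before starting the supervision tree so early events are logged.
    Oban.Telemetry.attach_default_logger()

    # Assumption: Oban is configured under the :my_app application env.
    children = [
      MyApp.Repo,
      {Oban, Application.fetch_env!(:my_app, Oban)}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```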
