Oban stops processing dynamic queues

anthonator · July 24, 2023, 4:27pm

Hey all,

We encountered an issue over the weekend where Oban (OSS) stopped processing jobs in all of our dynamic queues. Static queues didn’t seem to be affected. I did see the “Jobs or Plugin aren’t Running” section in the troubleshooting guide but we are on a single node so that doesn’t seem relevant to this situation. I also saw this topic in Elixir Forum but it didn’t seem relevant either.

Here is our config:

config :integration_shopify, Oban,
  notifier: Oban.Notifiers.PG,
  plugins: [
    {Oban.Plugins.Gossip, []},
    {Oban.Plugins.Lifeline, []},
    {Oban.Plugins.Pruner, []},
    {Oban.Plugins.Reindexer, [schedule: "0 8 * * *", timeout: :infinity]},
    {Oban.Plugins.Stager, []}
  ]

Just for context: we have a multi-tenant system where each tenant receives its own queue. We have over 500 customers using the system, so we can have a lot of queues going at once. We typically have no issues other than this one and another issue we're seeing, but I will create a separate topic for that.

Has anybody seen this issue before?

anthonator · July 24, 2023, 4:28pm

Also, we are on Elixir 1.14.5 and Oban 2.15.2.

sorentwo · July 25, 2023, 2:06pm

When you say “dynamic queues”, do you mean Pro’s DynamicQueues, or something you’ve set up manually to use Oban.start_queue? I’m guessing the former because of your config.

My hunch is that you encountered a pubsub issue and the midwife never received the signal to start queues. Do you have any logging or errors that you can share?

anthonator:

  plugins: [
    {Oban.Plugins.Gossip, []},
    {Oban.Plugins.Lifeline, []},
    {Oban.Plugins.Pruner, []},
    {Oban.Plugins.Reindexer, [schedule: "0 8 * * *", timeout: :infinity]},
    {Oban.Plugins.Stager, []}
  ]

You don’t need to run Gossip unless you’re using Oban Web. In fact, it can be a little detrimental because it adds pubsub notification overhead.

anthonator · July 25, 2023, 4:57pm

We are just using the open-source version. We are not using Pro.

We start queues using Oban.start_queue/2 when our system starts and when an account is created.
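For readers following along, here is a minimal sketch of what starting per-tenant queues with Oban.start_queue/2 might look like. The module name, the "tenant_<id>" naming scheme, and the limit of 5 are assumptions for illustration only and are not taken from the poster's code.

```elixir
defmodule TenantQueues do
  # Hypothetical helper module; names and limits are illustrative assumptions.

  # Build a queue name for a tenant. A string is used so that no atoms are
  # created dynamically for each account.
  def queue_name(account_id), do: "tenant_#{account_id}"

  # Start one queue for a single account, e.g. right after it is created.
  def start_for(account_id, oban_name \\ Oban) do
    Oban.start_queue(oban_name, queue: queue_name(account_id), limit: 5)
  end

  # Start queues for every known account, e.g. once at application boot.
  def start_all(account_ids, oban_name \\ Oban) do
    Enum.each(account_ids, &start_for(&1, oban_name))
  end
end
```

Queues started this way are not part of the static `queues` configuration, so nothing brings them back automatically if the Oban supervision tree restarts.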

The behavior we’re seeing is the system works fine and then the queues randomly stop processing. No deploys, restarts or anything. Everything is working and then it’s not. I’m not sure if that meshes with your midwife hunch or not.

Thanks for the tip on the gossip plugin. We’ll remove that. We also removed the stager plugin yesterday.

sorentwo · July 25, 2023, 7:37pm

Considering that and the fact that the static queues come back, I think something is crashing and taking down the supervision tree. When the tree returns, the static queues restart and the dynamic ones don’t.

If you’re unable to find and diagnose the crash, then you can switch how you start dynamic queues to use the init event to ensure that they’re restarted later.
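A rough sketch of that suggestion, assuming the init event is the `[:oban, :supervisor, :init]` telemetry event and that its metadata carries the Oban `conf`. The module name, the `list_queues` callback, the queue limit, and the one-second delay are all hypothetical. The delay is a crude assumption: as far as I understand, the event fires while the supervisor is still initializing, so its children may not be ready to receive a start signal yet.

```elixir
defmodule DynamicQueueStarter do
  # Hypothetical module; a sketch under the assumptions stated above.

  @handler_id "dynamic-queue-starter"

  # `list_queues` is a zero-arity function returning the queue names to start,
  # for example by loading the tenant accounts from the database.
  def attach(list_queues) when is_function(list_queues, 0) do
    :telemetry.attach(
      @handler_id,
      [:oban, :supervisor, :init],
      &__MODULE__.handle_event/4,
      list_queues
    )
  end

  def handle_event([:oban, :supervisor, :init], _measurements, %{conf: conf}, list_queues) do
    # Defer the work to a separate process so the handler returns immediately
    # and the supervisor can finish starting its children.
    Task.start(fn ->
      # Assumed grace period; a sturdier version might retry instead.
      Process.sleep(1_000)

      for queue <- list_queues.() do
        Oban.start_queue(conf.name, queue: queue, limit: 5)
      end
    end)

    :ok
  end
end
```

The handler would need to be attached before Oban starts, for example at the top of the application's `start/2` callback, so that the first init event and every later restart are both covered.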

anthonator · August 21, 2023, 6:41pm

Thanks for the insight!

We implemented your suggestion around starting the dynamic queues when we receive the init event. Unfortunately, we’ve continued to run into this issue.

I did some additional digging during the most recent incident. Below are my findings.

All dynamic queue processes seemed to still be running. However, none of the dynamic queues were processing.
When I attempted to call check_queue/2, the GenServer call would time out.
If I killed the Oban process, some queues would restart and others wouldn't.
If I tried to manually start a queue that didn't come back up after the Oban process restarted, it would give me an :ok, but we'd continue to receive an error that the process could not be found when calling check_queue/2.
When the Oban process came back up it had the PID #PID<0.961.1190>, which I believe means it saw the PID as remote despite only having one node running.
The only way I was able to get all queues to start processing again was to restart the server.
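A small sketch of how the check described above could be wrapped so that a hung or missing queue is reported rather than crashing the caller. The module and function names are hypothetical, and it assumes both the timeout and the missing-process failure surface as exits from check_queue/2, which may differ between Oban versions.

```elixir
defmodule QueueProbe do
  # Hypothetical diagnostic helper; not part of Oban.

  # Returns {:ok, state} when the queue responds, or {:error, reason} when
  # the underlying call exits (for example on a timeout).
  def probe(queue, oban_name \\ Oban) do
    {:ok, Oban.check_queue(oban_name, queue: queue)}
  catch
    :exit, reason -> {:error, reason}
  end

  # Probe a list of queues and keep only the ones that failed to respond.
  def unresponsive(queues, oban_name \\ Oban) do
    Enum.filter(queues, fn queue ->
      match?({:error, _}, probe(queue, oban_name))
    end)
  end
end
```

Running `QueueProbe.unresponsive/1` over the tenant queue names during an incident would show whether every dynamic queue is stuck or only a subset.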