Understanding queries made by Oban Pro

Hi,

We have a Phoenix/LiveView app, with web servers and separate job servers that run Oban Pro. There are four job servers running.

When we look at our database metrics, queries that appear to be internal to Oban dominate the load on our database, so we’re trying to understand this better.

At all times, but particularly during load spikes, our load visualizer shows the same pattern over any 5-minute period:

[screenshot: database load broken down by query]

The query at the top is called 33,384 times in 5 minutes, and it’s this one:

SELECT 
    o0."state", 
    o0."scheduled_at" 
FROM 
    "public"."oban_jobs" AS o0 
WHERE 
    (o0."meta" @> $1) 
    AND (o0."state" != $3) 
    AND (o0."id" < $2) 
ORDER BY 
    o0."id" DESC 
LIMIT $4;

I’ve searched our code pretty exhaustively and I’m unable to find this query, so it seems like it might be internal to Oban.

We’re not sure where to find more information on this (my Google-fu fails me; it might be part of Oban Pro?), so we’d like to understand why it’s being called this much and how to diagnose or reduce the number of calls, if possible.

Anyone have any ideas?


That looks like the query a Chain worker would use in versions of Pro prior to v1.5. That query is called before each chain job runs, and that can really add up with a busy queue.
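For context, a pre-1.5 chain worker was declared roughly like this (a sketch with invented module, queue, and by: values):

  defmodule MyApp.SyncWorker do
    # Sketch of a pre-v1.5 chain worker. Before each job ran, the chain
    # checked the state of the previous job in the chain -- that per-job
    # lookup is the SELECT you posted, so it fires once per chained job.
    use Oban.Pro.Workers.Chain, queue: :sync, by: :worker

    @impl true
    def process(%Oban.Job{args: _args}) do
      # ... the serialized work goes here ...
      :ok
    end
  end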

The good news is that query is gone and the approach is completely different in v1.5. There’s a description of the change and a small demo in one of the launch week announcements.


It sounds like this could really fix our problem. This is wonderful!

We have a large production app that runs LOTS and lots of jobs. I see that 1.5 is a release candidate, but do you feel it’s pretty stable at this point? Fully ready for production?

Obviously we’ll do our own testing, but just looking for a gut check on how cautious we should be in this upgrade.

Thank you so much for this response!


There aren’t any known issues at this point and some larger apps, including the ones we manage, are running it in production. It’s fully ready as far as we know!

We deployed 1.5.0-rc.3 and we’re getting this error. Any ideas?

  [error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Pro.Plugins.DynamicLifeline}}} terminating
  ** (ArgumentError) errors were found at the given arguments:

    * 1st argument: not a list

      (erts 15.0.1) :erlang.length(nil)
      (oban_pro 1.5.0-rc.3) lib/oban/pro/workflow.ex:1300: Oban.Pro.Workflow.to_rescue_operation/1
      (elixir 1.17.2) lib/enum.ex:1703: Enum."-map/2-lists^map/1-1-"/2
      (oban_pro 1.5.0-rc.3) lib/oban/pro/workflow.ex:1176: Oban.Pro.Workflow.rescue_workflows/2
      (oban_pro 1.5.0-rc.3) lib/oban/pro/plugins/dynamic_lifeline.ex:183: anonymous fn/2 in Oban.Pro.Plugins.DynamicLifeline.handle_info/2

That’s new to me :thinking:. It’s mapping the result of a Repo.all, which should always return a list and not nil. Were there any database errors around that time? Did this happen after the initial deploy?
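Roughly, the failing path has this shape (a sketch built from the stack trace, not Pro’s actual source):

  # Repo.all/2 returns a list (possibly empty), never nil, so the
  # Enum.map/2 over its result should be safe -- which is what makes
  # :erlang.length(nil) surprising here.
  rescue_query
  |> Repo.all()
  |> Enum.map(&to_rescue_operation/1)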

It happens every minute (which is the default interval?), and it has been persistent since we deployed 1.5.

On the whole our database load is now a fraction of what it was, so we’re so, so grateful, thank you for continuing to improve this library :pray:

We have a few odd issues that we’re trying to narrow down. One is what Adam mentioned above. It’s not critical, but it is an error we’re getting and we’re not sure why. We have a very vanilla configuration; we’re not passing any options.

We are using some regular Oban.Worker and some Oban.Pro.Worker. Does it play nicely with regular Oban.Worker? Should we convert all Oban.Worker to Oban.Pro.Worker?

Another, more critical issue is that jobs that are part of some chains or workflows sometimes have their scheduled_at set to 3000-01-01 00:00:00. I’ve been trying to narrow down whether it’s something we’re doing, but I don’t think it is. Here’s one snippet:

  def workflow(listener, meeting, opts) do
    jobs = jobs(listener, meeting, opts)
    deps = Enum.map(jobs, &"action_#{&1.changes.args.action_id}")

    start =
      Workflow.new()
      |> Workflow.add(
        "extract_facts",
        ExtractFacts.new(%{meeting_id: meeting.id})
      )

    jobs
    |> Enum.reduce(
      start,
      fn %{changes: %{args: %{action_id: id}}} = job, acc ->
        Workflow.add(acc, "action_#{id}", job)
      end
    )
    |> Workflow.add(
      "run_analytics",
      RunAnalytics.new(%{meeting_id: meeting.id})
    )
    |> Workflow.add(
      :done,
      AllAutomationsDone.new(%{meeting_id: meeting.id}),
      deps: ["run_analytics" | deps]
    )
  end

For some reason the AllAutomationsDone job gets a scheduled_at of 3000-01-01 00:00:00. I’m not sure if that’s a placeholder value of some sort, but it never recovers, and it’s just stuck there AFAICT.

We also have other jobs that are not part of workflows, but do pass the chain option, such as:

  use Oban.Pro.Worker,
    queue: :sync,
    max_attempts: 2,
    chain: [by: [:worker, args: :integration_id]]

One last oddity is that all our discarded jobs have disappeared; we are permanently at zero jobs in discarded (which is strange, since there are always a few known failures happening). They seem to be getting pruned immediately. We’re still trying to narrow down what’s going on there, but it is definitely new as of the 1.5 release (where I only made the mechanical changes from the Oban Pro 1.5 upgrade docs, which were very helpful).

Also, is this the best place to report these kinds of issues? Happy to move this elsewhere if needed.

Alright, that rescue error is something to investigate then.

Fantastic, that drop in database load is exactly what we hope for!

There are considerations for regular Oban.Worker in many places, but I strongly recommend using Oban.Pro.Worker consistently. That will allow you to use hooks throughout, if nothing else.
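The conversion is mostly mechanical; the main difference is that Pro workers implement process/1 instead of perform/1. A sketch, with an invented module name:

  defmodule MyApp.SomeWorker do
    # Was: use Oban.Worker, queue: :default
    use Oban.Pro.Worker, queue: :default

    # Pro workers define process/1 rather than perform/1; the return
    # values (:ok, {:ok, value}, {:error, reason}, ...) behave the same.
    @impl true
    def process(%Oban.Job{args: args}) do
      IO.inspect(args, label: "processing")
      :ok
    end
  end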

That’s intentional and not a problem at all. Jobs in workflows or chains are put “on hold” using a combination of meta and that scheduled_at timestamp. It’s a pseudo-state used to work around the inflexibility of Postgres enums.

Any job that has dependencies will be put in that on-hold state, and once its dependencies have run it’s made available again. That’s all from changes made in Pro v1.4, and it should all be running smoothly at this point.
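If you want to see which jobs are currently held, the far-future scheduled_at makes them easy to spot from iex. A read-only sketch, where MyApp.Repo stands in for your repo:

  import Ecto.Query

  # Held workflow/chain jobs are parked with a far-future scheduled_at
  # (the year-3000 timestamp you're seeing) until their deps complete.
  MyApp.Repo.all(
    from j in "oban_jobs",
      where: j.scheduled_at >= ^~N[3000-01-01 00:00:00],
      select: %{id: j.id, worker: j.worker, state: j.state, meta: j.meta},
      limit: 50
  )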

What is your pruning configuration? The Pruner and DynamicPruner will emit telemetry with counts of how many jobs are pruned at each interval. You can use that to verify how many discarded jobs are being deleted.
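For example, something like this attached at application start will log what the pruner reports on each run (a sketch; inspect the metadata rather than assuming its exact keys):

  # Every Oban plugin emits [:oban, :plugin, :stop] telemetry with the
  # plugin module in its metadata. Logging the whole map shows how many
  # discarded jobs DynamicPruner deletes per run.
  :telemetry.attach(
    "inspect-dynamic-pruner",
    [:oban, :plugin, :stop],
    fn _event, _measurements, meta, _config ->
      if meta.plugin == Oban.Pro.Plugins.DynamicPruner do
        IO.inspect(meta, label: "DynamicPruner stop")
      end
    end,
    nil
  )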

This is a great place to report these types of issues—hopefully it helps some other people in the future. If you have something more sensitive you’re welcome to email us instead :slightly_smiling_face:

Here’s our DynamicPruner config. We added the discarded key to try to get it to stop deleting our discarded jobs immediately. Please advise:

    {
      Oban.Pro.Plugins.DynamicPruner,
      mode: {:max_age, {14, :days}},
      state_overrides: [
        discarded: {:max_age, {1, :month}}
      ],
      worker_overrides: [
        "Jump.AI.LLM.Jobs.StreamedResponse": {:max_age, {3, :day}}
        # ... others just like this ...
      ]
    },

Responded in our email chain. I’ll report back here when we understand what’s going on with pruning discarded jobs for your system.