I’m tracking down an issue with Oban. I have a staging environment that works correctly and a production environment that has issues. When Oban.insert_all is used to insert jobs into a queue, all jobs are started at once, even though that queue is configured with a concurrency of 1. One of the differences between staging and production is that production has multiple worker nodes, which made me doubt our queue configuration. On the web nodes we have the queues set to queues: [] and on the worker nodes set to queues: [default: 2, images: 1, ocr: 1].
What is the correct configuration of Oban with a web/worker nodes setup?

1. Start Oban in application.ex only on the worker nodes; you can still use Oban.insert_all on the web nodes.
2. Start Oban on all nodes (web and worker), but with a different queues configuration on each.
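The second option (Oban on every node, queues only on workers) could be sketched like this. The app name, repo module, and the `WORKER` environment variable are assumptions for illustration; the queue settings come from this thread:

```elixir
# config/runtime.exs — web nodes run no queues, worker nodes run all of them
queues =
  if System.get_env("WORKER") == "true" do
    [default: 2, images: 1, ocr: 1]
  else
    []
  end

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: queues
```

With queues: [], the Oban supervisor still starts, so Oban.insert_all works on web nodes while no jobs execute there.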
If you’re using the free version of Oban, it only has the local_limit option, i.e. the concurrency limit applies per node. With default: 2, each worker node runs up to 2 concurrent jobs in the default queue: with 6 worker nodes you can have 12 concurrent jobs, with 12 nodes 24, and so on.
If you want to use a global_limit, you will probably have to get Oban Pro and use its Smart Engine (docs here: Smart Engine — Oban v2.11.0).
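With Oban Pro’s Smart Engine, a cluster-wide limit could look roughly like this (a sketch; the engine module and global_limit option are from the Smart Engine docs, the app and queue names are placeholders from this thread):

```elixir
config :my_app, Oban,
  engine: Oban.Pro.Queue.SmartEngine,
  repo: MyApp.Repo,
  queues: [
    default: [global_limit: 2],  # at most 2 running across the whole cluster
    images: [global_limit: 1],
    ocr: [global_limit: 1]
  ]
```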
I’m aware that the concurrency limit is a local configuration option. The issue I’m seeing is that when I insert 20 jobs using Oban.insert_all, in my setup with 3 worker nodes and a queue concurrency of 1, one node starts all the jobs at once. What I’d expect to see is 3 active jobs at a time (one per node).
Are you setting up those jobs as unique jobs?
per Oban docs:
The built-in Basic engine doesn’t support unique inserts for insert_all and you must use insert/3 for per-job unique support. Alternatively, the SmartEngine in Oban Pro supports bulk unique jobs and automatic batching.
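Per-job uniqueness with the Basic engine might look like this sketch. The worker module, queue, and unique period are assumptions for illustration:

```elixir
# Hypothetical worker; `unique` options are honored by insert/3,
# but ignored by insert_all/2 with the Basic engine.
defmodule MyApp.OcrWorker do
  use Oban.Worker, queue: :ocr, unique: [period: 60]

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}), do: {:ok, args}
end

# Inserting one at a time applies the unique check
{:ok, _job} = Oban.insert(MyApp.OcrWorker.new(%{path: "scan.pdf"}))
```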
Option 1 won’t work because you need the Oban supervisor for insert_all to work. The second option, where you disable those queues on your other nodes, is the proper way to do it. The Splitting Queues Between Nodes guide describes a solution for this exact situation, and Pro’s DynamicQueues plugin adds other conveniences.
That doesn’t sound right. Queues on a single node won’t run jobs beyond the concurrency limit. However, if they’re running fast enough, the other nodes may be unable to pick up the jobs. You can check where each job ran in its attempted_by field, and exactly when it started and finished with the attempted_at and completed_at timestamps.
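One way to inspect those fields from an IEx session, assuming your repo module is `MyApp.Repo` and the queue name is taken from this thread:

```elixir
import Ecto.Query

from(j in Oban.Job,
  where: j.queue == "images",
  order_by: [asc: j.attempted_at],
  select: map(j, [:id, :state, :attempted_by, :attempted_at, :completed_at])
)
|> MyApp.Repo.all()
```

The attempted_by list includes the node that picked the job up, which should reveal whether one node really ran everything.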
That wouldn’t make any difference. Uniqueness only affects inserting jobs; it has no bearing on how jobs are executed.
Thank you for replying! I’ve used the guide you’ve mentioned to configure the system.
The job takes a few seconds to finish. I copied the data from the oban_jobs table a few days ago (not the best visualisation; nulls are not visible):
It stands out that all attempted_at values have the same timestamp.
Not sure if this is relevant, but in this project the nodes form a cluster, so they can talk to each other, while Oban is configured to use the default LISTEN/NOTIFY PostgreSQL notifier.
I’ve never seen or heard of a queue ignoring the concurrency limit that way. You may be hitting a bug fixed in Oban v2.12.1 caused by subquery instability during a SELECT ... FOR UPDATE.
We upgraded to 2.13.3, but unfortunately we’re still experiencing this issue. It seems related to Oban.insert_all, because we only see it with jobs inserted that way. And maybe only when the queue is empty, because the jobs inserted after that first insert_all run with a concurrency of 1.
I’m trying to get a reproduction. The non-default settings we have in our applications are two databases and the prepare: :unnamed option enabled in Ecto. At the moment I don’t have a test case that reproduces it consistently, but I’ve seen the behaviour a few times.
I think it is not related to the multiple-database Ecto setup. I just saw it again with only the prepare: :unnamed setting enabled and a single database.
It’s common to use prepare: :unnamed; I don’t believe that is the issue.
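For reference, that setting lives in the Ecto repo configuration; app and repo names here are placeholders:

```elixir
config :my_app, MyApp.Repo,
  # Send unnamed prepared statements instead of named ones
  # (commonly used behind connection poolers such as PgBouncer)
  prepare: :unnamed,
  url: System.get_env("DATABASE_URL")
```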
The output you’ve shared doesn’t include any headers, so I can’t tell which timestamp is which. Here is a short script I wrote to verify that queues process jobs sequentially:
```elixir
Application.put_env(:oban, Oban.Test.Repo,
  name: Oban.Test.Repo,
  url: "postgres://localhost:5432/oban_test",
  prepare: :unnamed
)

defmodule SomeWorker do
  use Oban.Worker

  @impl Worker
  def perform(%{args: %{"id" => id}}) do
    IO.puts("JOB #{id} started at #{System.system_time(:millisecond)}")
    Process.sleep(100)
    :ok
  end
end

{:ok, _} = Oban.Test.Repo.start_link()
{:ok, _} = Oban.start_link(queues: [default: 1], repo: Oban.Test.Repo)

1..9
|> Enum.map(&SomeWorker.new(%{id: &1}))
|> Oban.insert_all()

# Keep the script alive long enough for the queue to drain
Process.sleep(1_500)
```
That generates this output, showing that jobs are started sequentially:
JOB 1 started at 1663158893513
JOB 2 started at 1663158893630
JOB 3 started at 1663158893747
JOB 4 started at 1663158893861
JOB 5 started at 1663158893986
JOB 6 started at 1663158894108
JOB 7 started at 1663158894229
JOB 8 started at 1663158894344
JOB 9 started at 1663158894458
The oban_jobs table shows that the attempted_at and completed_at times are staggered as expected (note it is only showing the time and not the date):
Getting the issue reproduced is hard, but I’m still trying to narrow down where it comes from. I’ve changed the insert_all to an Enum.each with Oban.insert; the issue is still there. I’ve attached a screenshot of the database: the attempted_at timestamp is the same for all executing jobs.
Doubtful; the limit is validated before starting a queue or updating the value on a running queue. Unless you’re using :sys.replace_state/2 to meddle with the queue’s producer, there’s no way to get an invalid limit. To verify the configuration, you can get the queue’s current state with Oban.check_queue/1.
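For example, from a remote console on the affected node (queue name taken from this thread):

```elixir
Oban.check_queue(queue: :images)
# Returns a map describing the producer's state, including :limit,
# :paused, :running (ids of currently executing jobs), :node, and
# :started_at / :updated_at timestamps.
```

If :limit comes back as 1 while many jobs show the same attempted_at, that would confirm the limit itself is configured correctly.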
Does it always run the same number of jobs simultaneously? From your screenshot, it looks like there are 24 with the same timestamp.
Right now I have 7 jobs with the same timestamp executing, but I don’t see a pattern in the count. It seems to happen when the queue is empty and new jobs are enqueued; if new jobs are inserted while the queue is already working on jobs, they are executed one by one.
Is there anything else I can investigate? I hit this situation multiple times per day in our production environment, so I could deploy some extra debug logging, for example.