I am trying to test oban job retries locally, my perform job returns {:error, %{}
, and I see the job is in a retryable state, but it never runs and attempts is still 1.
oban: “~> 2.15.0”
oban_pro: “~> 0.14”
oban_web: “~> 2.6”
I am trying to test oban job retries locally, my perform job returns {:error, %{}
, and I see the job is in a retryable state, but it never runs and attempts is still 1.
oban: “~> 2.15.0”
oban_pro: “~> 0.14”
oban_web: “~> 2.6”
That’s not much info to help you debug. Is this in the development environment? Can you verify that the queue is running with Oban.check_queue(queue: "name_of_queue")
?
Yes this is my local test environment. The queue is running, tries to run the job once and when it errors, just stays in a retryable state.
Oban.check_queue(Manifold.Oban, queue: :delivery_destination_queue) %{ name: "Manifold.Oban", node: "M1-Max", running: [], queue: "delivery_destination_queue", started_at: ~U[2024-08-02 19:54:57.208154Z], updated_at: ~U[2024-08-02 19:55:20.222256Z], global_limit: nil, local_limit: 1, paused: false, rate_limit: nil, retry_attempts: 5, retry_backoff: 1000 }
Is there a custom backoff on the job? When is the scheduled_at
timestamp for it to retry?
no custom backoff
inserted_at: 2024-08-05 15:39:11.287173
scheduled_at: 2024-08-05 15:39:28.327155
attempted_at: 2024-08-05 15:39:11.309171
completed_at: NULL
attempted_by: "{M1-Max,0726da66-d47b-47fd-bc5e-1db1ec3a7e85}"
discarded_at: NULL
priority: 3
tags: {}
meta: "{""span_context"": null}"
cancelled_at: NULL
state: retryable
And this is still happening, even after a reset? Please share the output of Oban.config()
, feel free to omit queue names or cron workers.
Side note—there are many reliability fixes and diagnostic improvements in more recent versions of Oban (not to mention Pro and Web). If possible, I highly recommend upgrading.
iex(1)> Oban.config(Manifold.Oban)
%Oban.Config{
dispatch_cooldown: 5,
engine: Oban.Pro.Queue.SmartEngine,
get_dynamic_repo: nil,
log: false,
name: Manifold.Oban,
node: "M1-Max",
notifier: Oban.Notifiers.Postgres,
peer: Oban.Peers.Disabled,
plugins: [
{Oban.Web.Plugins.Stats, []},
{Oban.Pro.Plugins.DynamicLifeline, [rescue_interval: 300000]},
{Oban.Plugins.Gossip, [interval: 5000]},
{Oban.Pro.Plugins.DynamicPruner,
[
limit: 100000,
mode: {:max_age, {2, :days}},
queue_overrides: [
delete_expired: {:max_age, {1, :day}},
upload_worker: {:max_age, {1, :day}},
file_queue: {:max_age, {1, :day}},
delivery_destination: {:max_age, {3, :day}},
]
]}
],
prefix: "public",
queues: [
delivery_destination: [limit: 1],
periodic_queue: [limit: 1],
process: [limit: 1]
],
repo: Engine.Repo,
shutdown_grace_period: 15000,
stage_interval: 1000,
testing: :disabled
}
Also as an FYI, if I add schedule_in
, to the job option, that job also is never attempted.
See if your stager is alive with Oban.Registry.whereis(Oban, Oban.Stager)
. If it is alive, add this telemetry to see if it reports any errors or other information.
:telemetry.attach_many(:debug, [[:oban, :plugin, :exception], [:oban, :plugin, :stop]], fn _, time, meta, _ ->
if meta.plugin == Oban.Stager, do: IO.inspect({time, Map.delete(meta, :conf)})
end, nil)
iex(2)> Oban.Registry.whereis(Manifold.Oban, Oban.Stager)
#PID<0.3316.0>
iex(3)> :telemetry.attach_many(:debug, [[:oban, :plugin, :exception], [:oban, :plugin, :stop]], fn _, time, meta, _ ->
...(3)> if meta.plugin == Oban.Stager, do: IO.inspect({time, Map.delete(meta, :conf)})
...(3)> end, nil)
20:08:17.173 level=[info] The function passed as a handler with ID :debug is a local function.
This means that it is either an anonymous function or a capture of a function without a module specified. That may cause a performance penalty when calling that handler. For more details see the note in `telemetry:attach/4` documentation.
https://hexdocs.pm/telemetry/telemetry.html#attach/4 line=112 pid=<0.3846.0> file=/Users/hyang/cars/cars_platform/deps/telemetry/src/telemetry.erl mfa=:telemetry.attach_many/4
:ok
{%{monotonic_time: -576460289938607883, duration: 2329583},
%{
telemetry_span_context: #Reference<0.3291262840.1439170561.97292>,
plugin: Oban.Stager,
staged_count: 0,
staged_jobs: []
}}
{%{monotonic_time: -576460289931805058, duration: 1137269},
%{
telemetry_span_context: #Reference<0.3291262840.1439170561.97305>,
plugin: Oban.Stager,
staged_count: 0,
staged_jobs: []
}}
{%{monotonic_time: -576460289904803716, duration: 2169496},
%{
telemetry_span_context: #Reference<0.3291262840.1439170561.97316>,
plugin: Oban.Stager,
staged_count: 0,
staged_jobs: []
}}
{%{monotonic_time: -576460288931119893, duration: 1765531},
%{
telemetry_span_context: #Reference<0.3291262840.1439170561.97352>,
plugin: Oban.Stager,
staged_count: 0,
staged_jobs: []
}}
{%{monotonic_time: -576460288929637285, duration: 1357398},
%{
telemetry_span_context: #Reference<0.3291262840.1439170562.51763>,
plugin: Oban.Stager,
staged_count: 0,
staged_jobs: []
}}
{%{monotonic_time: -576460288902098684, duration: 1741655},
%{
telemetry_span_context: #Reference<0.3291262840.1439170562.51768>,
plugin: Oban.Stager,
staged_count: 0,
staged_jobs: []
}}
{%{monotonic_time: -576460287925900192, duration: 3010885},
%{
telemetry_span_context: #Reference<0.3291262840.1439170561.97430>,
plugin: Oban.Stager,
staged_count: 0,
staged_jobs: []
}}
It’s not staging anything, which means it can’t see any jobs within the time window. Will you output the console time (DateTime.utc_now()
) and the scheduled_at
timestamp from a scheduled job in the database?
You can also try running this:
Oban.Engine.stage_jobs(Oban.config(), Oban.Job, limit: 5000)
And see if it returns a count.
iex(15)> DateTime.utc_now()
~U[2024-08-06 15:38:51.337192Z]
iex(16)> Oban.Engine.stage_jobs(Oban.config(Manifold.Oban), Oban.Job, limit: 5000)
{:ok, [%{id: 350, state: "scheduled", queue: "delivery_destination"}]}
iex(17)>
job is scheduled for: 2024-08-06 15:38:48.665877
The 1 stage_job
we are seeing here is a different job. We send 2 emails and each one goes to a different location. 1 of those email destinations, I am using schedule_in
or failing on purpose, and that is the one that never runs or retries.
After the first email job is completed, the stage_job command returns an empty list:
iex(15)> DateTime.utc_now()
~U[2024-08-06 15:38:51.337192Z]
iex(16)> Oban.Engine.stage_jobs(Oban.config(Manifold.Oban), Oban.Job, limit: 5000)
{:ok, [%{id: 350, state: "scheduled", queue: "delivery_destination"}]}
iex(17)> Oban.Engine.stage_jobs(Oban.config(Manifold.Oban), Oban.Job, limit: 5000)
{:ok, []}
iex(18)>
What’s the attempt
and max_attempts
for that job?
attempt is 0 and max is 3
Nothing appears to be wrong and I don’t see any reason why a scheduled job wouldn’t be ran. Staging is clearly working, as demonstrated by the stage call.
Will you run through this set of queries and see if it matches what you expect?
import Ecto.Query
# Check how many are scheduled or retryable
Oban.Job |> where([j], j.state in ~w(scheduled retryable)) |> select(count()) |> Repo.one()
# Add the queue check
Oban.Job
|> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue))
|> select(count())
|> Repo.one()
# Find the max time
Oban.Job
|> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue))
|> select([j], max(j.scheduled_at)))
|> Repo.one()
# Add the time check
Oban.Job
|> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue))
|> where([j] j.scheduled_at <= ^DateTime.utc_now())
|> select(count())
|> Repo.one()
iex(35)> import Ecto.Query
Ecto.Query
iex(36)> alias Engine.Repo
Engine.Repo
iex(37)> Oban.Job |> where([j], j.state in ~w(scheduled retryable)) |> select(count()) |> Repo.one()
34
iex(38)> Oban.Job |> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue)) |> select(count()) |> Repo.one()
34
iex(39)> Oban.Job |> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue)) |> select([j], max(j.scheduled_at)) |> Repo.one()
~U[2024-08-07 17:04:37.664335Z]
iex(40)> Oban.Job |> where([j], j.state in ~w(scheduled retryable) and not is_nil(j.queue)) |> where([j], j.scheduled_at <= ^DateTime.utc_now()) |> select(count()) |> Repo.one()
34
iex(41)>
So there are 34 scheduled jobs that are ready to go? If you run stage_jobs
as you did earlier, does it stage any of them?
That’s approximately the same query that stage_jobs
uses, so I wouldn’t expect any difference.
looks like it
iex(62)> Oban.Engine.stage_jobs(Oban.config(Manifold.Oban), Oban.Job, limit: 5000)
{:ok,
[
%{id: 1, state: "scheduled", queue: "recalc"},
%{id: 2, state: "scheduled", queue: "recalc"},
%{id: 3, state: "scheduled", queue: "recalc"},
%{id: 4, state: "scheduled", queue: "recalc"},
%{id: 5, state: "scheduled", queue: "recalc"},
%{id: 6, state: "scheduled", queue: "recalc"},
%{id: 7, state: "scheduled", queue: "recalc"},
%{id: 13, state: "scheduled", queue: "recalc"},
%{id: 14, state: "scheduled", queue: "recalc"},
%{id: 16, state: "scheduled", queue: "recalc"},
%{id: 17, state: "scheduled", queue: "recalc"},
%{id: 18, state: "scheduled", queue: "recalc"},
%{id: 19, state: "scheduled", queue: "recalc"},
%{id: 20, state: "scheduled", queue: "recalc"},
%{id: 25, state: "scheduled", queue: "recalc"},
%{id: 26, state: "scheduled", queue: "recalc"},
%{id: 27, state: "scheduled", queue: "recalc"},
%{id: 28, state: "scheduled", queue: "recalc"},
%{id: 29, state: "scheduled", queue: "recalc"},
%{id: 30, state: "scheduled", queue: "recalc"},
%{id: 31, state: "scheduled", queue: "recalc"},
%{id: 32, state: "scheduled", queue: "recalc"},
%{id: 33, state: "scheduled", queue: "recalc"},
%{id: 34, state: "scheduled", queue: "recalc"},
%{id: 35, state: "scheduled", queue: "recalc"},
%{id: 36, state: "scheduled", queue: "recalc"},
%{id: 37, state: "scheduled", queue: "recalc"},
%{id: 38, state: "scheduled", queue: "recalc"},
%{id: 39, state: "scheduled", queue: "recalc"},
%{id: 40, state: "scheduled", queue: "recalc"},
%{id: 42, state: "scheduled", queue: "process_status"},
%{id: 43, state: "retryable", queue: "delivery_destination"},
%{id: 47, state: "scheduled", queue: "process_status"},
%{id: 52, state: "scheduled", queue: "process_status"}
]}
iex(63)>
@houayang did you find a resolution to this issue? I’ve found myself dealing with a similar issue. Following this thread and trying things as prescribed gets me similar results.
If I use Oban.Engine.stage_jobs(Oban.config(), Oban.Job, limit: 5000)
directly the jobs are staged but otherwise they hang out.
Which version of Oban and what notifier are you using? Newer versions will emit a warning in the logger when nodes aren’t connected, and should fall back to a local polling mechanism.